3.1. Object Detection in Forestry
Because of their potential to give vital insights into forest management and conservation, object detection and segmentation in forestry have become essential topics of research. Much research in recent years has focused on creating and evaluating algorithms and strategies for detecting and segmenting trees and other objects in forest images. This literature review explores ten works that have contributed to the topic.
The authors of the paper “Tree species classification of forest stands using multisource remote sensing data” [37] worked on creating a system that could identify tree species automatically using deep learning algorithms, with the goal of making the system available on mobile devices. They detected tree leaves in images and used them to categorize the tree species. To separate the leaves from the images, the authors used a U-Net architecture, a deep-learning model popularized by medical image segmentation [32]. The U-Net model comprises two networks: an encoder that captures image features and a decoder that generates the segmentation map. The authors employed two categories for the segmentation task, “leaf” and “background”, and trained the U-Net on a dataset of 9,000 tree leaf images manually annotated with their respective species labels.
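To make the encoder-decoder structure concrete, here is a minimal two-class U-Net sketch in PyTorch. It is not the authors' implementation; the depth, channel widths, and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, num_classes=2):  # "leaf" and "background"
        super().__init__()
        self.enc1 = double_conv(3, 32)
        self.enc2 = double_conv(32, 64)
        self.bottleneck = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)   # 64 skip + 64 upsampled channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)    # 32 skip + 32 upsampled channels
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                       # encoder captures features
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # decoder
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # rebuilds map
        return self.head(d1)                    # per-pixel class logits

# Sanity check: per-pixel logits keep the spatial size of the input.
logits = MiniUNet()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 256, 256])
```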
After segmenting the leaves with the U-Net, the authors classified the tree species using VGG16, a CNN pre-trained for computer vision tasks. The VGG16 model was fine-tuned using the segmented leaves and their respective species labels. The classification task was performed on a dataset of 10,000 tree images covering 20 different species.
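As a rough sketch of this kind of fine-tuning (assumed optimizer settings, not the authors' code), a pre-trained VGG16 classifier head can be swapped for a 20-species output:

```python
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="DEFAULT")      # ImageNet pre-trained weights
for p in vgg.features.parameters():
    p.requires_grad = False                # freeze convolutional features
vgg.classifier[6] = nn.Linear(4096, 20)    # new head for 20 tree species

# Only the (new) classifier parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(vgg.classifier.parameters(), lr=1e-4)
```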
The authors tested the system on a dataset of 900 tree images belonging to five distinct species and reported a species classification accuracy of 93.3%.
The paper does not mention whether the authors have made their code available. The TensorFlow Lite framework was used to deploy the model on mobile devices, indicating that the method could be implemented with this framework.
Lagos et al. [16] introduced the “FinnWoodlands Dataset” in their academic article, a dataset tailored for image analysis in the setting of Finnish woodlands. The authors’ primary focus was on segmentation tasks, and they provided valuable insights into the classes utilized and the specific objects that were segmented.
The article lacks clarity on whether the segmentation process was limited to trees or extended to encompass other objects present in the woodland area. Given the context of the dataset and the authors’ strong emphasis on image analysis in forests, it can be inferred that the segmentation task involved identifying and classifying various elements in the Finnish forests, such as trees, plants, leaves, the ground, and potentially other pertinent components.
Panoptic segmentation [14] is a relatively recent advancement in computer vision, and the work predates its release. It unifies semantic and instance segmentation: every pixel receives a class label, and pixels belonging to countable objects also receive an instance identity, so a machine can distinguish one tree from the entire forest rather than merely labeling pixels as “tree”. This makes the technique particularly useful for tree/forest segmentation, improving the detection and delineation of individual trees even in tightly packed stands, and its real-time use has potential applications from forest management to autonomous navigation. It has drawbacks, however: it demands substantial processing resources, which may be a barrier for some applications; integrating the semantic and instance branches remains an open research problem; and performance can degrade in difficult settings, such as poor lighting or when trees have similar shapes and sizes. Panoptic segmentation is thus a game changer for tree/forest segmentation, but like any technology it has its own set of obstacles.
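As a toy illustration of the output format only (not drawn from the reviewed work), a panoptic label map can be assembled by overlaying hypothetical instance masks on a semantic map, so each pixel carries both a class label and an instance id:

```python
import numpy as np

H, W = 4, 6
semantic = np.zeros((H, W), dtype=np.int32)          # 0 = "stuff" (e.g. ground)
tree1 = np.zeros((H, W), dtype=bool); tree1[0:2, 0:2] = True   # hypothetical mask
tree2 = np.zeros((H, W), dtype=bool); tree2[2:4, 3:6] = True   # another tree

# Panoptic map: one channel for the class label, one for the instance id.
panoptic = np.zeros((H, W, 2), dtype=np.int32)
panoptic[..., 0] = semantic
for inst_id, mask in enumerate([tree1, tree2], start=1):
    panoptic[mask, 0] = 1        # class 1 = "tree" (a countable "thing" class)
    panoptic[mask, 1] = inst_id  # distinct id per tree

print(np.unique(panoptic[..., 1]))  # [0 1 2]: ground plus two separate trees
```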
As a result, there is no precise information available on how panoptic segmentation was utilized, or whether it was employed at all, to discriminate tree species or other items in the dataset. The paper likewise lacks explicit information on how the authors distinguished between tree types. It is plausible that the differentiation was achieved using visual attributes such as shape, texture, and color, or a combination of these characteristics, but the precise classification approach has not been revealed.
The methods, models, or frameworks employed for the segmentation tasks are not explicitly referenced in the paper. Given the characteristics of the dataset, it is conceivable that the researchers used traditional image analysis and computer vision techniques. The code is available on GitHub.
Nevalainen et al. [26], in their paper “Individual tree detection and classification with UAV-based photogrammetric point clouds and hyperspectral imaging”, offer a novel deep learning strategy for identifying single-tree species in densely forested regions using hyperspectral data. The approach analyzes images acquired over a semideciduous forest in the Brazilian Atlantic biome using 25 spectral bands spanning 506 to 820 nm. A band combination selection step, feature map extraction, and multi-stage refinement of the confidence map are all part of the network’s design. In a complex forest, the technique obtained state-of-the-art performance for recognizing and geolocating each tree species in UAV-based hyperspectral images, outperforming a principal component analysis (PCA) baseline.
Within the network’s design, the authors estimate the combination of hyperspectral bands that contributes most to the given goal. The study proposes a deep-learning algorithm for hyperspectral imaging that recognizes and geolocates single-tree species in a tropical forest [21]. The strategy is intended to cope with crowded scenes and the Hughes effect, the degradation of classification performance as spectral dimensionality grows relative to the number of training samples. By estimating the most informative band combination within the network itself, the architecture decreases noise and improves performance on the given task. The technique proved successful under different scenarios, and the network’s performance is commensurate with past deep learning studies.
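One simple way to realize such a learnable band weighting, shown purely as a sketch (the paper's actual band-selection layer may differ), is a softmax-normalized weight per spectral band trained jointly with the rest of the network:

```python
import torch
import torch.nn as nn

class BandWeighting(nn.Module):
    def __init__(self, num_bands=25):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_bands))  # learnable logits

    def forward(self, x):            # x: (batch, bands, height, width)
        w = torch.softmax(self.scores, dim=0)               # band importances
        return x * w.view(1, -1, 1, 1)                      # reweight bands

cube = torch.randn(2, 25, 64, 64)    # dummy 25-band hyperspectral patch
out = BandWeighting(25)(cube)
print(out.shape)                     # torch.Size([2, 25, 64, 64])
```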
The suggested approach may be used to detect Syagrus romanzoffiana, a palm tree important for forest regeneration, and can also support wildlife investigations, for example studies of tapirs, animals that eat palm fruits and disperse the seeds through their excrement. This work provides a deep-learning algorithm based on a CNN architecture for detecting single-tree species in high-dimensional hyperspectral UAV-based images. The strategy includes a band selection step in the first phase, which was effective for dealing with high dimensionality and outperformed both the baseline method that considered all 25 spectral bands and the PCA approach. Following the CNN backbone, feature map extraction and a multi-stage refinement of the confidence map are performed.
The suggested technique performed exceptionally well at recognizing and geolocating trees in UAV-based hyperspectral images, with f-measure, precision, and recall values of 0.959, 0.973, and 0.945, respectively. The method is useful for monitoring forest environments while accurately identifying specific trees. The use of hyperspectral cameras on UAVs or aircraft to detect bark beetle damage in urban forests at the individual tree level has also piqued researchers’ curiosity. Peng et al. [31], in their paper “Densely based multi-scale and multi-modal fully convolutional networks for high-resolution remote-sensing image semantic segmentation”, employed convolutional neural networks, weighted and conventional support vector machines, and random forests (RF) to classify tree species using hyperspectral and photogrammetric data. Deep learning in remote sensing has produced numerous advances, including the detection of fir trees damaged by bark beetles in unmanned aerial vehicle images, oil palm tree detection and counting in high-resolution remote sensing images, and the use of deep convolutional networks for large-scale image recognition.
The use of a WorldView-2/3 and LiDAR data fusion technique, as well as the application of convolutional neural networks for the simultaneous extraction of roads and buildings in remote sensing imagery, have also been investigated [13]. Deep learning’s application to remote sensing data processing has likewise given rise to new applications and problems in the area. The work addresses numerous remote sensing investigations, such as the processing and evaluation of spectrometric and stereoscopic images, radiometric correction of close-range spectral image blocks, and improving object counts using heatmap regulation. It also covers applications such as automated land cover categorization, land use mapping, change detection, and forest inventory. Deep learning algorithms are constantly being developed, which has greatly improved the accuracy and efficiency of these remote sensing tasks [16].
The method was tested on two datasets: one with a single multispectral image and the other with a series of images taken over time [40]. The authors achieved high forest segmentation accuracy on both datasets. The study did not involve the segmentation of individual trees or the differentiation of tree species, although the suggested technique may have the potential to classify tree species in future research.
The authors performed the forest segmentation using a U-Net CNN architecture whose weights were initialized via transfer learning from a pre-trained VGG-16 network. The network was implemented using the Keras deep learning framework. The code for the method was not included in the paper, but the authors shared information about the software and hardware used in their experiments.
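A minimal sketch of this initialization pattern in Keras follows; the input size, decoder widths, and single-channel output are assumptions for illustration rather than details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.VGG16(include_top=False,
                                   weights="imagenet",
                                   input_shape=(256, 256, 3))
# VGG16 blocks supply the skip connections for the decoder.
skips = [base.get_layer(n).output
         for n in ("block1_conv2", "block2_conv2",
                   "block3_conv3", "block4_conv3")]
x = base.get_layer("block5_conv3").output            # deepest encoder features

for skip in reversed(skips):                         # decoder path
    x = layers.Conv2DTranspose(skip.shape[-1], 2, strides=2,
                               padding="same")(x)    # upsample
    x = layers.Concatenate()([x, skip])              # skip connection
    x = layers.Conv2D(skip.shape[-1], 3, padding="same",
                      activation="relu")(x)

out = layers.Conv2D(1, 1, activation="sigmoid")(x)   # forest / non-forest map
model = Model(base.input, out)
model.summary()
```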
The paper by Chen et al. [4], “Individual Tree Species Identification Based on a Combination of Deep Learning and Traditional Features”, aimed to classify tree species in a study region using machine learning algorithms on UAV-based hyperspectral data. The authors did not use a fine-grained categorization of the data; instead, they employed supervised learning to categorize the tree species based on the spectral properties of the UAV-based hyperspectral data.
For their investigation, the authors chose six tree species: Holm oak, Cork oak, Stone pine, Eucalyptus, Maritime pine, and Acacia, which served as the target classes in the categorization procedure. The authors used two feature selection methods to extract features from the hyperspectral data: the Sequential Forward Selection (SFS) algorithm and the Mutual Information (MI) algorithm. To categorize the tree species, the authors utilized a variety of machine learning methods, including Random Forest, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN), and compared the algorithms’ performance using several assessment measures, such as overall accuracy, precision, recall, and F1 score.
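The workflow can be approximated with scikit-learn stand-ins, sketched below under assumed data and parameter values (synthetic features replace the real hyperspectral measurements):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SequentialFeatureSelector,
                                       SelectKBest, mutual_info_classif)
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for per-crown hyperspectral features across six species.
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Sequential forward selection (wrapper) and mutual information (filter).
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=10)
mi = SelectKBest(mutual_info_classif, k=10)

for selector in (sfs, mi):
    Xs_tr = selector.fit_transform(X_tr, y_tr)
    Xs_te = selector.transform(X_te)
    for clf in (RandomForestClassifier(random_state=0),
                SVC(), KNeighborsClassifier()):
        clf.fit(Xs_tr, y_tr)
        print(type(selector).__name__, type(clf).__name__)
        # Report includes precision, recall, and F1 per class.
        print(classification_report(y_te, clf.predict(Xs_te), digits=3))
```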
In terms of code availability, the authors did not specifically state whether code for their approach is available. They did, however, note that they used the R statistical software and associated packages for data processing and analysis, implying that their technique might be implemented using these tools.
“Assessing potential of UAV multispectral imagery for estimation of AGB and carbon stock in conifer forest over UAV RGB imagery” by Gaden [6] classified various tree species by segmenting individual trees from Very High Resolution (VHR) RGB imagery. The author employed a U-Net CNN architecture to segment trees and a ResNet-50 CNN to classify species.
The VHR RGB imagery was analyzed using the U-Net CNN architecture to identify trees through image segmentation. Although VHR images can provide high-precision measurements for classification techniques, they typically contain clouds and cast shadows, which create issues for reliable information extraction. This type of CNN is frequently employed for such tasks [15].
The ResNet-50 CNN model was used to classify each segmented tree into one of six tree species. A set of features extracted from the RGB image of each segmented tree was fed into the ResNet-50 model as input. The ResNet-50 model was pre-trained on a large dataset of natural images and then fine-tuned on the VHR RGB imagery dataset via transfer learning. The paper does not include code, but the authors give a thorough explanation of their approach and findings.
The authors of the publication “Forest segmentation using a combination of multi-scale features and deep learning” [6] aimed to segment forests in high-resolution remote sensing images. The imagery was divided into two classes, forest and non-forest; the authors did not distinguish between different types of trees within the forest category.
The method employed by the authors combined multi-scale features with deep learning: segmentation accuracy was improved by a deep learning framework that incorporated multi-scale image features. The authors used the Faster R-CNN model, a well-known deep-learning object detection framework, which detected objects at different scales by extracting multi-scale features from the input image using a pyramid scheme. The specific pyramid scheme was not explicitly mentioned in the paper.
Pyramid schemes are approaches used in computer vision and image processing to extract multi-scale features from an input image. They rely on image pyramids, sequences of scaled-down copies of the original image at different resolutions; each level of the pyramid represents the image at a different size, with higher levels having coarser resolution.
Pyramid methods capture information at multiple scales, allowing algorithms to evaluate images at various levels of detail. They help address the problem of identifying objects of varying sizes and of dealing with objects that appear at different scales within a picture.
Pyramid schemes commonly employed in computer vision include Gaussian pyramids, Laplacian pyramids, and steerable pyramids.
Gaussian Pyramids: By repeatedly applying a Gaussian filter to the source image and subsampling it, this method produces a Gaussian pyramid. Each level of the pyramid represents the image at a different scale and resolution.
Laplacian Pyramids: The Laplacian pyramid is derived from the Gaussian pyramid. Each level of the Laplacian pyramid represents the details, or residuals, between the corresponding level of the Gaussian pyramid and its upsampled counterpart. It aids in capturing fine details in the image.
Steerable Pyramids: Steerable pyramids are multi-scale representations that extract information from different orientations and scales using a bank of filters. They are especially effective for analyzing images containing objects at various orientations.
Pyramid methods are used in a variety of computer vision tasks, including object identification, image segmentation, and feature extraction. Algorithms that extract multi-scale features can effectively handle objects of varying sizes and capture both fine-grained details and wider context information in the image.
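A brief OpenCV example makes the Gaussian and Laplacian constructions concrete; the image size and number of levels are arbitrary choices:

```python
import cv2
import numpy as np

img = (np.random.rand(512, 512, 3) * 255).astype(np.uint8)  # stand-in image

# Gaussian pyramid: blur + downsample repeatedly.
gauss = [img]
for _ in range(3):
    gauss.append(cv2.pyrDown(gauss[-1]))

# Laplacian pyramid: residual between a level and the upsampled next level.
lap = [cv2.subtract(g, cv2.pyrUp(g_next, dstsize=g.shape[1::-1]))
       for g, g_next in zip(gauss[:-1], gauss[1:])]

for level, residual in enumerate(lap):
    print(level, residual.shape)   # (512,512,3), (256,256,3), (128,128,3)
```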
You can find the code for this paper on GitHub. Python and the PyTorch deep learning framework were used to implement the code.
Stan et al. [36] utilized deep convolutional neural networks to perform semantic segmentation of forest regions in their paper, “Semantic Segmentation of Forest Regions Using Deep Convolutional Neural Networks”. The goal was to segment various areas of the forest, including trees, roads, water bodies, and other types of land cover. The authors utilized two distinct datasets: the National Agriculture Imagery Program and the Spatio-Temporal Asset Catalog.
The study included six categories: tree, road, building, water, field, and others. The authors utilized spectral clustering, a method that groups pixels with comparable spectral features, to differentiate between various tree species.
The authors utilized the U-Net deep CNN architecture for semantic segmentation. U-Net consists of an encoder-decoder structure whose skip connections help preserve spatial information, complemented here by dropout layers. Furthermore, the authors employed data augmentation methods to expand their training dataset and prevent overfitting. The paper does not discuss the availability of code.
Ma et al. [23] suggested a method for automatically segmenting Terrestrial LiDAR Data (TLD) to distinguish individual trees in their paper, “Automated extraction of driving lines from mobile laser scanning point clouds”. The researchers separated the trees in the TLD on an individual basis but did not identify the specific species of each tree. The research segmented the trunks and branches of trees, along with the nearby plants. The segmentation approach involved two steps, region growing and convex hull fitting: the point cloud was segmented into regions, and convex hulls were then fitted to obtain the tree structure.
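A simplified sketch of the two-step idea, using generic stand-ins rather than the authors' algorithm (the 0.5 m growth radius is an assumption), might look like this:

```python
import numpy as np
from scipy.spatial import ConvexHull, cKDTree

def region_grow(points, radius=0.5):
    """Label points so that points within `radius` of each other share a region."""
    tree = cKDTree(points)
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:                      # grow outward from the seed point
            idx = stack.pop()
            for nb in tree.query_ball_point(points[idx], radius):
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels

pts = np.vstack([np.random.rand(100, 3),          # one "tree" cluster
                 np.random.rand(100, 3) + 5.0])   # another, 5 m away
labels = region_grow(pts)
for region in np.unique(labels):
    hull = ConvexHull(pts[labels == region])      # fit a hull per region
    print(f"region {region}: {np.sum(labels == region)} pts, "
          f"hull volume {hull.volume:.2f}")
```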
Ma et al. [23] evaluated their technique on a variety of datasets with varying levels of complexity; one of the datasets, for example, contained trees with overlapping canopies. The method’s performance was assessed using several quality criteria, including completeness and correctness, to evaluate the accuracy and effectiveness of the segmentation results. Furthermore, the researchers compared their method to other sophisticated techniques from the literature, although the specific techniques compared were not specified.
Ma et al. [23] demonstrated substantial accuracy in segmenting individual trees from Terrestrial LiDAR data by comparing their method’s completeness and correctness measures to those of other advanced algorithms. The authors did not make their code publicly available for this paper.
The study “Semantic segmentation of remote-sensing imagery using heterogeneous big data: International Society for Photogrammetry and Remote Sensing Potsdam and Cityscape datasets” was conducted by Song & Kim [35]. The authors segmented the aerial imagery to isolate tree crowns and then categorized each crown based on various forest inventory characteristics, such as tree species, height, diameter at breast height, and crown width. They separated both the trees and the vegetation in the surrounding background and understory.
The authors identified different tree species by analyzing spectral and spatial features obtained from the segmented tree crowns. The researchers utilized a deep-learning U-Net whose encoder weights were initialized via transfer learning from a pre-trained VGG-16 network, which helped to enhance the model’s performance.
The authors have made the work’s code available as open source on GitHub. The code contains the U-Net architecture implementation, pre-processing procedures, and scripts for training and testing.
The authors of the study “The Semantic Segmentation of Standing Tree Images Based on the Yolo V7 Deep Learning Algorithm” by Cao et al. [2] provided a thorough method for semantic segmentation of standing tree images with the aim of differentiating between different tree types. Instead of segmenting trees generically, the study concentrated on dividing tree areas into distinct tree species.
The authors employed the YOLO V7 deep learning algorithm, a widely used approach known for its effectiveness and precision in object identification tasks, to perform the segmentation and classification. Using the YOLO V7 network, the input images were first pre-processed, then segmented, and finally the resulting tree areas were classified into the appropriate species.
The classes employed in this study covered numerous tree species pertinent to the geographical region under examination. Although the authors did not state how many classes they intended to have, it is clear that they wanted to include a wide variety of tree species. The YOLO V7 algorithm made it easier to distinguish between tree types based on the distinctive traits of each species, such as bark texture, leaf shape, branching patterns, and general morphology. In terms of methodologies, models, and frameworks, the YOLO V7 deep learning algorithm was the authors’ main segmentation and classification tool, supplemented by strategies such as data augmentation and transfer learning to improve model performance. The paper does not specify the particular implementation frameworks or further details.
Regarding code accessibility, the authors have not made the research’s source code available to the general public.
3.2. Semantic Segmentation in Forestry
The study by Lim, Zulkifley et al. (2023), “Attention-Based Semantic Segmentation Networks for Forest Applications”, developed and tested an optimal attention-embedded high-resolution segmentation network, HRNet + CBAM, to classify forest and non-forest areas in Malaysia. The data were gathered as Landsat-8 satellite images from ten locations in Malaysia from 2016, 2018, and 2020 [19].
The images were manually annotated for efficient training of the model, and the dataset was split into 80% for training and 20% for testing. Hyperparameters such as the learning rate and optimizer were tuned for the baseline HRNet model, which achieved a mean Intersection over Union (mIoU) of 84.84%, an accuracy of 91.81%, and a loss of 0.6142. Embedding the Convolutional Block Attention Module (CBAM) into HRNet improved performance to 92.24% accuracy and 85.58% mIoU, with a loss of 0.6770. When benchmarked against other models such as U-Net, SegNet, and FC-DenseNet, HRNet and HRNet + CBAM performed better in terms of precision and mIoU [19].
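For concreteness, the following is a hedged PyTorch sketch of a CBAM block, with channel attention followed by spatial attention; the reduction ratio is a common default, and the paper's exact integration into HRNet is not reproduced.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: 7x7 conv over channelwise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca                                   # channel refinement
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                # spatial refinement

feat = torch.randn(1, 64, 32, 32)
print(CBAM(64)(feat).shape)   # torch.Size([1, 64, 32, 32])
```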
Nevertheless, neither the availability of the code nor the specific framework used to create these models is specified. To manage huge datasets, the paper recommends employing more data covering land cover beyond forests, trying different attention mechanisms in various architectures, and exploring higher-end GPUs or alternative data loading methods.
In “Semantic Segmentation Network Slimming and Edge Deployment for Real-Time Forest Fire or Flood Monitoring Systems Using Unmanned Aerial Vehicles” by Lee, Jung et al. (2023), an innovative approach is presented for employing drones outfitted with cutting-edge deep learning models to monitor forest fires and floods in real time. Through the application of semantic segmentation models such as DeepLabV3 and DeepLabV3+, the system effectively identifies and demarcates affected regions in UAV-captured data. The primary advancement of the study is the use of channel pruning-based network slimming, which drastically lowers model size and computing requirements without sacrificing accuracy [17].
The results indicate an mIoU of 88.29% on the FLAME dataset for identifying and delineating regions affected by forest fires, and an mIoU of 94.15% on the FloodNet dataset for identifying flooded areas; higher mIoU values signify better segmentation accuracy.
The slimmed models exhibit minimal performance loss compared to the baseline networks while achieving a remarkable 20-fold increase in inference speed. Moreover, the roughly 90% reduction in model size and computational requirements not only enhances processing efficiency but also cuts power consumption, prolonging drone endurance. This paves the way for effective and energy-efficient UAV-based monitoring systems for mitigating natural disasters such as floods and forest fires, safeguarding lives and ecosystems with real-time insights.
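The core selection step behind this kind of channel pruning can be sketched as follows, assuming BatchNorm scale factors are used to rank channels (the paper's precise slimming procedure may differ); rebuilding the full network around the kept channels is omitted.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, 3, padding=1)
bn = nn.BatchNorm2d(64)

keep_ratio = 0.5                                     # prune half the channels
gamma = bn.weight.detach().abs()                     # BN scale magnitudes
keep = torch.argsort(gamma, descending=True)[: int(64 * keep_ratio)]

# Slice the conv/BN parameters down to the surviving channels.
slim_conv = nn.Conv2d(3, len(keep), 3, padding=1)
slim_conv.weight.data = conv.weight.data[keep].clone()
slim_conv.bias.data = conv.bias.data[keep].clone()
slim_bn = nn.BatchNorm2d(len(keep))
slim_bn.weight.data = bn.weight.data[keep].clone()
slim_bn.bias.data = bn.bias.data[keep].clone()

x = torch.randn(1, 3, 64, 64)
print(slim_bn(slim_conv(x)).shape)   # torch.Size([1, 32, 64, 64])
```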
The study by Ma, Dong et al. (2023), titled “Forest-PointNet: A Deep Learning Model for Vertical Structure Segmentation in Complex Forest Scenes”, offers a semantic segmentation technique centered on the Forest-PointNet model, created specifically to recognize the vertical structure of forests from terrestrial LiDAR data. The model takes advantage of the PointNet structure and uses an optimization strategy that improves the extraction of local features. When applied to semantic segmentation in complex forest environments, it maintains important spatial characteristics, ensuring precise identification of forest components. The data inputs are terrestrial LiDAR scans that capture point clouds of forest habitats, although particular datasets are not mentioned. While the deep learning framework used is not stated, the Forest-PointNet model performs well: it achieves an average recognition accuracy of 90.98%, around 4% better than existing approaches, in particular PointCNN and PointNet++ [22].
The model outperforms segmentation techniques based on three-dimensional structural reconstruction and surpasses traditional machine learning techniques by doing away with manual feature engineering. The study indicates that the Forest-PointNet model offers a viable approach for semantic segmentation tasks in varied forest landscapes, demonstrating robust efficiency and adaptability in complicated environments, even though code availability has not been indicated.
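Since Forest-PointNet builds on PointNet, a minimal PointNet-style segmentation sketch may help; layer sizes and the five example classes are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TinyPointSeg(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.local = nn.Sequential(      # shared per-point MLP via 1D convs
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU())
        self.head = nn.Sequential(       # classify each point
            nn.Conv1d(128 + 128, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1))

    def forward(self, pts):              # pts: (batch, 3, num_points)
        local = self.local(pts)                          # (B, 128, N)
        glob = local.amax(dim=2, keepdim=True)           # global max feature
        glob = glob.expand(-1, -1, pts.shape[2])         # broadcast to points
        return self.head(torch.cat([local, glob], 1))    # (B, classes, N)

cloud = torch.randn(2, 3, 1024)          # two clouds of 1024 XYZ points
print(TinyPointSeg()(cloud).shape)       # torch.Size([2, 5, 1024])
```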
A ground-based LiDAR point cloud semantic segmentation technique for complex forest undergrowth scenarios is presented by Li, Liu et al. (2023). The authors build forestry point cloud datasets fused with undergrowth point cloud features and use a DMM module as the basis of a deep learning technique, pointDMM, for semantic segmentation. The forestry dataset was gathered with backpack-style LiDAR equipment. The study also proposes a point cloud data annotation method based on single-tree positioning to address the challenges of occlusion in forestry environments, sparse distribution, the lack of an existing database, large scene scales, and the high data volume of point clouds representing forestry resource environments [18].
The study utilized the DMM module to integrate tree features and an energy segmentation function to build a key segmentation graph, with the goal of addressing the less-than-ideal fractal structures and the attributes of large data volumes, large-scale scenes, uneven sparsity, disorder, and diversity in forestry environments. The cut pursuit algorithm is then employed to solve the graph and accomplish semantic pre-segmentation. The approach closes a gap in current deep models for point clouds of complicated forestry environments, which exhibit severe occlusion, difficult terrain, multiple-return information, high density, and unequal scales. The authors provide pointDMM, an end-to-end deep learning model that significantly enhances the intelligent analysis of complicated forestry scenes by training a multi-level lightweight deep learning network.
The approach shows good segmentation results on the DMM dataset, with a 21% improvement in the identification accuracy of live trees compared to other methods and an overall accuracy of 93% on the large-scale forest environment point cloud dataset DMM-3. It offers major benefits over manual point cloud segmentation when retrieving feature data from TLS-acquired artificial forest point clouds, and it lays the groundwork for forestry informatization, intelligence, and automation.
Moving further, the segmentation strategy covered in the article by Mazhar, Fakhar et al. (2023) makes use of convolutional neural networks (CNNs) to accomplish semantic segmentation, with an emphasis on encoder-decoder topologies. Semantic segmentation, widely used in medical imaging applications, involves assigning a class to each pixel in an image. The CNN’s encoder module is in charge of extracting feature maps from the input images; the decoder then upsamples these feature maps to regain the spatial resolution and generates precise pixel-by-pixel segmentation predictions [24].
The study highlights several CNN architectures that perform well in semantic segmentation tasks. Famous for its ease of use and efficacy, the U-Net architecture is one such model that is widely used in medical image segmentation. Another noteworthy architecture, designed to learn discriminative features, is the Dense-Res-Inception Net (DRINet), which has shown usefulness in segmenting CT images of the abdomen, brain tumors, and the brain. The high-quality multi-scale encoder-decoder network (HMEDN) uses dense multi-scale connections to provide the accurate semantic information needed for pelvic CT scans and multi-modal brain tumor segmentation. Furthermore, Fully Convolutional Networks (FCNs) trained with Dice loss and cross-entropy are evaluated for their uncertainty estimation and segmentation quality, especially in applications related to the brain, heart, and prostate.
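The combined Dice and cross-entropy objective mentioned here can be sketched for binary segmentation as follows; the smoothing constant and weighting are common conventions rather than values from the reviewed studies.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, smooth=1.0, ce_weight=0.5):
    """logits: (B,1,H,W) raw scores; target: (B,1,H,W) with values in {0,1}."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + smooth) / (denom + smooth)      # per-image Dice loss
    ce = F.binary_cross_entropy_with_logits(logits, target.float())
    return ce_weight * ce + (1 - ce_weight) * dice.mean()   # weighted sum

loss = dice_ce_loss(torch.randn(2, 1, 64, 64),
                    torch.randint(0, 2, (2, 1, 64, 64)))
print(float(loss))
```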
The study lists a number of more advanced models, including the Multi-Scale Residual Fusion Network (MSRF-Net), which makes use of a Dual-Scale Dense Fusion (DSDF) block to improve multi-scale feature communication, and INet, which uses overlapping maximum pooling for sharper feature extraction. These techniques show significant improvements in model training efficiency and segmentation accuracy.
A wide range of medical imaging modalities are represented in the data types and input images used in these studies, including biomedical MRI, X-rays, endoscopic imaging, mammograms, brain CT scans, brain tumor images, abdominal CT scans, pelvic CT scans, multi-modal brain tumor datasets, prostate CT scans, heart CT scans, and images for pattern detection of interstitial lung disease (ILD). These diverse image types pose different segmentation opportunities and problems, demonstrating the adaptability and strength of CNN-based methods.
It is implied that well-known frameworks like TensorFlow or PyTorch are probably utilized given their extensive use in the area, even though the precise deep learning frameworks used to create these models are not stated explicitly.
The study’s findings highlight CNNs’ impressive performance across a variety of medical image segmentation tasks, especially when using encoder-decoder architectures. Notable results include greater segmentation quality and uncertainty estimation with FCNs trained using Dice loss, improved multi-scale feature communication with the MSRF-Net, and improved accuracy and streamlined training procedures with the U-Net model and its robust connections.
To add further, another study by Li, Liu et al. (2023) offers a sophisticated method for semantic segmentation of point clouds in intricate forest undergrowth situations using ground-based LiDAR data. The fundamental approach uses a deep learning method called pointDMM, which effectively pre-segments semantics by utilizing a DMM unit and the cut pursuit algorithm. LiDAR point cloud data is the main type of imagery used; it is painstakingly gathered using backpack-style LiDAR equipment, guaranteeing thorough coverage of forestry areas. The DMM dataset, particularly the large forest habitat point cloud dataset identified as DMM-3, is central to the investigation.
Given the nature of the deep learning techniques involved, it is reasonable to assume that TensorFlow or a comparable framework was employed, even though the precise framework is not stated. The segmentation method efficiently addresses the difficulties presented by occlusion, high density, complex topography, and uneven scales in forested environments; it involves the construction of a crucial segmentation graph and utilizes an energy segmentation function.
The study presents important results, one of which is the 93% accuracy on the DMM-3 dataset. Compared to current techniques, this accuracy represents a significant 21% boost in live tree recognition accuracy, demonstrating how well the pointDMM approach handles the complex and varied features found in forestry point cloud data. The method has significant advantages for the collection of feature information from artificial forest point clouds generated by terrestrial laser scanning (TLS), underscoring its potential to advance technology, intelligence, and informatization in the forestry area.
Nevertheless, because the code utilized in this study is not publicly available, it is not obvious whether the implementation details are accessible for further study or practical application. In conclusion, this study presents a strong ground-based LiDAR point cloud semantic segmentation method using pointDMM and shows appreciable gains in segmentation precision and feature extraction capabilities; however, the availability of the underlying code remains unknown.
Another study, by (Zhang, Li et al. 2022), analyzes the effects of the SE Block and RAM on semantic segmentation performance by contrasting three network variants: one without the SE Block and RAM, one with only the SE Block, and the proposed SERNet. The SE Block improves the mean intersection over union (mIoU) by 1.49%, the accuracy factor (AF) by 1.29%, and the overall accuracy (OA) by 1.40%, boosting feature representation and segmentation accuracy, particularly for the “Surface” and “Car” categories. RAM raises the mIoU by 0.31%, the AF by 0.41%, and the OA by 0.41%, improving performance only slightly because of its focus on global information. The ISPRS Vaihingen and Potsdam datasets, which include DSM (Digital Surface Model) and IRRG (Infrared, Red, Green) images, were used for this assessment [42].
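For reference, the squeeze-and-excitation mechanism evaluated above can be sketched in its standard form; SERNet’s exact block configuration may differ from this minimal PyTorch version.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: recalibrates channel responses."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # recalibrate the feature maps
```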
TensorFlow and PyTorch are common frameworks used in this kind of research. The findings show that the proposed SERNet model obtains improved segmentation accuracy when DSM data is included, especially for vegetation categories. The study does, however, acknowledge certain limitations, including possible feature redundancy and adverse mutual influence resulting from the straightforward fusion method used to combine the DSM and IRRG data. Furthermore, SERNet’s large number of parameters increases its computational overhead.
Overall, the study emphasizes the importance of recalibrating features and transferring information across the network to improve semantic segmentation accuracy, especially for high-resolution remote sensing images (HRRSIs). Although the article presents encouraging findings, it does not mention where the code is available for replication or further research.
3.3. Instance Segmentation in Forestry
The earth’s ecosystem relies on forests, which provide a habitat for many different types of plants and animals. Forestry studies rely on precise identification, mapping, and monitoring of various tree species, which is made possible through instance segmentation. The task of instance segmentation involves recognizing and pinpointing distinct objects in an image, while also labeling each object accordingly.
In the 2018 paper “Instance Segmentation in Very High-Resolution Remote Sensing Imagery Based on Hard-to-Segment Instance Learning and Boundary Shape Analysis” by Gong et al. [7], the authors segmented individual trees in forest areas. They used a binary classification scheme to distinguish areas containing trees from those without. They used a U-Net, and as raw data they relied on very high-resolution aerial photographs, which helped them detect and separate individual trees. The study did not distinguish between different tree species.
The authors preprocessed the images by resizing them to a fixed size and normalizing the pixel values. They also used data augmentation methods, such as rotation and flipping, to enlarge their collection. They then trained their U-Net model on a set of manually annotated images, teaching it to predict, for each pixel, the probability that it belongs to a tree. Evaluating the model on a test set, the authors found it to be very accurate. As for code availability, the authors of this work have not made their code public.
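As an illustration of the kind of encoder–decoder network involved, a heavily simplified two-level U-Net for a binary tree/background task might look as follows; this is a sketch only, and the authors’ actual network is deeper and configured differently.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net producing a per-pixel tree/background logit."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)   # 64 = 32 upsampled + 32 skip channels
        self.head = nn.Conv2d(32, 1, 1)  # 1 channel: P(tree) logit

    def forward(self, x):
        e1 = self.enc1(x)                           # encoder level 1 (kept as skip)
        e2 = self.enc2(self.pool(e1))               # encoder level 2 (bottleneck)
        d1 = self.up(e2)                            # decoder: upsample back
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # fuse with skip features
        return self.head(d1)                        # per-pixel logits
```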
The objective of Panagiotidis et al. [30] was to identify individual trees and estimate their diameters from UAV images. The method detected tree crowns and estimated diameters, categorizing areas into tree and non-tree classes. Trees were only segmented, without any distinction between species.
A Deep CNN was utilized to identify tree crowns and estimate tree diameters. Convolutional neural networks (CNNs) are a type of deep, feed-forward artificial neural network typically used for visual imagery analysis, processing data with a grid-like structure. A Deep CNN is an extension of the CNN: it includes more layers than a conventional CNN, allowing it to extract more complicated features from the input data.
The fundamental distinction between a CNN and a Deep CNN is the depth of the network. Deep CNNs include additional layers, which allow the extraction of more complicated features but also increase computational complexity. The two terms are now used almost synonymously, and most CNNs in practice are actually Deep CNNs [41]. The complexity of the problem to be solved determines whether a CNN or a Deep CNN is appropriate [10]. The authors utilized a modified U-Net architecture with a VGG-16 encoder for feature extraction. A large collection of manually annotated images was used to train the network.
To estimate the diameter, they used a regression model that takes the DCNN’s extracted features as input and predicts the tree’s diameter; the diameter measurements used for training were collected in the field. Regrettably, the authors did not make their code accessible.
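The paper’s exact architecture and training details are not public; a plausible sketch of the described pattern, assuming a VGG-16 encoder feeding a small regression head trained against field-measured diameters, is:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DiameterRegressor(nn.Module):
    """CNN feature extractor followed by a regression head that
    predicts a single scalar (tree diameter)."""
    def __init__(self):
        super().__init__()
        backbone = vgg16(weights="IMAGENET1K_V1")
        self.features = backbone.features   # convolutional encoder
        self.pool = nn.AdaptiveAvgPool2d(1)  # -> (N, 512, 1, 1)
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1),               # predicted diameter
        )

    def forward(self, x):
        return self.regressor(self.pool(self.features(x)))

# Trained with a standard regression loss against field measurements, e.g.:
# loss = nn.MSELoss()(model(images).squeeze(1), measured_diameters)
```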
Hao et al. [11] demonstrated the use of convolutional neural networks to segment tree crowns from terrestrial laser scanning data. The authors aimed to improve forest inventory and management by delineating individual tree crowns as separate entities.
The authors used terrestrial laser scanning data to segment tree crowns and to collect precise information about tree structure, such as leaves, branches, and trunks. They used binary segmentation to classify each point in the point cloud as tree or non-tree, without differentiating between tree species. For segmenting individual tree crowns, the researchers used a two-stage approach: first, a segmentation network classified each point in the point cloud as tree or non-tree; then, a region-growing algorithm clustered adjacent tree points into distinct tree crowns (a minimal sketch of this clustering step follows below).
The segmentation network used in the study was the U-Net architecture with residual connections. During the training of the segmentation network, both labeled and unlabeled data were utilized. Labeled data consisted of manually segmented tree crowns, while unlabeled data consisted of unsegmented terrestrial laser scanning data. Data augmentation techniques were employed to increase the quantity of training data. Unfortunately, the code associated with the paper is not publicly available.
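The region-growing step mentioned above can be illustrated with a minimal flood-fill clustering over a k-d tree; the radius parameter and data layout here are assumptions for the sketch, not the paper’s values.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_grow(points, is_tree, radius=0.5):
    """Cluster tree-labeled points into individual crowns.

    points:  (N, 3) point cloud
    is_tree: (N,) boolean output of the semantic segmentation stage
    radius:  distance within which points are considered connected
    Returns a (N,) array of crown ids (-1 for non-tree points).
    """
    ids = np.full(len(points), -1, dtype=int)
    tree_idx = np.flatnonzero(is_tree)
    kdt = cKDTree(points[tree_idx])
    current = 0
    for seed in range(len(tree_idx)):
        if ids[tree_idx[seed]] != -1:
            continue                 # already assigned to a crown
        stack = [seed]
        ids[tree_idx[seed]] = current
        while stack:                 # flood fill over nearby tree points
            p = stack.pop()
            for q in kdt.query_ball_point(points[tree_idx[p]], radius):
                if ids[tree_idx[q]] == -1:
                    ids[tree_idx[q]] = current
                    stack.append(q)
        current += 1
    return ids
```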
In Ostovar et al. [29], the authors used RGB images to detect and classify tree species. Rather than segmenting individual trees or other objects, they assigned the entire image to a particular tree species.
The authors used a deep learning method that combined a CNN with an SVM, training the model on a set of RGB images of trees from different species. The CNN was used to extract features from the raw images, which were then passed to the SVM for classification.
With this method, the authors were able to classify different tree species with high accuracy. They also compared their approach to other well-established machine learning methods, showing that their deep learning method performed better. The authors did not state whether their code is available.
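A minimal sketch of this CNN-plus-SVM pattern, assuming a ResNet-18 feature extractor in place of the authors’ unspecified CNN, is:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from sklearn.svm import SVC

# CNN as a fixed feature extractor: drop the classification layer.
backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor):
    """images: (N, 3, 224, 224) normalized RGB batch -> (N, 512) features."""
    return backbone(images).numpy()

# SVM classifier trained on the CNN features:
svm = SVC(kernel="rbf")
# svm.fit(extract_features(train_images), train_species_labels)
# predictions = svm.predict(extract_features(test_images))
```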
In “A novel deep learning method to identify single tree species in UAV-based hyperspectral images”, Miyoshi et al. [25] used deep learning methods to detect and identify trees from LiDAR and hyperspectral images. They segmented individual trees and classified them by species.
The authors used a region proposal network (RPN) based on the Faster R-CNN design to segment the trees into instances. This network extracts features from the LiDAR point clouds and hyperspectral images. The RPN generates object proposals, which are then refined by a region-based fully convolutional network to produce accurate segmentation masks for each tree.
The authors used a DCNN with a ResNet-50 architecture to classify the tree species. The DCNN was trained on a large set of hyperspectral images labeled with the corresponding tree species. The final classification results were obtained by combining the tree segmentation masks with the species classifications. The authors evaluated their method on a set of LiDAR and hyperspectral images from a mixed-species forest, achieving high accuracy in both tree detection and species identification. The source code for this study is available on GitHub.
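Since Miyoshi et al. fuse LiDAR and hyperspectral inputs, which off-the-shelf detectors do not support directly, the following RGB-only sketch using torchvision’s reference Faster R-CNN only illustrates the RPN-based detection stage, not the paper’s multimodal pipeline.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_trees(image: torch.Tensor, score_thresh: float = 0.5):
    """image: (3, H, W) float tensor in [0, 1]. Returns kept boxes and scores."""
    out = model([image])[0]            # RPN proposals refined into final boxes
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["scores"][keep]
```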
In Ocer et al. [28], the authors proposed a deep learning method for segmenting individual trees in UAV (unmanned aerial vehicle) images. They segmented the trees without distinguishing between species.
For instance segmentation of the trees, the authors used Mask R-CNN, a region-based convolutional neural network extended with a mask prediction branch. They first resized and normalized the images, and then trained the Mask R-CNN model on labeled data consisting of UAV images of trees with ground-truth annotations indicating the location and extent of each tree.
The authors evaluated their method on a test dataset and found it to be accurate according to metrics such as intersection over union (IoU) and mean average precision (mAP). They also compared their method to other state-of-the-art approaches and found that it performed better. Unfortunately, the authors did not make their code publicly available.
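For reference, the IoU metric used in such evaluations reduces, for a pair of binary masks, to:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0
```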
In the paper “Tree species classification of drone hyperspectral and RGB imagery with deep learning convolutional neural networks” by Nezami et al. [27], the authors focused on identifying tree species using hyperspectral imagery and CNNs. No segmentation was performed in this study.
The authors used a field spectrometer and an airborne hyperspectral imaging sensor to gather hyperspectral data, which they then used to train a CNN to classify tree species. Species such as birch, cedar, fir, larch, pine, and spruce, along with other forest species, were used as classes. The CNN comprised three convolutional layers and three fully connected layers. Before feeding the hyperspectral data into the CNN, the authors also applied principal component analysis (PCA) to reduce the dimensionality of the data.
On their test set, the authors achieved a classification accuracy of more than 90%. They also carried out experiments to determine how variables such as the number of spectral bands and the size of the training set affected classification accuracy. No code was released with the study.
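The PCA step can be sketched as follows, treating each pixel’s spectrum as one sample; the band count and number of components here are illustrative, not the paper’s values.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Reduce a hyperspectral cube (H, W, B) to (H, W, n_components)
    by applying PCA across the spectral dimension."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)  # one spectrum per pixel
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)
```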
He et al. [12], in their paper “Generative adversarial networks-based semi-supervised learning for hyperspectral image classification”, discuss the detection and segmentation of individual trees from high-resolution remote sensing imagery. They utilized a Generative Adversarial Network (GAN) for semi-supervised tree detection and instance segmentation, using two categories, trees and background, to identify and isolate each tree present in the image.
According to the authors, the GAN was trained with labeled data to differentiate between various types of trees. GANs have broad applications, particularly in image recognition and segmentation. One notable variant is CycleGAN, a type of GAN that translates images from one domain to another without requiring paired training data.
Consider, for example, a collection of trees of several species for which only limited labeled data is available. CycleGAN can help bridge this data gap by synthesizing labeled data through image-to-image translation.
The authors reported that a Mask R-CNN model was used for the actual tree detection and instance segmentation. This model goes beyond answering ‘what’ and ‘where’ to also address ‘how many’: it not only detects trees but also distinguishes between individual instances. The code for this paper can be found on the authors’ GitHub page.
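A minimal sketch of Mask R-CNN inference with torchvision’s reference implementation, returning scored per-instance masks and an instance count (the ‘how many’ above); the authors’ trained model and classes differ.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def segment_instances(image: torch.Tensor, score_thresh: float = 0.5):
    """image: (3, H, W) float tensor in [0, 1].
    Returns per-instance binary masks and the instance count."""
    out = model([image])[0]
    keep = out["scores"] > score_thresh
    masks = out["masks"][keep, 0] > 0.5  # (K, H, W) boolean masks
    return masks, int(keep.sum())
```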
Based on an additional study conducted by (Wielgosz, Puliti et al. 2023), a new framework for point cloud instance segmentation is presented that is modular and flexible, allowing pipeline components to be added or removed as needed. This flexibility is essential because specific components, such as the instance segmentation module, can be swapped for new or different modules. These modules can be implemented in other languages, such as Java or C++, and still work seamlessly with the overall framework, which is valuable for researchers and developers who may need a particular language for a given task or optimization [38].
An optimization module at the heart of the system is designed to improve segmentation by tuning key parameters. The framework also includes visualization tools for important parameters, which help in understanding and adjusting the segmentation models’ performance. The study specifically highlights the hyperparameter tuning procedure for the TLS2trees instance segmentation pipeline. This pipeline, originally designed for tropical forests, was adapted and optimized for coniferous forest settings, demonstrating the framework’s adaptability to different environmental conditions.
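The optimization module’s internals are not described here; as a generic illustration, a hyperparameter search over a segmentation pipeline might be sketched as follows, with hypothetical parameter names that do not correspond to the actual TLS2trees settings.

```python
from itertools import product

# Hypothetical hyperparameters and candidate values; the real TLS2trees
# parameters differ and are described in the paper.
grid = {
    "min_cluster_size": [20, 50, 100],
    "height_threshold": [0.5, 1.0, 2.0],
}

def tune(run_pipeline, score_fn, validation_set):
    """Exhaustively search the grid, keeping the best-scoring setting.

    run_pipeline(params, data) -> segmentation result
    score_fn(result, data)     -> quality metric (higher is better)
    """
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = sum(score_fn(run_pipeline(params, d), d) for d in validation_set)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```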
Additional experiments assessed how a semantic segmentation model designed specifically to recognize coniferous tree stems affected the overall accuracy of tree instance segmentation. The investigation showed that, for the given data, hyperparameter tuning greatly enhanced the quality of the segmentation output. Nevertheless, when these tuned parameters were applied to an external dataset, LAUTx, performance was noticeably worse than with the default values. This finding implies that larger collections of publicly available annotated point cloud data, covering a wider range of forest types than those employed in this work, are required to obtain a more robust and transferable set of hyperparameters.
The study also explored the effects of different semantic segmentation models, namely the P2T-based p2t semantic model and the fsct semantic model. When optimized, the choice between these semantic segmentation models had a negligible effect on instance segmentation accuracy. However, in dense forests or forests with many low branches, such as those containing non-self-pruning species, instance segmentation based on the p2t semantic model was less sensitive to hyperparameter choices, making it more resilient.
To sum up, the research presents a versatile and adaptable framework for point cloud instance segmentation, highlighting the need for fine-tuning hyperparameters and for a variety of annotated datasets to enhance the robustness and applicability of segmentation models. Although the study shows notable advancements in certain situations, it also identifies areas in which additional research and refinement are required to increase the framework’s generalizability across other forest types. The article makes no mention of the framework’s code being available, which may limit future researchers and developers looking to build on this work.
Another study, by (Zvorișteanu, Caraiman et al. 2022), proposes a solution for semantic instance segmentation that combines state-of-the-art techniques from semantic instance segmentation and optical flow. Its main objective is to reconcile the frequently incompatible requirements of high precision and real-time processing. Semantic instance segmentation approaches have typically prioritized either instance mask accuracy or real-time speed at the cost of some accuracy; this approach aims to achieve both [43].
The fundamental technique uses a novel inference approach to reduce processing cost while preserving high frame rates. Rather than running the network on every frame, the framework performs inference only on every fifth frame of the video stream. For the intervening frames, it warps the results of the semantic instance segmentation network using computed motion maps, efficiently propagating the segmentation across frames. This greatly shortens per-frame processing time, enabling the system to run at up to 50 frames per second (fps) on 1280 × 720 pixel video frames.
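A rough sketch of this keyframe-plus-warping scheme using OpenCV’s Farneback optical flow follows; the paper’s actual motion estimation method and segmentation network are not specified here, so this is an illustration of the idea rather than the authors’ implementation.

```python
import cv2
import numpy as np

def propagate_mask(key_mask, key_gray, cur_gray):
    """Warp a keyframe's segmentation mask onto the current frame
    using dense optical flow (backward warping).

    key_mask: uint8 label mask computed on the keyframe
    key_gray, cur_gray: grayscale keyframe and current frame
    """
    # Flow from the current frame back to the keyframe: for each pixel
    # in the current frame, where did it come from in the keyframe?
    flow = cv2.calcOpticalFlowFarneback(cur_gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = cur_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + flow[..., 0]).astype(np.float32)
    map_y = (ys + flow[..., 1]).astype(np.float32)
    return cv2.remap(key_mask, map_x, map_y, cv2.INTER_NEAREST)

# In the main loop: run the segmentation network only when
# frame_idx % 5 == 0; otherwise reuse the latest keyframe result
# via propagate_mask().
```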
To further improve accuracy, depth maps are incorporated into the framework. The depth map aids in the segmentation process by restricting the data to a particular range, guaranteeing that the technique retains a high degree of precision while operating in real-time. This field has advanced significantly with the simultaneous focus on speed and accuracy, especially for applications that need to handle high-resolution video streams in real-time.
The paper gives a thorough explanation of the method and its benefits, but it does not name the specific frameworks (such as PyTorch or TensorFlow) that were employed. The details of the underlying technology stack therefore remain unknown. Furthermore, there is no indication that the source code for this method is available, so it is unclear whether others in the field can access or replicate the implementation.
In conclusion, this approach to semantic instance segmentation achieves real-time performance without sacrificing accuracy by combining semantic instance segmentation with optical flow. By employing motion maps for intermediate frames and running inference only on every fifth frame, the system can process video streams at 50 frames per second on images with a resolution of 1280 × 720. Depth maps increase accuracy further by concentrating the data within a narrow range. Which particular frameworks were used, and whether the code is publicly available, remain unclear, leaving some implementation details unknown.
Further, a study by (Guo, Gao et al. 2022) reported somewhat different results. The key components of its approach to instance segmentation of pests in complex natural environments are an upgraded Mask R-CNN model with a Swin Transformer backbone and a Feature Pyramid Network (FPN) for improved feature extraction. The process starts with Labelme v4.9 annotations and then adds FPN and Swin Transformer blocks to the Mask R-CNN framework. Region proposals are generated by a Region Proposal Network (RPN), and accurate feature alignment is ensured by RoIAlign. The training settings consist of a batch size of 2, a learning momentum of 0.90, a weight decay of 0.05, a learning rate of 0.00001, and a training duration of 6 hours across 100 epochs. The three main evaluation criteria are precision, recall, and F1 score [9].
The dataset comprises 987 JPEG images, each with a resolution of 6240 × 4160 pixels, taken at the China Academy of Forestry Sciences’ Tropical Forestry Experimental Center in 2021 using a mobile device. It comprises 798 training images and 198 test images captured in natural daylight under varied lighting conditions. The model is implemented in TensorFlow, with Python 3.7 as the programming language. The system is equipped with an Intel Xeon CPU E5-2643 v4, 96 GB of RAM, and an NVIDIA Tesla K40c GPU with 12 GB of memory.
The reported performance measures are a precision of 87.23%, a recall of 90.95%, and an F1 score of 89.01%. Compared to standard Mask R-CNN models with ResNet50 (MR50) and ResNet101 (MR101) backbones, these results represent a significant improvement in segmentation accuracy. The enhanced model shows strong segmentation performance, handling scenes with overlapping, occluded, shaded, and unevenly lit targets. Integrating the Swin Transformer and FPN markedly improves pest segmentation accuracy, and the robustness of the approach is supported by a comprehensive validation process conducted under a variety of scenarios and a well-documented experimental setup.
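For reference, the reported F1 score follows from the stated precision and recall as their harmonic mean:

```python
precision, recall = 0.8723, 0.9095
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # -> 0.8905, matching the reported ~89.01% up to rounding
```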
As the study by (Guan, Miao et al. 2022) notes, deep learning techniques for monitoring forest fires have made significant progress in the past few years. Safeguarding forest resources and understanding the geographic spread of forest fires depend on using drone technology and refining current models to improve segmentation quality and recognition accuracy. Because fires spread quickly and exhibit erratic behavior, it can be difficult to detect them effectively in complex situations [8].
This work tackles two deep-learning challenges using the FLAME aerial imaging collection. First, it achieves an identification rate of 93.65% by classifying video clips into two categories (no-fire and fire) using a channel domain attention mechanism as a novel image classification method. Second, it proposes a new instance segmentation technique for early-stage forest fire detection and segmentation, dubbed MaskSU R-CNN, which is based on the MS R-CNN model. To minimize segmentation errors, a U-shaped network is used to reconstruct the MaskIoU branch. MaskSU R-CNN outperforms numerous state-of-the-art segmentation models in the experiments, with a precision of 91.85%, recall of 88.81%, F1-score of 90.30%, and mean intersection over union (mIoU) of 82.31%.
The primary contributions of the paper are devising a novel attention mechanism (the DSA module) to improve feature channel representation, incorporating it into ResNet to enhance feature extraction, and rebuilding the MaskIoU branch of the MS R-CNN using a U-shaped network. The outcome is the MaskSU R-CNN model, which is very effective at identifying and classifying forest fires in their early stages. With its adaptable architecture and strong performance, the MaskSU R-CNN model shows great promise for autonomous fire monitoring across vast forest regions.
The study by (Sani-Mohammed, Yao et al. 2022), “Instance segmentation of standing dead trees in dense forest from aerial imagery using deep learning”, aimed to map standing dead trees. Particularly in natural forests, identifying standing dead trees is crucial for assessing the overall health of the forest, its capacity to store carbon, and the preservation of biodiversity. Natural forests tend to cover large areas, which makes traditional field surveying extremely difficult, unsustainable, time-consuming, and labor-intensive. Thus, an economical, automated method is required for efficient forest management. Since the advent of deep learning, machine learning methods have produced remarkable results in this regard [33].
Employing a small training dataset of 195 images, this work offered an improved Mask R-CNN deep learning method for recognizing and categorizing standing dead trees in dense mixed forest using CIR aerial photography. First, image augmentation was combined with transfer learning to mitigate the limited size of the training dataset. Next, the authors carefully chose the hyperparameters of their model to match the structure of their data, namely images of dead trees. Lastly, a test dataset that the model had not seen during training was used for a thorough evaluation of the model’s generalization capacity.
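The described transfer-learning setup can be sketched with torchvision’s standard fine-tuning recipe, replacing the COCO-pretrained heads with heads for a two-class (background/dead tree) problem; the authors’ exact configuration and hyperparameters may differ.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_dead_tree_model(num_classes: int = 2):  # background + dead tree
    """Start from a COCO-pretrained Mask R-CNN and replace its heads,
    following the standard torchvision transfer-learning recipe."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box classification head.
    in_feats = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)
    # Replace the mask prediction head.
    in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
    return model
```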
Despite the rather low resolution (20 cm) of the dataset, the model produced encouraging results, with a mean average precision, mean recall, and mean F1-score of 0.85, 0.88, and 0.87, respectively. The approach may thus find application in automating the identification and segmentation of standing dead trees for improved forest management, which is important both for estimating carbon storage in forests and for preserving biodiversity.