1. Introduction
Pavement cracking is a common type of damage that significantly reduces the service life of roads and poses a safety risk to road users [1,2,3]. Visual interpretation remains the main approach to crack detection, but it is inefficient and prone to subjective errors. Over the past decade, various automatic or semi-automatic methods have been proposed, including the use of sensors such as line-scan cameras, RGB-D sensors, and laser scanners [4,5,6]. However, these sensor-equipped vehicles are costly, tend to disrupt traffic, and are restricted to certain road types.
Nowadays, unmanned aerial vehicles (UAVs) have emerged as efficient and versatile tools for structural inspection [7,8]. UAV-based road crack detection offers significant advantages, including efficient, cost-effective, safe, and flexible image data acquisition [9]. In recent decades, digital image processing has been utilized for crack segmentation [10]. These approaches often require manual feature extraction, which can overlook the interdependence between cracks and lead to unsatisfactory results in practice [11]. UAV remote sensing has largely solved the data-acquisition problem for crack detection; the main remaining challenge is how to detect and measure cracks quickly and accurately.
Machine learning-based detection methods have developed rapidly in recent years and are becoming the mainstream approach for crack detection [12,13,14]. These algorithms require different preprocessing of the image to be detected, which can be time-consuming for large images. With the rapid development of deep learning, new ideas have been introduced into various computer vision tasks [15,16,17]. Semantic segmentation is often preferred for crack detection because it provides more accurate and useful road health information, such as crack distribution, width, length, and shape. Liu et al. combined an FCN with a deeply supervised network (DSN) to propose DeepCrack [18], a multi-scene crack detection algorithm built on the idea of deep supervision. Ren et al. [19] proposed an improved CrackSegNet for pixel-level crack segmentation of tunnel surfaces, which improves accuracy and generalization through spatial pyramid pooling and skip-connection modules. Liu et al. [20] proposed the use of U-Net for automatic crack detection, but it may produce redundant detections due to background interference. Wang et al. [21] combined a CNN with a transformer to construct an efficient feedforward network for global feature extraction. These models often suffer from computational delay and inefficiency due to their large number of parameters and computational redundancy. The problem becomes particularly acute when dealing with large amounts of data, resulting in high computational overhead.
To reduce computational costs, researchers have proposed lightweight networks. Many early lightweight models prioritize speed over spatial detail, which can lead to a loss of precision, particularly in boundary regions. Yu et al. introduced the Bilateral Segmentation Network (BiSeNet) [22], which processes semantic and detail information separately. Because cracks are narrow, have irregular edges, and are easily confused with the road background, lightweight models often lack the ability to extract edge information effectively, which can significantly impact detection accuracy. The Short-Term Dense Cascade (STDC) module was proposed as a solution to this issue [23]. The STDC segmentation network (STDC-Seg) uses the STDC module to extract multi-scale information, addressing the backbone limitations of BiSeNet. Overall, STDC-Seg is a suitable segmentation architecture for road crack detection.
Quantitative extraction of physical crack information is a downstream task of crack detection. To acquire this information, researchers combine crack detection algorithms with crack quantification algorithms to provide a safe and effective solution for road crack detection. Liao et al. [24] used a spatially constrained strategy on lightweight CNNs to ensure crack continuity. Yang et al. [25] attempted quantitative analysis of detected cracks at the pixel level, but the quantitative results did not meet expectations. Li et al. [26] proposed a pixel-level crack segmentation method that fuses SegNet with a dense conditional random field and calculates the width and area of one-way and grid cracks. In general, the density, width, and length of cracks provide an important reference for road health evaluation, and accurate crack detection results form the foundation for extracting these elements.
Because most cracks are narrow, crack edge information is essential for accurate localization and segmentation. In addition, in most crack detection tasks, irregular crack shapes, rough road surfaces, lighting, shadows, and other factors interfere with localization and segmentation. Accurate detection of crack edges is therefore crucial for semantic segmentation networks to address these challenges and to extract quantitative information such as crack length and width. Tao et al. [27] designed a boundary awareness module in their approach, but its label-based learning is prone to misjudging background noise. Pang et al. [28] introduced a two-branch lightweight network for crack detection, but the lightweight design limits the network's ability to extract global information, so it easily misses small cracks. Holistically nested edge detection (HED) [29] and the side-output residual network (SRN) [30] are two edge detection networks built on the idea of deep supervision. Tsai et al. [31] fused edge detection results of different sizes extracted by the Sobel edge detector on the semantic branch. However, existing methods often struggle with weak perception of crack edge details and uneven crack distribution, which makes quantitative information extraction challenging.
To overcome these limitations, we propose a rapid road crack detection method for UAV remote sensing images. Specifically, we propose a real-time edge reconstruction crack detection network (ERNet) that integrates edge aggregation and enhancement into semantic segmentation. Inspired by infrared small target detection [32], we develop an edge input module utilizing a soft gating mechanism for edge reconstruction. Among the models in our comparison experiments, the proposed method achieves the best trade-off between inference speed and accuracy: mIoU and F1 scores of 82.48% and 79.67% on the Crack500 dataset, 86.60% and 84.86% on the DeepCrack dataset, and 80.25% and 76.21% in the generalization experiment on our self-built UAV dataset. Comparative analysis demonstrates the feasibility and superiority of the method. Our main contributions are as follows:
(1) We propose a novel ERNet that achieves high-precision, fast crack edge segmentation through edge reconstruction and, on this basis, quantifies crack length and width, providing a complete solution from detection to information extraction.
(2) We design a key module, the bilateral decomposed convolutional attention module (BDAM), which effectively improves attention at both the spatial and channel levels, selectively represents features in the channel and spatial domains, and captures global contextual information.
The rest of the article is organized as follows. Section II describes the architecture of ERNet and its components in detail. Section III verifies the effectiveness of our method in improving the comprehensive performance of crack detection through experimental results. Section IV concludes the article.
2. Methodology
The difficulty of the crack detection task stems from the blurred boundary transitions of cracks, cluttered backgrounds, and foreground interference; accurate localization of the crack edge is the key to addressing these challenges. By reconstructing edge details, our proposed network improves the localization accuracy of crack edges, improves the coherence of the detection results, and provides accurate detections for quantitative crack extraction. The overall structure of the network is shown in Figure 1.
The network uses a three-branch structure to encode features at different levels: an edge path for extracting and preserving high-frequency features, a spatial path for preserving detailed information, and a semantic path for extracting deep semantic features. In the semantic path, we use the STDC module for local feature extraction and the BDAM for global feature extraction. In the edge path, we feed the high-frequency and semantic information into the edge reconstruction module to encode edge features and exploit the key damage boundary information. In the spatial path, we use shallow, wide convolution layers to achieve fast downsampling while preserving spatial details.
In this section, we first introduce the backbone network, then the bilateral decomposed convolutional attention module, and finally describe in detail the side input branch for edge detection and the feature fusion module.
2.1. Backbone
Our proposed model uses the STDC module as the feature extractor and retains the spatial branch, adopting the STDC-Seg backbone as the ERNet backbone. A ConvX operation comprises a convolution layer, a batch normalization layer, and a ReLU activation. We use feature maps of 1/8 size instead of 1/4 size as input to the spatial branch because this reduces computation while preserving sufficient spatial detail. The STDC module, the core component of the backbone, is shown in Figure 2.
Two types of STDC module are used: stride 1 and stride 2. The STDC module with a stride of 2 downsamples the feature maps, after which the STDC module with a stride of 1 performs further feature extraction. Each module is divided into several blocks; the number of filters in the $i$-th convolution layer is $N/2^i$, where $N$ is the number of output channels of the module, and the last two convolution layers have the same number of filters. The output feature map of the module is computed as equation (1):

$$X_{out} = F(x_1, x_2, \ldots, x_n) \quad (1)$$

where $X_{out}$ represents the module output, $F$ represents the concatenation fusion operation, and $x_1, x_2, \ldots, x_n$ are the feature maps of all blocks.
The output of the STDC module thus integrates the multi-scale information of all blocks. As the number of blocks increases, the receptive field grows, and the scalable receptive field and multi-scale information are retained through the fusion operation.
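To make the fusion in equation (1) concrete, the following PyTorch sketch implements a four-block STDC module under the conventions above; the 1×1 first block, the kernel sizes, and the pooled shortcut in the stride-2 variant are illustrative assumptions rather than the exact STDC-Seg implementation.

```python
import torch
import torch.nn as nn

class ConvX(nn.Module):
    """Conv + BatchNorm + ReLU, as described for the backbone."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class STDCModule(nn.Module):
    """Four-block STDC module: block i outputs N/2^i channels (the last two
    blocks share a width) and all block outputs are concatenated, Eq. (1)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.blocks = nn.ModuleList([
            ConvX(in_ch, out_ch // 2, kernel=1),             # N/2 channels
            ConvX(out_ch // 2, out_ch // 4, stride=stride),  # N/4, optional downsampling
            ConvX(out_ch // 4, out_ch // 8),                 # N/8
            ConvX(out_ch // 8, out_ch // 8),                 # N/8, same width as previous
        ])
        # In the stride-2 variant, the first block's output is pooled so that
        # all feature maps share the same resolution before concatenation.
        self.skip_pool = nn.AvgPool2d(3, stride=2, padding=1) if stride == 2 else None

    def forward(self, x):
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)
        if self.skip_pool is not None:
            outs[0] = self.skip_pool(outs[0])
        return torch.cat(outs, dim=1)  # X_out = F(x1, ..., xn)
```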
2.2. BDAM
The significance of global context in segmentation tasks has been confirmed by numerous previous studies [33,34,35,36,37]. Convolution-based methods enlarge the receptive field through increased kernel size or stride, whereas transformer-based methods [38,39] usually consider spatial adaptability but ignore channel adaptability, which is important for visual tasks.
To capture long-range relationships, we introduce decomposed convolution blocks and design the efficient bilateral decomposed convolutional attention module (BDAM). As illustrated in Figure 3, BDAM divides the large-kernel convolution into three parts: a depth-wise convolution that captures local context, multi-branch depth-wise dilated convolutions that capture multi-scale context, and a 1×1 convolution that establishes relationships between distinct channels. In other words, a K×K convolution is decomposed into a depth-wise convolution, multi-scale depth-wise dilated convolutions, and a 1×1 convolution. The BDAM is described as follows:

$$Att = \mathrm{Conv}_{1\times1}\Big(\sum_{i=1}^{2} \mathrm{DWDConv}_i\big(\mathrm{DWConv}(X_{in})\big)\Big) \quad (2)$$

$$Out = Att \otimes X_{in} \quad (3)$$

where $X_{in}$ denotes the input features, ⊗ denotes element-wise matrix multiplication, and $Att$ and $Out$ represent the attention map and the output, respectively.
In this network, the depth-wise dilated convolutions in the two branches have kernel sizes of 3 and 7, respectively, which correspond to standard convolutions with kernel sizes of 7 and 19. The dual-branch structure therefore captures long-range relationships at different scales. The output of the 1×1 convolution serves as the attention weight for the input features, providing both spatial and channel adaptability.
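A minimal PyTorch sketch of this decomposition follows; the 5×5 depth-wise kernel and the shared dilation rate of 3 are assumptions chosen so that the two dilated branches match the stated effective kernel sizes of 7 and 19.

```python
import torch
import torch.nn as nn

class BDAM(nn.Module):
    """Bilateral decomposed convolutional attention: a depth-wise conv, two
    parallel depth-wise dilated convs (kernels 3 and 7, dilation 3, i.e.
    effective kernels 7 and 19), and a 1x1 conv producing the attention map."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=3,
                                 dilation=3, groups=channels)
        self.branch7 = nn.Conv2d(channels, channels, 7, padding=9,
                                 dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)  # channel mixing

    def forward(self, x):
        feat = self.dw(x)
        att = self.pw(self.branch3(feat) + self.branch7(feat))  # Att, Eq. (2)
        return att * x                                          # Out, Eq. (3)
```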
2.3. Edge Reconstruction Module
In detection tasks, small and narrow cracks are often lost during repeated downsampling. Crack feature information is closely related to edge information, which contains the fine details of the target. To address this issue, we adopt the Laplacian operator as an edge extractor to filter the image and obtain coarse edge information for further refinement. However, applying the Laplacian operator at every stage increases computational complexity, and setting the boundary threshold too high or too low results in ineffective edge detection. In addition, the Laplacian operator struggles to extract edge information from convolutionally encoded features. After several experiments, we use 1/8-size images as input and set the Laplacian threshold to 40.
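For illustration, a coarse edge map at 1/8 scale with the threshold of 40 could be produced as sketched below; the 3×3 Laplacian kernel and the area interpolation are assumptions.

```python
import cv2

def coarse_edges(image_bgr, threshold=40):
    """Coarse crack edges: Laplacian filtering at 1/8 scale, binarized at 40."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    small = cv2.resize(gray, (w // 8, h // 8), interpolation=cv2.INTER_AREA)
    lap = cv2.convertScaleAbs(cv2.Laplacian(small, cv2.CV_16S, ksize=3))
    _, edges = cv2.threshold(lap, threshold, 255, cv2.THRESH_BINARY)
    return edges
```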
Inspired by small target detection [40], we use an edge reconstruction module (ERM) based on the second-order Taylor finite difference equation to process the rough edge features. The ERM first applies a nonlinear transformation to the shallow edge feature map through two residual blocks, yielding features with less noise and clutter. A soft gate mechanism then performs directed learning on the rough edge results obtained by the Laplacian operator, using the semantic features extracted by the backbone to suppress background noise and focus on the edge information of the target, as shown in Figure 4, where $F_i(x)$ denotes the rough edge features, $F_{i+1}(x)$ the refined edge features, and $S_i(x)$ the high-level semantic features.
The gate convolution learns a soft mask automatically from the data. Guided by this soft mask, the edge reconstruction module extracts accurate crack boundary information from the cluttered rough edge features. It is formulated as equation (4):

$$F_{i+1}(x) = \phi\big(W_f \ast F_i(x)\big) \odot \sigma\big(W_g \ast S_i(x)\big) \quad (4)$$

where σ is the sigmoid function, so the output gating values lie between zero and one, ϕ is the ReLU activation, and $W_f$ and $W_g$ are sequences of convolutional filters.
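A PyTorch sketch of this gating step follows; the 3×3 kernels and the separate gating filters $W_g$ applied to the semantic features are assumptions of this reconstruction.

```python
import torch
import torch.nn as nn

class SoftGate(nn.Module):
    """Soft gating of Eq. (4): edge features pass through filters W_f and a
    ReLU, then are modulated by a sigmoid gate computed from the high-level
    semantic features S_i(x) via gating filters W_g."""
    def __init__(self, edge_ch, sem_ch):
        super().__init__()
        self.w_f = nn.Conv2d(edge_ch, edge_ch, 3, padding=1)
        self.w_g = nn.Conv2d(sem_ch, edge_ch, 3, padding=1)
        self.phi = nn.ReLU(inplace=True)

    def forward(self, edge_feat, sem_feat):
        gate = torch.sigmoid(self.w_g(sem_feat))     # gating values in (0, 1)
        return self.phi(self.w_f(edge_feat)) * gate  # refined edge features
```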
In road surface crack detection, the number of crack pixels is far lower than that of non-crack pixels, which creates a class imbalance problem. Weighted cross-entropy, as noted in Ref. [41], often leads to rough results. To address this issue, we jointly optimize edge learning with binary cross-entropy and an Edge loss [42]. The Edge loss is a general Dice-based edge-aware loss that includes a dice edge loss term for overall contour fitting. The required edge prediction result is defined as equation (5):
$$e_{ij} = \begin{cases} 1, & \lVert g_{ij} \rVert \geq \alpha \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

where $e_{ij}$ and $g_{ij}$ respectively denote the edge prediction result and the gradient information vector at $(i, j)$; $e_{ij}$ is the edge truth value obtained directly from the detail ground truth, and $\alpha$ is a hyperparameter that controls the model's sensitivity to object contours. In our experiments, we found that setting $\alpha$ to 1 achieves an optimum balance between intra-class unification and inter-class discrimination. Boundary refinement is expressed as the dice coefficient maximization problem in equation (6):

$$\max_{\Theta}\ \mathrm{Dice}\big(p_d(\Theta),\, g_d\big),\qquad \mathrm{Dice}(p_d, g_d) = \frac{2\sum_{i}^{H\times W} p_d^{\,i}\, g_d^{\,i}}{\sum_{i}^{H\times W} (p_d^{\,i})^2 + \sum_{i}^{H\times W} (g_d^{\,i})^2} \quad (6)$$

where $g_d \in \mathbb{R}^{H\times W}$ is the true segmentation map, $p_d \in \mathbb{R}^{H\times W}$ is the predicted segmentation map, and $\Theta$ represents the parameters of the segmentation network. To implement SGD in the training process, the final edge loss is constructed as equation (7):

$$L_{edge}(p_d, g_d) = L_{bce}(p_d, g_d) + \big(1 - \mathrm{Dice}(p_d, g_d)\big) \quad (7)$$
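A minimal sketch of this joint objective is given below, with BCE computed on logits and the Dice term following equation (6); the smoothing constant is an assumption.

```python
import torch
import torch.nn.functional as F

def edge_loss(logits, target, eps=1.0):
    """Joint edge loss of Eq. (7): BCE plus a Dice term (Eq. (6)) for
    overall contour fitting. `target` is the binary edge ground truth."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = (prob ** 2).sum(dim=(1, 2, 3)) + (target ** 2).sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)  # 1 - Dice coefficient
    return bce + dice.mean()
```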
2.4. Feature Fusion Module
The proposed network's feature fusion module (FFM) extracts multiple feature responses and fuses information from feature maps at different levels to encode multi-element, multi-scale information. As shown in Figure 5, the edge features are first concatenated with the spatial and semantic features. Average pooling then compresses the feature map into C×H×1 and C×1×W strips along the X and Y coordinates, respectively. The resulting maps are split into two separate tensors along the spatial dimension, and a sigmoid generates an attention vector that guides the feature response of the spatial branch. This encoding integrates low-level feature maps carrying spatial information, edge reconstruction feature maps carrying edge information, and high-level feature maps with large receptive fields.
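The coordinate-guided fusion can be sketched as follows, assuming the three inputs have been resized to a common resolution; the channel widths and 1×1 projections are illustrative.

```python
import torch
import torch.nn as nn

class CoordinateFFM(nn.Module):
    """Coordinate-guided fusion: concatenate edge, spatial, and semantic
    features, pool them to CxHx1 and Cx1xW strips, and use sigmoid attention
    vectors to guide the fused feature response."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch, out_ch, 1)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # -> C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # -> C x 1 x W
        self.att_h = nn.Conv2d(out_ch, out_ch, 1)
        self.att_w = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, edge, spatial, semantic):
        x = self.fuse(torch.cat([edge, spatial, semantic], dim=1))
        a_h = torch.sigmoid(self.att_h(self.pool_h(x)))  # attention along H
        a_w = torch.sigmoid(self.att_w(self.pool_w(x)))  # attention along W
        return x * a_h * a_w  # broadcast back to C x H x W
```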
2.5. Crack Information Quantification
Crack length plays a crucial role in road safety prediction, as longer cracks indicate more severe road damage. The segmentation network generates a crack prediction map by classifying cracks at the pixel level. We extract the correct crack skeleton by thinning the prediction with the Zhang-Suen algorithm [43] (Figure 6(b)) and then eliminating the numerous erroneous branches through connected-domain analysis; the result is shown in Figure 6(c). The connected-domain debranching step extracts the crack trunk effectively. Finally, the adjacent pixels in the crack skeleton and the distances between them are accumulated pixel by pixel, and the maximum accumulated value is taken as the crack length.
Crack width is equally important for road damage detection. Based on the distance transform method (DTM), the distance between the crack skeleton and the crack edge is calculated to obtain the maximum width. As shown in Figure 7(a), the wider the crack region, the greater the gray value; Figure 7(b) shows the crack skeleton weighted with the DTM values, from which the maximum width is obtained.
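A simplified sketch of both measurements is given below, with scikit-image's `skeletonize` standing in for the Zhang-Suen thinning plus connected-domain debranching, and SciPy's Euclidean distance transform standing in for the DTM step; it returns pixel units, as in Tables 5 and 6.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

def quantify_crack(mask):
    """Estimate crack length and maximum width (in pixels) from a binary
    prediction map: skeletonize, accumulate distances between adjacent
    skeleton pixels, and weight the skeleton with the distance transform."""
    skeleton = skeletonize(mask.astype(bool))
    ys, xs = np.nonzero(skeleton)
    coords = set(zip(ys.tolist(), xs.tolist()))
    length = 0.0
    for y, x in coords:  # count each adjacent pair of skeleton pixels once
        for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
            if (y + dy, x + dx) in coords:
                length += np.hypot(dy, dx)
    # the distance transform gives the half-width at each skeleton pixel
    dist = ndimage.distance_transform_edt(mask.astype(bool))
    max_width = 2.0 * dist[skeleton].max() if skeleton.any() else 0.0
    return length, max_width
```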
3. Experiment and Results
This section details the datasets used in the experiments, the training details of the proposed algorithm, the evaluation criteria, and the experimental results.
3.1. Dataset
Road crack data captured by UAV are difficult to obtain. However, because UAV cameras offer sufficiently high resolution and a viewing angle similar to that of mobile phones, images collected by the two platforms can be regarded as having similar clarity and imaging geometry. We therefore use the public road crack datasets Crack500 and DeepCrack, both collected with mobile phones, as training data. In addition, we captured road crack images with a DJI UAV and built a small dataset to test network generalization. The datasets are described below.
The UAV dataset was collected for the generalization test. The images were captured with a DJI M300 RTK drone equipped with a ZENMUSE H20 camera at a resolution of 5184×3888 pixels. Because a single image is too large, we performed semantic annotation with LabelMe [44] and then cropped the images to 512×512. With data augmentation operations such as image flipping, we built a generalization dataset containing 4692 UAV aerial images of road cracks. The data were collected on the campus of Nanjing University of Aeronautics and Astronautics in Nanjing, Jiangsu Province, China. The annotated dataset covers both cement and asphalt road surfaces and various crack types, including net-shaped, longitudinal, and transverse cracks.
The CRACK500 dataset [45] consists of 500 road crack images. In this experiment, each original image was divided into 16 non-overlapping 640×352 images, and images containing more than 1000 crack pixels were kept. The training set comprises 1896 images, the validation set 348 images, and the test set 1124 images.
The DeepCrack dataset [18] consists of 537 road crack images of size 544×384, each with a pixel-level binary label. In our experiments, the dataset was divided into 300 training images and 237 validation images.
3.2. Implementation Details
All models in the experiments were implemented with the PyTorch framework on a single NVIDIA RTX 3090 GPU. We trained ERNet with SGD [46], a batch size of 8, and 100 training epochs, applying the "poly" learning rate strategy in which the initial rate is multiplied by equation (8):

$$\left(1 - \frac{iter}{max\_iter}\right)^{power} \quad (8)$$

where $iter$ is the current iteration, $max\_iter$ is the maximum number of iterations, and $power$ controls the shape of the decay curve. The initial learning rate was set to 0.01 and the power to 0.9.
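Equation (8) corresponds to the following multiplier, shown here wired into PyTorch's LambdaLR for illustration.

```python
from torch.optim.lr_scheduler import LambdaLR

def poly_multiplier(cur_iter, max_iter, power=0.9):
    """Factor applied to the initial learning rate, equation (8)."""
    return (1 - cur_iter / max_iter) ** power

# usage sketch: base lr 0.01 decays toward 0 over max_iter iterations
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# scheduler = LambdaLR(optimizer, lambda it: poly_multiplier(it, max_iter))
# call scheduler.step() once per iteration
```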
3.3. Comparative Experiment
We compared our ERNet with three lightweight semantic segmentation networks (BiSeNet [22], STDC2-seg [23], and PIDNet [47]) and three crack detection networks (DeepCrackNet [18], CT-CrackSeg [27], and LinkCrack [24]) under the same implementation details and platform.
The accuracy metrics used in this experiment are Intersection over Union (IoU), Precision (Pr), Recall (Re), F1 score (F1), and accuracy (Acc). We also calculated the average frames per second (FPS) of network inference on the validation set. The metrics are defined in equations (9)-(13):

$$IoU = \frac{N_{TP}}{N_{TP} + N_{FP} + N_{FN}} \quad (9)$$

$$Pr = \frac{N_{TP}}{N_{TP} + N_{FP}} \quad (10)$$

$$Re = \frac{N_{TP}}{N_{TP} + N_{FN}} \quad (11)$$

$$F1 = \frac{2 \times Pr \times Re}{Pr + Re} \quad (12)$$

$$Acc = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}} \quad (13)$$

where $N_{TP}$ is the number of positive samples classified as positive, $N_{TN}$ the number of negative samples classified as negative, $N_{FP}$ the number of negative samples classified as positive, and $N_{FN}$ the number of positive samples classified as negative.
Precision and recall evaluate the detection ability of a method from different perspectives, and the F1 score combines the two. IoU better reflects the local details of the detection results, and mean IoU (mIoU) is the average of the road and crack IoU. Acc represents the proportion of correctly classified pixels ($N_{TP}+N_{TN}$) relative to all pixels. Together, these indicators evaluate the detection performance of a network objectively; their values range from 0 to 1, and values closer to 1 indicate better segmentation of crack areas. The validation data are used to select the optimal training iteration.
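For reference, the metrics in equations (9)-(13) can be computed from binary masks as sketched below (zero-division guards omitted for brevity).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-level metrics of Eqs. (9)-(13) for a binary crack mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    iou_crack = tp / (tp + fp + fn)
    iou_road = tn / (tn + fp + fn)      # IoU of the road (background) class
    acc = (tp + tn) / (tp + tn + fp + fn)
    return dict(IoU=iou_crack, mIoU=(iou_crack + iou_road) / 2,
                Pr=pr, Re=re, F1=f1, Acc=acc)
```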
Table 1 presents the comprehensive results on the Crack500 validation set, with the best-performing values highlighted in bold. Our model achieves the highest IoU, Re, Acc, mIoU, and F1 score, and it is the fastest of the four crack detection networks. Although its frame rate is 0.5 FPS lower than that of STDC2-seg, it achieves a 2.86% higher crack IoU and a 3.98% higher F1, an acceptable trade-off. Despite a lower precision than PIDNet, our model's higher F1 score demonstrates a superior ability to distinguish cracks from the background.
Figure 8 presents the segmentation results for several example images. The first row shows that our model's result has fewer breakpoints and is closer to the original image than those of the other networks, a pattern that recurs throughout the experiments on both datasets. The results for the mottled-background crack images in the second row show that our model exhibits fewer missed detections, clearer edge details, and smoother boundaries than the other networks. In the third and fourth rows, all the other networks produce false detections in the shaded areas. The fifth row demonstrates our model's ability to recognize detailed information within the cracks, which the other networks fail to capture.
On the DeepCrack dataset, our model achieved the best IoU, recall, and F1 scores. Although ERNet's precision is 0.73% lower than PIDNet's, its recall is 3.28% higher, so its F1 is 1.46% higher than PIDNet's. Conversely, although its recall is 1.15% lower than LinkCrack's, its precision is 7.05% higher, so its F1 is 2.87% higher than LinkCrack's. In speed, our model is second only to STDC2-seg; although its Acc is 0.16% lower than STDC2-seg's, it leads by 7.31% in IoU and 5.07% in F1, demonstrating robust stability across datasets. The detailed comparison results are presented in Table 2.
Figure 9 displays the segmentation results for several examples with increased interference. For the blurry images in the first and second rows, the small cracks extracted by our model are more coherent and retain more detail. The images in the third and fourth rows contain complex interference from shadows, lighting, and widespread net-shaped cracks, which degrades the results of all the models; BiSeNet, STDC2-seg, and CT-CrackSeg show extensive false detections, whereas our model's false detection range is smaller. The fifth row shows that our model preserves the edge details of the crack well while reducing the number of breakpoints along the zigzag crack.
3.4. Generalization Ability Experiment
In this section, we use the best weights each network obtained on the Crack500 training set to predict on the UAV crack dataset, testing the generalization ability of each network; this experiment reflects each network's capability in a realistic detection task. The results are presented in Table 3. Our inference results outperform the other models in IoU, mIoU, Precision, F1, and Acc, while all the other networks show a substantial decline across their performance indicators. The mIoU of our results is only 2.23% lower than on the Crack500 dataset, demonstrating that the proposed model generalizes well.
Figure 10 illustrates the segmentation results of the different networks on UAV remote sensing images. The detections of DeepCrackNet and LinkCrack are scattered and lack clear boundaries; DeepCrackNet also produces many false detections in background areas far from the cracks, indicating that its ability to extract global semantic information leaves room for improvement. CT-CrackSeg performs best after ERNet, but it still mistakes pavement markings for cracks and misses many small cracks. Specifically, the second row compares the segmentation of shallow cracks: our model identifies the main body of the shallow cracks and correctly handles the pavement markings. The third row contains both longitudinal and transverse cracks; the other networks miss the transverse cracks, while our model segments both types relatively completely. The fourth and fifth rows depict the same area from different angles, demonstrating that our model not only approximates the ground truth but also produces highly consistent results for the same area in the two images. In conclusion, the experiments demonstrate that our proposed model generalizes well and performs excellently on remote sensing road images.
3.5. Ablation Experiment Results
In this section, we performed six experiments on the Crack500 dataset using the STDC backbone, incorporating BDAM, ERM, and FFM sequentially.
Table 4 presents the contributions of each module and their combinations.
BDAM: In the second experiment, the BDAM enhanced multiscale receptive field information, resulting in a 0.71% improvement in IoU compared to the original. This demonstrates the ability of BDAM to capture global context information.
ERM: In the third experiment, the side input module's enhanced edge positioning led to a 1.66% improvement in crack IoU and a 2.62% increase in recall. This demonstrates that the side input module not only increases the accuracy of edge pixel segmentation but also enhances the overall segmentation accuracy for the category.
Figure 11 shows the segmentation results before and after ERM is added, which demonstrates that ERM can effectively improve the segmentation accuracy of crack boundaries.
FFM: In the fourth experiment, FFM was employed to replace the default feature concatenation operation, resulting in a 1.13% improvement in IoU. This suggests that the FFM-based coordinate feature guidance efficiently encodes multi-element and multiscale information.
3.6. Crack Information Quantification Experiment Results
We simulated the actual detection process: images from the Crack500 dataset were fed into ERNet to obtain prediction maps, and the proposed crack information quantification algorithm computed crack length and width from those maps. Two experiments were designed to verify, respectively, the accuracy of the quantification results and the accuracy of the crack predictions. We selected 10 groups of values computed from the predictions to compare with values computed from the labels, verifying the effectiveness of the proposed quantification algorithm, and another 10 groups to compare with measurements taken on the original RGB images, verifying the accuracy of the crack predictions. To assess the universality of the results, we report the average absolute error and the average relative error. The detailed results are presented in Table 5 and Table 6.
The crack lengths in the comparison range from 150 to 697 pixels and the widths from 24 to 80 pixels. The experimental results show average relative errors of the predicted length and width of 7.51% and 6.69% against the labeled cracks, and 5.85% and 3.92% against the original images. Because the measurements on the original images were taken manually, errors due to human factors and instrument limitations are inevitable.
Figure 1. The overall structure of the proposed network for crack detection.
Figure 2. Illustration of the STDC module. (a) STDC module with a stride of 1. (b) STDC module with a stride of 2.
Figure 3. Illustration of the bilateral decomposed convolutional attention module (BDAM). The BDAM refines the corresponding combined features in the decoding stage: depth-wise separable and dilated convolutions capture the global content, and an attention vector provides guidance.
Figure 4.
Illustration of the edge reconstruction module.
Figure 5. Illustration of the coordinate-based feature fusion module.
Figure 6. Diagram of crack skeleton extraction. (a) Crack prediction map. (b) Result of the Zhang-Suen thinning algorithm. (c) Extracted crack backbone.
Figure 7. Crack width extraction based on the distance transform method (DTM). (a) DTM values of the crack prediction map. (b) Crack skeleton weighted with DTM values.
Figure 8. Visualization of the segmentation results of the compared methods on CRACK500.
Figure 9. Visualization of the segmentation results of the compared methods on DeepCrack.
Figure 10. Visualization of the segmentation results of the compared methods on the UAV remote sensing dataset.
Figure 11.
The visualization of edge reconstruction results.
Table 1.
Comparison of the experimental results of different semantic segmentation networks on the CRACK500 dataset.
| Method | IoU (%) | mIoU (%) | Pr (%) | Re (%) | F1 (%) | Acc (%) | FPS |
|---|---|---|---|---|---|---|---|
| BiSeNet | 62.83 | 79.99 | 73.31 | 78.68 | 75.90 | 97.87 | 11.86 |
| STDC2-seg | 63.35 | 80.39 | 74.71 | 76.69 | 75.69 | 98.37 | **22.1** |
| PIDNet | 64.28 | 80.88 | **79.87** | 76.71 | 78.26 | 97.59 | 16.35 |
| DeepCrackNet | 55.67 | 76.69 | 66.75 | 77.02 | 71.52 | 96.54 | 3.35 |
| CT-CrackSeg | 62.54 | 80.04 | 60.50 | 78.00 | 73.30 | 97.02 | 3.42 |
| LinkCrack | 57.45 | 77.92 | 72.97 | 72.98 | 72.98 | 96.95 | 11.54 |
| ERNet (ours) | **66.21** | **82.48** | 79.21 | **80.14** | **79.67** | **98.51** | 21.6 |
Table 2.
Comparison of the experimental results of different semantic segmentation networks on the DeepCrack dataset.
| Method | IoU (%) | mIoU (%) | Pr (%) | Re (%) | F1 (%) | Acc (%) | FPS |
|---|---|---|---|---|---|---|---|
| BiSeNet | 69.10 | 83.76 | 81.29 | 79.19 | 80.23 | 98.97 | 13.1 |
| STDC2-seg | 66.40 | 82.33 | 84.52 | 75.56 | 79.79 | 99.40 | 30.2 |
| PIDNet | 71.54 | 85.06 | 88.52 | 78.85 | 83.40 | 98.64 | 25.47 |
| DeepCrackNet | 67.90 | 83.53 | 81.21 | 80.56 | 80.88 | 98.35 | 3.40 |
| CT-CrackSeg | 64.69 | 81.79 | 76.55 | 80.68 | 78.56 | 98.09 | 3.49 |
| LinkCrack | 69.48 | 84.29 | 80.74 | 83.28 | 81.99 | 98.42 | 14.07 |
| ERNet (ours) | 73.71 | 86.60 | 87.79 | 82.13 | 84.86 | 99.24 | 26.2 |
Table 3.
Comparison of the experimental results of different semantic segmentation networks on UAV remote sensing dataset.
| Method | IoU (%) | mIoU (%) | Pr (%) | Re (%) | F1 (%) | Acc (%) |
|---|---|---|---|---|---|---|
| BiSeNet | 41.87 | 69.73 | 58.71 | 59.36 | 59.03 | 97.62 |
| STDC2-seg | 41.71 | 69.66 | 59.40 | 58.34 | 58.87 | 97.65 |
| PIDNet | 35.15 | 66.14 | 51.12 | 52.94 | 52.01 | 97.18 |
| DeepCrackNet | 21.25 | 57.55 | 23.95 | 65.32 | 35.05 | 93.02 |
| CT-CrackSeg | 48.13 | 73.45 | 62.09 | 68.16 | 64.98 | 97.88 |
| LinkCrack | 26.91 | 63.21 | 65.26 | 31.42 | 42.41 | 97.54 |
| ERNet (ours) | 61.56 | 80.25 | 69.91 | 83.75 | 76.21 | 98.36 |
Table 4.
The impact of BDAM, ERM, and FFM on network performance.
| Configuration | IoU (%) | mIoU (%) | Re (%) | Acc (%) |
|---|---|---|---|---|
| Original | 62.71 | 80.02 | 76.03 | 97.48 |
| Original+BDAM | 63.42 | 80.41 | 76.36 | 97.51 |
| Original+ERM | 63.72 | 80.54 | 78.56 | 97.47 |
| Original+BDAM+ERM | 65.08 | 81.28 | 78.98 | 97.53 |
| Original+ERM+FFM | 63.79 | 80.57 | 79.06 | 97.47 |
| Original+BDAM+ERM+FFM | 66.21 | 82.48 | 80.14 | 98.51 |
Table 5. Comparison of crack parameters calculated from the ground-truth labels and from the predictions.
| No. | Label length (px) | Predicted length (px) | Abs. error (px) | Rel. error (%) | Label width (px) | Predicted width (px) | Abs. error (px) | Rel. error (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 640 | 691 | 51 | 7.97 | 38 | 42 | 4 | 10.53 |
| 2 | 409 | 441 | 32 | 7.82 | 34 | 37 | 3 | 8.82 |
| 3 | 638 | 651 | 13 | 2.04 | 64 | 66 | 2 | 3.13 |
| 4 | 339 | 368 | 29 | 8.55 | 24 | 26 | 2 | 8.33 |
| 5 | 238 | 272 | 34 | 14.29 | 80 | 80 | 0 | 0.00 |
| 6 | 150 | 151 | 1 | 0.67 | 37 | 31 | 6 | 16.22 |
| 7 | 286 | 324 | 38 | 13.29 | 46 | 49 | 3 | 6.52 |
| 8 | 158 | 175 | 17 | 10.76 | 26 | 25 | 1 | 3.85 |
| 9 | 247 | 264 | 17 | 6.88 | 44 | 42 | 2 | 4.55 |
| 10 | 355 | 365 | 10 | 2.82 | 40 | 42 | 2 | 5.00 |
| Average | | | 24.2 | 7.51 | | | 2.5 | 6.69 |
Table 6. Comparison of crack parameters measured on the original images and calculated from the predictions.
| No. | Measured length (px) | Predicted length (px) | Abs. error (px) | Rel. error (%) | Measured width (px) | Predicted width (px) | Abs. error (px) | Rel. error (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 341 | 323 | 18 | 5.28 | 27 | 26 | 1 | 3.70 |
| 2 | 697 | 691 | 6 | 0.86 | 44 | 39 | 5 | 11.36 |
| 3 | 405 | 441 | 36 | 8.89 | 24 | 24 | 0 | 0.00 |
| 4 | 638 | 651 | 13 | 2.04 | 27 | 26 | 1 | 3.70 |
| 5 | 623 | 721 | 98 | 15.73 | 30 | 31 | 1 | 3.33 |
| 6 | 166 | 151 | 15 | 9.04 | 80 | 80 | 0 | 0.00 |
| 7 | 522 | 500 | 22 | 4.21 | 60 | 56 | 4 | 6.67 |
| 8 | 215 | 208 | 7 | 3.26 | 50 | 51 | 1 | 2.00 |
| 9 | 375 | 351 | 24 | 6.40 | 26 | 25 | 1 | 3.85 |
| 10 | 355 | 365 | 10 | 2.82 | 44 | 42 | 2 | 4.55 |
| Average | | | 24.9 | 5.85 | | | 1.6 | 3.92 |