To present the semantic segmentation results of the developed scenarios more clearly, this section compares them using the RandLaNet algorithm. The performances of the proposed prior-level fusion scenarios were evaluated on the SensatUrban dataset, and the results were compared with those of the reference approach. The efficiency of the developed scenarios was compared using the main evaluation metrics, namely F1-score, Recall, Precision, Intersection over Union (IoU), and the confusion matrix. In addition, a qualitative evaluation was carried out based on a visual analysis of the predicted (synthetic) and observed (actual) data. This section presents the results in detail, including a comparative analysis with the reference approach as well as between the various developed scenarios.
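For reference, all of the per-class metrics used here can be derived from a single confusion matrix. The following minimal sketch is an illustration rather than the exact evaluation code used in this work; it assumes a matrix whose rows are ground-truth classes and whose columns are predicted classes:

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, Recall, F1, and IoU per class from a confusion matrix
    whose rows are ground-truth classes and columns are predictions."""
    cm = cm.astype(float)
    tp = np.diag(cm)               # points correctly assigned to each class
    fp = cm.sum(axis=0) - tp       # points wrongly predicted as the class
    fn = cm.sum(axis=1) - tp       # points of the class that were missed
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / np.maximum(tp + fn, 1e-9)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-9)
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return precision, recall, f1, iou
```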
4.2.1. Quantitative Assessments
In this subsection, we evaluate scenarios S1, S2, and S3 against the reference approach using the test set data. The comparisons are reported in Table 3. Since several scenarios were evaluated in this work, the same data splits were used for the RandLaNet algorithm's training, validation, and testing to ensure a fair and consistent evaluation. Four urban scenes (4 test sets) were used to evaluate the pre-trained models and did not contribute to the training process. All developed scenarios outperform the reference approach in all evaluation metrics. A comparison of the three developed scenarios shows that scenario S1 surpasses S2 and S3 in all indicators. The experimental results show that the first proposed fusion scenario (S1) delivers the best performance, manifested mainly in the highest IoU and precision in the semantic segmentation results. For example, in scene 1, the Intersection over Union of S1, S2, S3, and the reference approach was 80%, 77%, 75%, and 63%, respectively.
Table 3 presents the semantic segmentation accuracies obtained for the SensatUrban dataset using the different scenarios and the reference approach.
The fusion scenarios proposed in this paper show a significant improvement in all indicators compared to the reference approach. The first developed scenario (S1) has clear advantages, although the difference between it and S2 is relatively small. From the results of each metric, we can see that the S1 scenario achieved 88/80%, 94/88%, 88/79%, and 84/68% semantic segmentation Precision/IoU in the four urban scenes. Compared to the reference approach, the S1 scenario, which integrates classified images, RGB information, and point clouds, increases the semantic segmentation IoU of each scene by 17%, 13%, 12%, and 18%, respectively. The S2 scenario, which integrates geometric features, optical images, and point clouds, increases the IoU of each scene by 14%, 11%, 10%, and 17%, respectively. The S3 scenario, which integrates classified geometric information, optical images, and point clouds, increases the IoU of each scene by 12%, 10%, 9%, and 7%, respectively. The lower precision obtained by the reference approach can be explained by the lack of prior information from the images or from the point clouds (geometric features), which could provide useful knowledge about the urban space. It is therefore difficult to accurately segment diverse urban objects through the direct fusion of LiDAR and optical image data. On the other hand, the S1 scenario has advantages over both the scenario with geometric features (S2) and the one with classified geometric information (S3). The results obtained by S1 indicate that integrating prior knowledge from images (image classification) improves 3D semantic segmentation; for example, it raised the semantic segmentation precision to around 94% in scene 2. Moreover, with the help of prior knowledge from classified images, S1 achieves an increase of about 17% in overall precision compared to the reference approach. The higher accuracies obtained by S1 can be attributed to the complementary semantic information from the optical images: using classified images and spectral data together with the point clouds (XYZ coordinates) improves accuracy and facilitates the distinction of diverse objects. Therefore, considering the different evaluation metrics, we can conclude that the overall performance of S1 makes it a promising scenario.
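As a rough illustration of how S1-style inputs could be assembled, the sketch below concatenates, for each point, its XYZ coordinates, the RGB values sampled from the orthoimage, and a one-hot encoding of the prior class label obtained by projecting the classified image onto the point cloud; the projection step and the one-hot encoding are assumptions of this sketch, not a description of the exact pipeline:

```python
import numpy as np

def build_s1_features(xyz, rgb, prior_label, n_image_classes):
    """Assemble per-point features for an S1-style prior-level fusion:
    XYZ coordinates + normalised RGB + one-hot prior label from the
    classified image (projection of the 2D classification not shown)."""
    rgb = rgb.astype(np.float32) / 255.0                            # colours in [0, 1]
    prior = np.eye(n_image_classes, dtype=np.float32)[prior_label]  # one-hot prior channel
    return np.concatenate([xyz.astype(np.float32), rgb, prior], axis=1)
```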
The results obtained with the different developed scenarios have been studied in detail by computing a percentage-based confusion matrix against the ground truth data. "This percentage-based analysis provides an idea about the percentage of consistent and non-consistent points" (Ballouch et al., 2022b). The percentage-based confusion matrices obtained by all scenarios for scene 1 are depicted in Figure 12. The corresponding confusion matrices for the other urban scenes (2, 3, and 4) are available in Appendix A.
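A percentage-based (row-normalised) confusion matrix of this kind can be obtained as in the following sketch, where each row sums to 100% and the diagonal gives the percentage of consistent points per ground-truth class (a minimal illustration, not the exact code used here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def percentage_confusion_matrix(y_true, y_pred, labels):
    """Row-normalised confusion matrix in percent: the diagonal holds the
    percentage of consistent points for each ground-truth class."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.maximum(row_sums, 1.0)
```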
The confusion matrices show that the developed scenarios bring a real improvement over the reference approach. This confirms that the direct fusion of images and point clouds is not sufficient for the semantic segmentation of complex urban scenes.
The obtained results allow general inferences to be made regarding the selection of a smart fusion approach for DL-based point cloud semantic segmentation. The detailed results for each class or group of classes are presented below:
Firstly, the ground and high vegetation classes were successfully extracted in all scenes by all the evaluated approaches. This is due to their geometric and spectral characteristics, which are easy to recognize and make them easily distinguishable from other classes. This means that the point clouds and optical images fused in the reference approach are already sufficient to correctly segment these two classes.
Secondly, the building class is extracted accurately by the first proposed scenario (S1), although the difference between it and the other developed scenarios is relatively small. However, despite this good performance, a slight confusion was observed between this class and the street furniture object.
Thirdly, by observing the four scenes, we can see that the S1 scenario performs well on point cloud scenes containing rail, traffic roads, street furniture, footpath, and parking objects. These five semantic classes were extracted precisely by this scenario, except for the footpath class, whose precision was low. Moreover, the percentage of consistent points obtained by S1 surpassed all other developed scenarios and the reference approach; the reference approach failed to label these classes. The two other scenarios, S2 and S3, also segmented these classes (except the footpath class, which was not detected) but with lower precision than S1. Since these geo-objects have approximately the same geometric features, the scenarios based on geometric features (S2) or on already classified geometric information (S3) failed to extract them as precisely as scenario S1, even with the addition of optical images. For example, in scene 4, the S1 scenario increases the percentage of consistent points of each class by 12% (parking), 2% (rail), 7% (traffic roads), 13% (street furniture), and 7% (footpath), respectively, compared to the S2 scenario, and by 2% (parking), 12% (rail), 47% (traffic roads), 8% (street furniture), and 9% (footpath), respectively, compared to the S3 scenario. The positive effect of prior knowledge from images (image classification) on the percentage of consistency is thus seen clearly. We can conclude that the percentage of consistent points increases when classified images and spectral information are used together with point clouds (scenario S1); the semantic knowledge from optical images is very useful for separating these complex classes. However, these semantic classes are often confused with others that have similar characteristics: the parking class with the ground and traffic roads classes; the rail class with the street furniture and water objects; and the traffic roads class with the ground and parking geo-objects, as well as with the bridge class in scene 4.
Fourthly, by observing the four scenes, we can see that the S2 scenario performs well on point cloud scenes containing cars, walls, and bridge objects. These three classes are successfully extracted by S2, and the results obtained for them indicate that S2 generally performed better than the other scenarios. Taking scene 4 again as an example, the S2 scenario increases the percentage of consistency of each class by 2% (cars), 14% (walls), and 12% (bridge), respectively, compared to the S1 scenario, and by 5% (cars), 4% (walls), and 62% (bridge), respectively, compared to the S3 scenario. Additionally, the S2 scenario increases the percentage of consistency by 22% (cars) and 42% (walls), respectively, compared to the reference approach; the bridge class was not detected at all by the reference approach. The semantic segmentation of the cars, walls, and bridge objects therefore clearly shows the positive effect of geometric features on the percentage of consistency. The percentages of consistent points increased in all urban scenes where optical images, point clouds, and 3D geometric features were used together. The results show that the benefit of adding suitable geometric features to the point clouds is more visible in these classes than in the others. Since these classes have distinct geometric characteristics, the addition of planarity and verticality as point cloud attributes in S2 facilitated their distinction. On the other hand, since these three classes have almost similar radiometric characteristics, S1 did not achieve the same accuracies as S2, although its results remain generally good. Considering that the precision obtained by the reference approach was only moderate for the cars and wall objects and insufficient for the bridge class, we can conclude that the addition of optical images to point clouds was not sufficient for the semantic segmentation of these three objects. However, these semantic classes are often confused with other objects that have similar characteristics: the cars class with street furniture in scenes 1 and 4; the wall class with street furniture, as well as slightly with the building class (scene 4) and the ground class (scene 1); and the bridge class with the building class in scene 4.
Consequently, it was concluded that the combination of geometric features, optical images, and point clouds (XYZ coordinates) is important for the semantic segmentation of the cars, wall, and bridge classes. Moreover, effective geometric features provide a real advantage and are very useful for the semantic segmentation of these irregular classes.
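For illustration, planarity and verticality are commonly derived from the eigen-decomposition of each point's local covariance matrix; the sketch below uses one common definition over the k nearest neighbours (the exact feature computation used in S2 may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def planarity_verticality(xyz, k=20):
    """Per-point planarity and verticality from the eigen-decomposition of
    the covariance matrix of the k nearest neighbours (a common definition)."""
    tree = cKDTree(xyz)
    _, idx = tree.query(xyz, k=k)
    planarity = np.zeros(len(xyz))
    verticality = np.zeros(len(xyz))
    for i, nn in enumerate(idx):
        cov = np.cov(xyz[nn].T)                    # 3 x 3 local covariance
        eigval, eigvec = np.linalg.eigh(cov)       # eigenvalues in ascending order
        l3, l2, l1 = eigval                        # so that l1 >= l2 >= l3
        planarity[i] = (l2 - l3) / max(l1, 1e-9)   # high for planar surfaces
        normal = eigvec[:, 0]                      # eigenvector of smallest eigenvalue
        verticality[i] = 1.0 - abs(normal[2])      # close to 1 for vertical surfaces (walls)
    return planarity, verticality
```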
Fifthly, concerning the water class, the only scenario that detects it precisely is S1 (see the confusion matrix results). The water class is confused with the wall object in the S2 scenario; that is, the selected verticality and planarity features are not enough to distinguish the water class, and we can conclude that XYZ coordinates, optical images, and geometric properties are not sufficient for the semantic segmentation of water. In addition, water is confused with the ground object in the S3 scenario. On the other hand, the classified images and RGB information merged with the point clouds (the S1 scenario) showed a great performance in the semantic segmentation of the water class.
Therefore, we can conclude that the use of prior knowledge from optical images provides an advantage in the semantic segmentation of the water class, whereas it remains difficult to detect water objects with the attributes integrated into scenarios S2 and S3.
Finally, the bike class is not detected by any scenario, due to the very low percentage of bike samples in the dataset.
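This imbalance is easy to verify by reporting the share of points per class, as in the short sketch below (the label array and class names are hypothetical placeholders):

```python
import numpy as np

def class_distribution(labels, class_names):
    """Share of points per class in percent; very rare classes such as
    'bike' are unlikely to be learned reliably by the network."""
    counts = np.bincount(labels, minlength=len(class_names))
    shares = 100.0 * counts / counts.sum()
    return dict(zip(class_names, shares))
```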
4.2.2. Qualitative Assessments
In addition to the quantitative evaluation, a qualitative analysis was carried out by visualizing the 3D point clouds to examine the semantic segmentation results on the test set in detail.
Figure 13 presents a visual comparison of the predicted results obtained by the proposed scenarios and their corresponding ground truths. To show the semantic segmentation effect of all scenarios more intuitively, Figure 13 and Figure 14 show the results of the different scenarios on different point cloud scenes. It can be seen from the figures that the semantic segmentation result of the S1 scenario is the closest to the ground truth, and its results are more accurate and coherent than those of the other scenarios. Additionally, the semantic segmentation results show that the 3D urban scene was better segmented with S1, where all classes were extracted precisely with clear boundaries.
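Such visual comparisons can be reproduced by colouring each point according to its class label; the sketch below assumes the Open3D library and a hypothetical colour palette, and renders one labelled cloud at a time in an interactive viewer:

```python
import numpy as np
import open3d as o3d

def show_labels(xyz, labels, palette):
    """Colour each point by its class label and open an interactive viewer."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(xyz)
    pcd.colors = o3d.utility.Vector3dVector(palette[labels])  # one RGB triplet in [0, 1] per class
    o3d.visualization.draw_geometries([pcd])

# Example usage (ground truth, then a scenario's prediction):
# palette = np.random.rand(n_classes, 3)
# show_labels(xyz, gt_labels, palette)
# show_labels(xyz, pred_labels_s1, palette)
```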
The qualitative results of each class are further explained in the following paragraphs:
The semantic segmentation results indicate that the ground and high vegetation classes have been effectively segmented by all the approaches in general. Visual investigations show that the ground class is confused with building objects in all scenarios. This confusion is caused by their similar spectral information, but the overall semantic segmentation remains generally good.
Scene 4 (see the confusion matrix results in Appendix A) is a perfect patch to demonstrate mixed semantic classes with a diversity of urban objects; we are interested here in the rail, traffic roads, street furniture, and parking classes. We can observe that these classes are hard to recognize with the reference approach, which failed to label them. Furthermore, as observed in the quantitative results, S1 shows better performance on these classes, producing very few mis-segmented points compared to the other scenarios. Thanks to the prior knowledge from optical images in scenario S1, its errors were lower than those delivered by the other scenarios for these semantic classes. In the case of S2, S3, and the reference approach, several parking class points have been mis-segmented as ground, due to the similarity of their geometric and radiometric properties. Moreover, these three scenarios mis-classified the traffic roads as the ground class. The street furniture shares a similar color with the buildings and walls in this dataset: in fact, as shown in Figure 13, part of the street furniture is labeled as building in the semantic segmentation results of S2, S3, and the reference approach. Finally, the rail object was not detected by the reference approach, and S2 and S3 mis-classified it as water and street furniture (see the confusion matrix results in Appendix A).
Concerning the building class, the visual evaluation has shown that the developed scenarios extracted this object more correctly than the reference approach; in general, they improved the semantic segmentation results for buildings. In the case of the reference scenario, we observe a slight confusion between the building class and the ground and high vegetation classes. Moreover, the errors of S1 were slightly lower than those of S2 and S3 for the building class.
Visually, we can observe in Figure 13 and Figure 14 that the footpath object is hard to recognize. S2, S3, and the reference scenario failed to label the footpath correctly, while S1 achieved an acceptable performance on it (scene 2).
Concerning the cars, wall, and bridge objects, thanks to the suitable geometric features computed from the point clouds in S2, its errors were lower than those delivered by the other scenarios. The results indicate that the bridge class is labeled as building by the reference approach, and that part of this class is also labeled as building in the segmentation results of S1 and S3. Moreover, as shown in Figure 13, various car points have been mis-segmented as street furniture, especially in scene 4 (see the confusion matrix results). Additionally, the wall class is confused with several classes, mainly the street furniture and building geo-objects.
According to the visual comparison, the semantic segmentation based on classified images (S1), geometric features (S2), and classified geometric information (S3) demonstrates a real complementary effect compared to the reference approach. The visual results indicate that the developed scenario S1 generally performed better than S2 and S3. In particular, the S2 scenario improves the semantic segmentation results of some classes (wall, cars, bridge) more than the other scenarios.