3.2. Comparative analysis
As addressed in related works [3,35], the authors suggest running each benchmark sequence five times to create cumulative-error plots and to account for the non-deterministic nature of each system [21]. Nevertheless, authors like [47,51] performed their experimental comparisons by running each sequence ten times in forward and backward playback directions to better capture the probabilistic behavior of the algorithms against variations such as illumination and dynamic objects. Given the large variety of algorithms we tested, we applied this extended approach. In total, we performed ten runs of each of the 50 sequences in both forward and backward modalities, gathering a total of 1000 runs per method; for the ten evaluated algorithms, this produced a database of 10000 trajectory files saved in .txt format, which were processed using the MATLAB scripts provided in the official repository of the TUM-Mono benchmark [3]. Each algorithm must output a file in which every pose of the trajectory follows the format of equation 1, so that it matches the format expected by the benchmark evaluation scripts. However, some algorithms, such as SVO, CNN-SVO, and DSM, output rotation and translation matrices instead of quaternions, so we modified them to produce the correct output format using the rotation-matrix-to-quaternion method of Sarabandi and Thomas [68], given by equations 2, 3, 4, and 5.
Given the rotation matrix:

$$R=\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$
As reported in [68], the best results, which outperform Shepperd's rotation-to-quaternion method, are achieved for a particular value of the method's threshold parameter, so we set this value to build the trajectory files for those methods that did not match the evaluation format.
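To make the conversion step concrete, the following minimal Python sketch (not the authors' MATLAB tooling) turns a pose given as a rotation matrix R and a translation vector t into a quaternion-based trajectory line of the form "timestamp tx ty tz qx qy qz qw", the layout commonly expected by the TUM evaluation tools; SciPy's standard rotation-to-quaternion conversion stands in here for equations 2-5, and the timestamp and pose values are placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_tum_line(timestamp, R, t):
    """Format one camera pose as a trajectory line:
    'timestamp tx ty tz qx qy qz qw'."""
    # Convert the 3x3 rotation matrix to a unit quaternion (x, y, z, w order).
    qx, qy, qz, qw = Rotation.from_matrix(R).as_quat()
    tx, ty, tz = t
    return (f"{timestamp:.6f} {tx:.6f} {ty:.6f} {tz:.6f} "
            f"{qx:.6f} {qy:.6f} {qz:.6f} {qw:.6f}")

# Placeholder example: identity rotation and zero translation at time 0 s.
print(pose_to_tum_line(0.0, np.eye(3), np.zeros(3)))
```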
In addition, the methods ORB-SLAM2, DF-ORB-SLAM, DynaSLAM, SVO, CNN-SVO, and DSM require a different camera calibration model than the one provided by the benchmark, which includes full photometric data: geometric intrinsic calibration, photometric calibration, and non-parametric vignette calibration. Thus, while the rest of the methods use the ATAN camera model based on the FOV distortion model of [69], provided in the official PTAM repository [70], for these methods we used the ROS calibration package [71] to estimate three radial and two tangential distortion coefficients following the formulation of equations 6 and 7; the results were also tested and compared against the OpenCV camera calibration package [72].
For each undistorted pixel at $(x, y)$ coordinates, its position $(x_d, y_d)$ in the distorted image is:

$$x_d = x\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 x y + p_2\left(r^2 + 2x^2\right)$$

$$y_d = y\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2 p_2 x y$$

where $r$ is the distorted radius, $r^2 = x^2 + y^2$, $k_1$, $k_2$, $k_3$ are the radial distortion coefficients, and $p_1$, $p_2$ are the tangential distortion coefficients.
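As an illustration of how equations 6 and 7 are applied, the sketch below distorts a normalized image point with this three-radial, two-tangential coefficient model (the plumb-bob model used by the ROS and OpenCV calibration tools); the coefficient values are hypothetical, not the ones estimated for the benchmark cameras.

```python
def distort_point(x, y, k1, k2, k3, p1, p2):
    """Map an undistorted normalized point (x, y) to its distorted
    position (x_d, y_d) using three radial and two tangential coefficients."""
    r2 = x * x + y * y                                 # squared radius
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3   # 1 + k1 r^2 + k2 r^4 + k3 r^6
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# Hypothetical coefficients, for illustration only.
print(distort_point(0.1, -0.2, k1=-0.28, k2=0.07, k3=0.0, p1=1e-4, p2=-2e-4))
```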
As suggested in [3], to make a fair comparison based on the drift accumulated between the aligned start and end segments, we disabled loop closure for the SLAM methods ORB-SLAM2, DF-ORB-SLAM, DynaSLAM, LDSO, and DSM.
Figure 4 presents each algorithm's cumulative error plots for the translational, rotational, and scale errors. These graphs depict, for each error type, the number of runs whose error stays below a given x-value. Hence, methods whose curves lie closer to the top-left corner are better, because a larger number of their runs remain below a given error value.
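Such a cumulative curve is obtained by sorting the per-run errors and plotting, for every error value, the number of runs at or below it. The Python sketch below illustrates the construction with made-up error samples; the actual figures were generated with the benchmark's MATLAB scripts.

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_error_curve(errors):
    """Return (sorted error values, number of runs with error <= value)."""
    e = np.sort(np.asarray(errors))
    return e, np.arange(1, len(e) + 1)

# Hypothetical per-run translation errors for two methods (1000 runs each).
rng = np.random.default_rng(0)
runs = {"method A": rng.lognormal(0.0, 0.5, 1000),
        "method B": rng.lognormal(0.4, 0.6, 1000)}

for name, errors in runs.items():
    x, y = cumulative_error_curve(errors)
    plt.step(x, y, where="post", label=name)  # curves nearer the top-left are better
plt.xlabel("translation error")
plt.ylabel("number of runs")
plt.legend()
plt.show()
```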
As can be seen in Figure 4, the sparse direct methods (DSO, LDSO, and DSM) achieved the best overall performance, followed by the sparse indirect method (ORB-SLAM2), the dense indirect method (DF-ORB-SLAM), and the hybrid method (SVO), while the dense direct method (LSD-SLAM) showed the worst performance. The CNN versions of ORB-SLAM2 and SVO showed an important improvement over their classic versions. At the same time, CNN-DSO did not outperform DSO in the accumulated translation, rotation, and scale metrics but remained close to DSO's performance. Finally, it must be mentioned that the large errors observed for the LSD-SLAM, SVO, and CNN-SVO methods can be attributed to severe initialization and relocalization problems that these algorithms exhibited during the evaluations on the TUM-mono dataset.
As mentioned before, the alignment error considers translation, rotation, and scale errors equally. Therefore, it is equivalent to the translational RMSE when aligned to the start and end segments (the first and last 10-20 seconds of each sequence), for which ground truth is available. The cumulative alignment error for each algorithm is presented in Figure 5.
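For reference, the translational RMSE of an estimated segment against its ground truth after a similarity (Sim(3)) alignment can be sketched as follows; this uses the closed-form Umeyama alignment in place of the benchmark's own MATLAB implementation, and the trajectory arrays are placeholders.

```python
import numpy as np

def umeyama_alignment(X, Y):
    """Closed-form similarity transform (s, R, t) minimizing ||s R X + t - Y||,
    for point sets X, Y of shape (3, N) (Umeyama, 1991)."""
    mu_x, mu_y = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    U, D, Vt = np.linalg.svd(Yc @ Xc.T / X.shape[1])
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # handle reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / Xc.var(axis=1).sum()
    t = mu_y - s * R @ mu_x
    return s, R, t

def segment_rmse(estimate, ground_truth):
    """Translational RMSE of the aligned estimated segment (both of shape (3, N))."""
    s, R, t = umeyama_alignment(estimate, ground_truth)
    residual = s * R @ estimate + t - ground_truth
    return np.sqrt((residual ** 2).sum(axis=0).mean())
```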
Figure 5 shows that ORB-SLAM2 and DynaSLAM perform slightly better than the sparse-direct methods in terms of start-segment alignment error. However, on the end-segment, the cumulative drift effect was lower for the sparse-direct methods, which confirms the results observed in Figure 4. In addition, it can be noticed that CNN-DSO performed better than DSO, suggesting that integrating the Single Image Depth Estimation (SIDE) CNN improved DSO's bootstrapping by adding prior depth information, whereas the end-segment performance of both algorithms was similar. On the other hand, the Mask R-CNN added in DynaSLAM to remove moving-object information from the scenes did not improve the algorithm's performance on the start-segment; but, as shown in Figure 5, the benefits of the added CNN can be observed on the end-segment through the reduction of the accumulated drift. Additionally, for the hybrid approaches, the addition of the MonoDepth CNN represented an important contribution that helped to overcome SVO's trajectory-loss issues. Similarly to the results observed in Figure 1, the overall alignment error results suggest that sparse-direct methods performed best, followed by sparse-indirect, dense-indirect, hybrid, and finally dense-direct, for which only around 50 runs reached the error threshold.
As suggested by [3], we examined the dataset motion bias of each algorithm by running each method ten times forwards and ten times backwards and evaluating the results in each modality, as well as in their combination, to visualize how much each algorithm is affected by it. This also highlights the importance of evaluating SLAM and VO methods on large datasets that cover as wide a variety of environments and motion patterns as possible. The dataset motion bias for each method is presented in Figure 6.
In Figure 6, it can be noticed that DSO, LDSO, and SVO are not considerably affected by motion bias. In contrast, ORB-SLAM2, DynaSLAM, and DF-ORB-SLAM are considerably affected by different motion patterns, reflected in the difference between their forward and backward performance. This behavior provides a reference for the consistency and robustness of each algorithm when used in different environments or applications. We can also observe that CNN-DSO in the forward-only modality outperforms its classic version, but it suffers from a larger motion-bias effect that degrades its overall performance. In addition, DynaSLAM and CNN-SVO outperformed their classic versions and presented a smaller motion-bias effect, representing an additional robustness improvement over them.
Figure 7 shows the color-coded alignment error for each of the 50 TUM-mono sequences for every forward and backward run, making it possible to observe which specific sequences were challenging for each algorithm.
The first row of Figure 7 presents the sparse-direct methods DSO, CNN-DSO, LDSO, and DSM, which demonstrated outstanding performance compared to the evaluated methods from the other classes of the taxonomy; this places sparse-direct methods as the best alternative for visual odometry, SLAM, and 3D reconstruction tasks. From Figure 7, it can be noticed that CNN-DSO performs worse than the original DSO algorithm in sequences 13 and 22 but improves on it in sequence 39. LDSO's performance was close to DSO's, but it produces better trajectories in some forward sequences and outperforms DSO in sequence 21. DSM performed similarly to the rest of the sparse-direct approaches but occasionally presented trajectory-loss issues that affected its overall performance. On the other hand, DynaSLAM considerably improves on the performance of ORB-SLAM2, especially in challenging sequences such as 18, 19, 21, 22, 23, 27, 28, 38, 39, and 40, among others, where ORB-SLAM2 commonly fails; however, it still occasionally presents trajectory-loss and initialization issues. The optical-flow implementation of ORB-SLAM2 (DF-ORB-SLAM) performs slightly worse forwards and considerably worse backwards, especially in sequences 21, 22, 38, 39, 40, 46, 48, and 50. The CNN version of SVO considerably reduced the RMSE in most sequences compared to SVO but still consistently fails in the outdoor sequences 21 and 22 and presents random initialization and trajectory-loss issues. As reported in [18], the SVO and LSD-SLAM methods had the worst results over the whole dataset, which is why Engel et al. [18] did not include them in their study. However, we considered it important to report these results, with the errors attributed to these algorithms' commonly known initialization and trajectory-loss problems over the sequences of the TUM-mono dataset.
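A color-coded overview like the one in Figure 7 can be reproduced by arranging the per-run alignment errors into a (sequence x run) matrix and rendering it as an image; the sketch below uses random placeholder values in place of the measured errors.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

n_sequences, n_runs = 50, 20   # 10 forward + 10 backward runs per sequence
rng = np.random.default_rng(0)
errors = rng.lognormal(0.0, 1.0, size=(n_sequences, n_runs))  # placeholder values

plt.imshow(errors, aspect="auto", cmap="viridis", norm=LogNorm())
plt.colorbar(label="alignment error")
plt.xlabel("run (10 forward, 10 backward)")
plt.ylabel("TUM-mono sequence")
plt.show()
```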
The results processed on the TUM-mono benchmark for the cumulative translation error, rotation error, scale error, start-segment alignment error, end-segment alignment error, and translational RMSE were gathered in a database with the method as the categorical variable. Statistical results were processed using the R programming language. First, we removed blank observations corresponding to runs in which an algorithm got lost or could not initialize, and then applied Mahalanobis distances [73] as a multivariate data-cleaning technique to detect and remove outlier observations. For this, we established a cut-off score of 22.4577 based on the χ² distribution for a 99.9% interval, detecting 344 outlier observations and ending with a database of 8860 observations.
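This outlier-screening step can be sketched as follows; the study used R, but an equivalent Python version over a hypothetical results table (one row per run, one column per metric) shows how the Mahalanobis cut-off is drawn from the χ² distribution.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def remove_multivariate_outliers(df, metric_columns, quantile=0.999):
    """Drop rows whose squared Mahalanobis distance exceeds the chi-squared
    quantile with one degree of freedom per metric column."""
    X = df[metric_columns].to_numpy(dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distances
    cutoff = chi2.ppf(quantile, df=len(metric_columns))  # ~22.46 for 6 metrics at 0.999
    return df[d2 <= cutoff]

# Hypothetical column names for the six error metrics:
# cleaned = remove_multivariate_outliers(results, ["trans_err", "rot_err", "scale_err",
#                                                  "align_start", "align_end", "rmse"])
```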
Then, we verified the normality and homogeneity assumptions for each dependent variable to select the appropriate statistical test for the comparisons. For example, for the translation error, we obtained p-values of 2.2e-16 for all of the DSO, LDSO, CNN-DSO, DSM, DynaSLAM, ORB-SLAM2, DF-ORB-SLAM, CNN-SVO, SVO, and LSD-SLAM methods in the Lilliefors (Kolmogorov-Smirnov) normality test, so the sample did not meet the normality assumption. We then applied Levene's test, obtaining a p-value of 2.2e-16, so the sample did not meet the homogeneity-of-variance assumption either. The rest of the dependent variables showed similar results in these assumption checks; thus, we concluded that the sample was not parametric, and the Kruskal-Wallis test was selected as the general test with the Wilcoxon signed-rank test as the pairwise post-hoc test.
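The assumption checks and the non-parametric group comparison can be sketched in Python as follows (the study itself used R); errors_by_method is a hypothetical mapping from each method's name to its per-run translation errors.

```python
from scipy.stats import levene, kruskal
from statsmodels.stats.diagnostic import lilliefors

def nonparametric_comparison(errors_by_method):
    """Check normality and homogeneity, then test for differences across methods."""
    samples = list(errors_by_method.values())
    for name, errors in errors_by_method.items():
        _, p = lilliefors(errors, dist="norm")      # Lilliefors (Kolmogorov-Smirnov) test
        print(f"{name}: Lilliefors p = {p:.3g}")
    print("Levene p =", levene(*samples).pvalue)     # homogeneity of variances
    print("Kruskal-Wallis p =", kruskal(*samples).pvalue)

# Pairwise post-hoc comparisons can then be run for every pair of methods with a
# rank-based test (SciPy offers wilcoxon for paired samples and mannwhitneyu for
# independent ones), applying a multiple-comparison correction to the p-values.
```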
Figure 8 and Table 2 present the results obtained by applying the difference tests.
As presented in Figure 8 and Table 2, the sample allowed us to identify significant differences between the implemented algorithms. Regarding the translation error, it can be noticed that the DSO, LDSO, and CNN-DSO methods achieved the best performance of the ten evaluated algorithms; despite relative differences of 2.18% and 1.05% in this metric, the differences among them were not significant. DSO, LDSO, and CNN-DSO achieved significantly lower errors than DSM. Feature-based methods performed significantly worse than sparse-direct ones, with DynaSLAM achieving significantly better performance than ORB-SLAM2 and DF-ORB-SLAM, reaching translation-error reductions of 39.19% and 52.02%, respectively. CNN-SVO performed slightly worse than DynaSLAM, but the difference was not significant, while it significantly outperformed its classic version SVO with a 47.57% translation-error reduction. LSD-SLAM showed the significantly worst translation error among the ten algorithms.
Regarding the rotation error, DSO and LDSO achieved significantly better results than the rest of the algorithms. Although DSO showed an average rotation-error reduction of close to 3.66%, the difference was not significant. DSO performed significantly better than its neural version, CNN-DSO, in the accumulated rotation-error metric. LDSO performed around 5.02% better than CNN-DSO, but the difference was not significant. DSM performed significantly worse than the rest of the sparse-direct methods. Feature-based methods performed significantly worse than sparse-direct methods in the rotation-error metric, with DynaSLAM achieving significantly better performance than ORB-SLAM2 and DF-ORB-SLAM, showing average error reductions of close to 33.30% and 54.97%, respectively. CNN-SVO performed significantly better than DF-ORB-SLAM, SVO, and LSD-SLAM in the rotation-error metric, significantly outperforming its classic SVO version with an average rotation-error reduction of 58.07%. LSD-SLAM performed significantly worse than the other methods in the rotation-error metric.
For the scale-error metric, the sparse-direct methods DSO, LDSO, and CNN-DSO performed significantly better than the rest of the methods, with CNN-DSO showing the best performance and average reductions of 0.49% and 0.23% compared to DSO and LDSO, although these differences were not significant. DSM performed significantly worse than CNN-DSO. Feature-based methods performed significantly worse than sparse-direct methods in the scale-error metric, with DynaSLAM achieving the significantly best performance among them and average reductions of 10.60% and 9.02% compared to ORB-SLAM2 and DF-ORB-SLAM. CNN-SVO performed significantly better than the feature-based methods, SVO, and LSD-SLAM, exhibiting an average error reduction of 19.14% compared to its classic version, SVO. Again, LSD-SLAM showed the significantly worst performance of the ten methods in the scale-error metric.
Similarly, we applied the Kruskal-Wallis test as the general test and the Wilcoxon signed-rank test for the pairwise statistical comparisons among the ten methods for the start- and end-segment alignment errors and the overall RMSE.
Figure 9 and Table 3 present the results obtained by applying the difference tests.
Figure 9 and Table 3 show many significant differences between the ten compared methods in the alignment-error and RMSE metrics. Regarding the start-segment alignment error, the DSO, DynaSLAM, and ORB-SLAM2 methods outperform the rest of the algorithms. Although DSO slightly reduced the average start-segment alignment error, by around 7.28% and 7.81% compared to DynaSLAM and ORB-SLAM2, the differences were not significant. The rest of the sparse-direct methods, LDSO, CNN-DSO, and DSM, performed significantly worse than DSO in the start-segment alignment-error metric, by an average of 49.84%, 55.77%, and 74.83%, respectively. For the feature-based methods, DynaSLAM and ORB-SLAM2 performed significantly better than DF-ORB-SLAM, while DF-ORB-SLAM achieved an error significantly lower than those of CNN-SVO, SVO, and LSD-SLAM. When comparing CNN-SVO with its predecessor SVO, the difference was significant, with the neural version reducing the start-segment alignment error by an average of close to 37.86%. LSD-SLAM achieved the significantly worst start-segment alignment error of the ten methods.
By observing the end-segment alignment error, it was found that the sparse-direct methods significantly outperformed the rest of the compared methods. DSO significantly outperformed all the evaluated methods, including the other sparse-direct methods LDSO, CNN-DSO, and DSM, reducing the average alignment error by around 47.85%, 32.50%, and 73.06%, respectively. In the sparse-indirect category, DynaSLAM and ORB-SLAM2 performed significantly better than DF-ORB-SLAM; and even though ORB-SLAM2 reduced the average end-segment error by approximately 7.49%, that difference was not significant. The end-segment alignment error of CNN-SVO was significantly lower than that of SVO, reducing this metric by approximately 47.31%. LSD-SLAM performed significantly worse than the rest of the methods.
For the RMSE metric, sparse-direct methods performed significantly better than the rest, with LDSO achieving RMSE values around 0.32% and 6.68% lower than those of DSO and CNN-DSO, although these differences were not significant. LDSO performed significantly better than DSM, with an average RMSE around 10% lower. In the feature-based class, DynaSLAM performed significantly better than ORB-SLAM2 and DF-ORB-SLAM, reducing the RMSE by approximately 24.49% and 34.41%, respectively. For the hybrid methods, CNN-SVO performed significantly better than SVO, reducing the RMSE by around 34.83%. As in the rest of the metrics, LSD-SLAM performed significantly worse than the other methods in the RMSE metric.
Finally, in Figure 10, we present some sample trajectories obtained by the three overall best methods evaluated in this comparative study. To exemplify the behavior of the algorithms in different environments, we selected sequence seq-02 of the TUM-mono dataset as an indoor example and sequence seq-29 as an outdoor example. In addition, we provide video samples of the execution of each algorithm as supplementary material in the GitHub repository https://github.com/erickherreraresearch/MonocularPureVisualSLAMComparison, along with all the .txt result files of every algorithm run for reproducibility.
As depicted in Figure 10, the algorithms' observed behavior confirms the quantitative results of this comparative analysis. In the top row, for the indoor sequence, it can be noticed that the sparse-direct methods outperform the other evaluated methods, starting and ending their trajectories very close to the ground truth. Indirect methods behave quite differently: the system frequently loses the trajectory, accumulates drift, and obtains wrong scale estimates that are concatenated erroneously once relocalization is achieved. It can also be noticed that DynaSLAM represents an important contribution to the ORB-SLAM2 system because it estimates the trajectory better than the other indirect systems, closing the trajectory very close to the ground truth, while the remaining indirect systems lost their trajectories. On the other hand, hybrid methods performed considerably worse indoors: many of their runs did not complete the full frame sequence, and the trajectory typically ends far from the end-segment ground truth. In the bottom row of Figure 10, it can be noticed again that sparse-direct methods outperformed the rest of the evaluated systems, with appropriate bootstrapping in the start-segment and a small amount of accumulated drift in the end-segment.
On the other hand, in Figure 10e, it can be noticed that, similarly to the indoor case, the indirect methods suffered from trajectory-loss issues and, although the relocalization module typically managed to continue the system's execution, a critical amount of drift accumulates during relocalization, which makes the estimated trajectory end far from the ground-truth end-segment. Among the hybrid methods, SVO suffers from issues similar to those of the indirect methods. However, it can be noticed that the CNN version of SVO improved its performance outdoors, in contrast to indoors, which can be explained by the fact that the added MonoDepth CNN module was trained on the Cityscapes dataset, which consists mainly of outdoor sequences.