3.1. Calculation of different texture area weights
In the actual ORB feature extraction process, the image needs to be divided into several smaller regions from which feature points are extracted, keeping the number of feature points in each region consistent so that the calculation results are more accurate. The main steps are as follows [18]:
1. segment the image;
2. extract feature points in each region;
3. relax the extraction condition and extract again if the number of feature points in a region is less than the minimum threshold;
4. if the number of feature points is greater than the maximum threshold, keep those with the largest Harris response values and discard the rest.
Figure 3(a) shows a 1226 × 370 road scene from sequence 05 of the KITTI dataset with features extracted without region segmentation. As seen in the figure, the feature points are concentrated mainly on the vegetation and the outlines of the houses, which not only causes mismatches and lowers the correct matching rate but also introduces large errors into the calculation results.
First, the image is segmented so that its feature points are distributed as evenly as possible over the image. For an image of original size $W \times H$, given the segmentation coefficients $k_w$ and $k_h$ for width and height, the width and height of each sub-region are obtained by dividing the image equally:
$$w = \frac{W}{k_w}, \qquad h = \frac{H}{k_h}.$$
The FAST threshold for the first extraction pass in each region after segmentation is 30; if no feature points are extracted, the threshold is lowered to 3 and extraction is repeated. A total of 1862 feature points were extracted, and the result shows that after region segmentation the extracted feature points are distributed more evenly over the image. The next step is to filter the feature points in each region according to their Harris response values, keeping the points with the largest response values in each region; the results are shown in Figure 3(b).
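As an illustration of the above procedure, the following sketch (assuming OpenCV and NumPy on a grayscale image; the grid size and per-cell cap are illustrative placeholders, and only the FAST thresholds 30 and 3 follow the text) extracts FAST corners cell by cell, relaxes the threshold when a cell yields no points, and keeps the strongest Harris responses in each cell.

```python
# Sketch of grid-based extraction with threshold relaxation and Harris
# filtering; grid size and per-cell cap are illustrative placeholders.
import cv2
import numpy as np

def extract_grid_features(img, grid_w=40, grid_h=40, max_per_cell=5,
                          fast_thresh=30, relaxed_thresh=3):
    """Detect FAST corners cell by cell, relax the threshold when a cell is
    empty, and keep only the strongest Harris responses in each cell."""
    strict = cv2.FastFeatureDetector_create(threshold=fast_thresh)
    relaxed = cv2.FastFeatureDetector_create(threshold=relaxed_thresh)
    harris = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

    keypoints = []
    h, w = img.shape[:2]
    for y0 in range(0, h, grid_h):
        for x0 in range(0, w, grid_w):
            cell = img[y0:y0 + grid_h, x0:x0 + grid_w]
            kps = strict.detect(cell) or relaxed.detect(cell)  # relax if empty
            # Rank the cell's corners by Harris response and keep the best few.
            kps = sorted(kps,
                         key=lambda kp: harris[int(y0 + kp.pt[1]),
                                               int(x0 + kp.pt[0])],
                         reverse=True)[:max_per_cell]
            for kp in kps:
                kp.pt = (kp.pt[0] + x0, kp.pt[1] + y0)  # back to image coords
            keypoints.extend(kps)
    return keypoints
```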
Because the response values of feature points are compared only within each region, the above screening retains only local optima. The relationship between the response values of the feature points before and after screening is shown in Figure 4. The blue curve indicates the response values of all 1862 feature points, and the red dots indicate the response values of the feature points kept after screening. It can be seen that a considerable number of the retained feature points are local extrema of the response value but still have low response values overall, which indicates that their corner properties are weak compared with those of other points.
While point feature homogenization ensures that the feature points are distributed as evenly as possible, it also weakens the corner properties of some of them. The method works well when the texture is rich, but in environments with less rich texture it leads to poor feature matching. To address this problem, the image regions are differentiated [19]. For an image region with grayscale $I(x, y)$, the following matrix is calculated:
$$G = \sum_{(x, y)} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},$$
where $I_x$ and $I_y$ are the grayscale gradients in the $x$ and $y$ directions and the sum is taken over the pixels of the region.
The two eigenvalues of the matrix G represent the texture information of the region: when both eigenvalues are large, the region is a high-texture region, and when they are small it is a low-texture region [20].
Different texture regions are therefore given different weights: the weights should be small for low-texture areas of the image and large for high-texture areas. The weight values are determined by the grayscale gradients of the pixels in the local region of the image. This yields weighted feature points that are uniformly distributed over the image, ready for feature tracking and motion estimation.
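As a hedged sketch of how such region weights could be computed, the code below builds the matrix G from Sobel gradients and maps its smaller eigenvalue to a weight normalised over all regions; this particular mapping is an illustrative assumption, not the exact definition used in this paper.

```python
# Sketch: per-region texture score from the eigenvalues of G, then a
# normalised weight; the normalisation is an assumption for illustration.
import cv2
import numpy as np

def texture_score(img, x0, y0, w, h):
    """Smaller eigenvalue of G = sum [Ix^2, IxIy; IxIy, Iy^2] over the region."""
    patch = np.float32(img[y0:y0 + h, x0:x0 + w])
    ix = cv2.Sobel(patch, cv2.CV_32F, 1, 0, ksize=3)
    iy = cv2.Sobel(patch, cv2.CV_32F, 0, 1, ksize=3)
    G = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    return np.linalg.eigvalsh(G)[0]   # small value -> low texture

def region_weights(img, regions):
    """Map each region's texture score to a weight in (0, 1]:
    high-texture regions get weights near 1, low-texture regions near 0."""
    scores = np.array([texture_score(img, *r) for r in regions])
    return scores / (scores.max() + 1e-12)
```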
3.2. Keyframe-based predictive motion model
Stereo matching aims to find the corresponding projection points of the same spatial point in images acquired from different viewpoints [21]. A parallel binocular vision system uses the epipolar constraint to perform feature matching between the left and right images. From epipolar geometry, if the projection points of the same spatial point P on the left and right images are P1 and P2, then the point P2 corresponding to P1 must lie on the epipolar line l2 associated with P1. For a parallel binocular system, the epipolar lines of the same spatial point lie on the same line; that is, P2 lies on the extension of the epipolar line through P1, so when searching for the matching point it is only necessary to search in the neighbourhood of that epipolar line.
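The sketch below illustrates this row-constrained search for a rectified stereo pair, assuming binary ORB descriptors; the row tolerance and Hamming threshold are illustrative values.

```python
# Sketch of row-constrained (epipolar) matching for a rectified stereo pair;
# row_tol and max_hamming are illustrative values.
import numpy as np

def match_along_epipolar(kps_left, desc_left, kps_right, desc_right,
                         row_tol=2.0, max_hamming=50):
    """Brute-force ORB matching restricted to (almost) the same image row."""
    matches = []
    for i, (kp_l, d_l) in enumerate(zip(kps_left, desc_left)):
        best_j, best_dist = -1, max_hamming
        for j, (kp_r, d_r) in enumerate(zip(kps_right, desc_right)):
            if abs(kp_l.pt[1] - kp_r.pt[1]) > row_tol:   # epipolar constraint
                continue
            dist = np.count_nonzero(np.unpackbits(d_l ^ d_r))  # Hamming distance
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j >= 0:
            matches.append((i, best_j))
    return matches
```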
For feature matching between inter-frame images, relying only on the above constraint to reduce errors does not guarantee the robustness and real-time performance of the visual odometry; the common solution is to estimate a motion model that narrows the search range when matching feature points between consecutive frames [22]. The motion model of the system is estimated from the images at times t-1 and t; under this model, the positions at time t+1 of the feature points observed at time t are predicted, and the best matching points are searched for around these predicted positions.
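A minimal sketch of this prediction step is given below, assuming a pinhole intrinsic matrix K and the relative transform T estimated between times t-1 and t; re-applying T for t to t+1 encodes the constant-motion assumption, and the search-window radius is a placeholder.

```python
# Sketch of the predictive search: the last inter-frame motion T is applied
# once more to predict projections at t+1; K, T and the radius are placeholders.
import numpy as np

def predict_projections(points_3d, T, K):
    """Project 3D points (given in the frame-t camera frame) after applying
    the relative motion T estimated between t-1 and t once more."""
    R, t = T[:3, :3], T[:3, 3]
    predictions = []
    for P in points_3d:
        Pc = R @ P + t                              # predicted camera coords at t+1
        u = K[0, 0] * Pc[0] / Pc[2] + K[0, 2]
        v = K[1, 1] * Pc[1] / Pc[2] + K[1, 2]
        predictions.append((u, v))
    return predictions

def candidates_in_window(prediction, keypoints, radius=20.0):
    """Restrict matching to keypoints inside the window around the prediction."""
    u0, v0 = prediction
    return [kp for kp in keypoints
            if (kp.pt[0] - u0) ** 2 + (kp.pt[1] - v0) ** 2 <= radius ** 2]
```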
However, with the above method, the overlap between adjacent frames increases for slow-moving vehicles, so the projections of the feature points hardly change and the system becomes highly sensitive to errors. In this paper, we propose using keyframes for motion model estimation to solve this problem. Keyframes are chosen so that feature points are easy to identify between adjacent keyframes: the current frame is taken as a keyframe only when the mean Euclidean distance between the 3D coordinates of all points matched with the previous keyframe lies within a certain threshold range:
$$d_{\min} \le \bar{d}_i \le d_{\max},$$
where
- $d_{\min}$ is the minimum value of the distance threshold;
- $d_{\max}$ is the maximum value of the distance threshold;
- $\bar{d}_i$ is the mean Euclidean distance between the 3D coordinates of all matching points of the $i$-th keyframe and the $(i-1)$-th keyframe.
The specific steps are as follows:
Take the first input frame as the reference frame; each subsequent frame is compared with the selected reference frame by computing the Euclidean distance until a frame satisfying the condition is found, which becomes the current keyframe. Taking this keyframe as the new reference, the above operation is repeated to find all keyframes. As shown in Figure 6, T0 denotes the reference keyframe and T1 the current keyframe. The motion calculated from the two keyframes is used to estimate the motion model between the current frame and the next frame; this model is used to predict the positions at time t+1 of the feature points in the image at time t, and the best matching points are then searched for around these predicted positions.
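The keyframe test and the selection loop can be sketched as follows; the threshold values d_min and d_max are placeholders, and match_3d is a hypothetical helper that returns the matched 3D points of two frames from the stereo reconstruction.

```python
# Sketch of keyframe selection; d_min/d_max are placeholders and match_3d is a
# hypothetical helper returning matched (N, 3) point arrays for two frames.
import numpy as np

def is_keyframe(pts_ref, pts_cur, d_min=0.5, d_max=5.0):
    """True when the mean 3D distance of matched points lies in [d_min, d_max]."""
    mean_dist = np.mean(np.linalg.norm(pts_cur - pts_ref, axis=1))
    return d_min <= mean_dist <= d_max

def select_keyframes(n_frames, match_3d, d_min=0.5, d_max=5.0):
    """Keep the first frame as reference; promote a frame to keyframe when the
    test passes, then use it as the new reference."""
    keyframes, ref = [0], 0
    for i in range(1, n_frames):
        pts_ref, pts_cur = match_3d(ref, i)
        if is_keyframe(pts_ref, pts_cur, d_min, d_max):
            keyframes.append(i)
            ref = i
    return keyframes
```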
3.3. 3D reconstruction
The 3D reconstruction is first performed using the matched feature point pairs on the images, and then a second projection, called reprojection, is performed using the coordinates of the computed 3D points and the computed camera matrix. For a three-dimensional point P, the projection is
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K (R P + t),$$
where $(u, v)$ are the pixel coordinates, $K$ is the camera intrinsic matrix, $R$ and $t$ are the camera rotation and translation, and $s$ is the depth of the point in the camera frame.
Measurement errors are inevitable because of the limited accuracy of measuring instruments and the influence of human factors and external conditions. There is therefore a certain error between the projected points, called the reprojection error, i.e., the difference between the projection and the reprojection of the real three-dimensional spatial point on the image plane. To deal with the errors in these projected points, the number of observations is usually made larger than the number necessary to determine the unknowns; that is, redundant observations are made.
Redundant observations in turn cause contradictions among the observation results, and resolving these contradictions requires optimizing the model to find the most reliable values of the observed quantities and to evaluate the accuracy of the measurement results [25]. The reprojection error of a point feature is calculated as
$$e_i = u_i - \frac{1}{s_i} K (R P_i + t),$$
where $u_i$ is the observed pixel coordinate of the $i$-th feature point, $P_i$ is its 3D coordinate, and $s_i$ is its depth.
A least squares problem is constructed with the reprojection errors of all points as the cost function:
$$\min \frac{1}{2} \sum_{i=1}^{n} \left\| u_i - \frac{1}{s_i} K (R P_i + t) \right\|_2^2.$$
When calculating the minimized reprojection error, the texture weight of each feature point is added:
$$\min \frac{1}{2} \sum_{i=1}^{n} w_i \left\| u_i - \frac{1}{s_i} K (R P_i + t) \right\|_2^2,$$
where $w_i$ is the texture weight of the region containing the $i$-th feature point.
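A sketch of this weighted cost is shown below, under the assumption that each squared residual is simply scaled by the texture weight of the region containing its feature point.

```python
# Sketch of the texture-weighted reprojection cost: each squared residual is
# scaled by the weight w_i of the region containing feature i (assumed form).
import numpy as np

def reprojection_residual(P, uv, R, t, K):
    """e_i = observed pixel minus projection of the 3D point P."""
    Pc = R @ P + t
    u = K[0, 0] * Pc[0] / Pc[2] + K[0, 2]
    v = K[1, 1] * Pc[1] / Pc[2] + K[1, 2]
    return np.array([uv[0] - u, uv[1] - v])

def weighted_cost(points_3d, observations, weights, R, t, K):
    """0.5 * sum_i w_i * ||e_i||^2 over all matched points."""
    total = 0.0
    for P, uv, w in zip(points_3d, observations, weights):
        e = reprojection_residual(P, uv, R, t, K)
        total += 0.5 * w * float(e @ e)
    return total
```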
Before solving the least squares optimization problem, the derivative of each error term with respect to the optimization variables is needed, i.e., the linearization
$$e(x + \Delta x) \approx e(x) + J \Delta x.$$
The pixel coordinate error $e$ is two-dimensional and the camera pose, represented by the Lie algebra $\xi$, is six-dimensional, so $J$ is a 2 × 6 matrix. Denote by $P'$ the spatial point transformed into the camera coordinate frame, taking the first three dimensions:
$$P' = \left(\exp(\xi^{\wedge}) P\right)_{1:3} = [X', Y', Z']^{T}.$$
Then the camera projection model is
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K P',$$
which, after eliminating $s = Z'$, gives
$$u = f_x \frac{X'}{Z'} + c_x, \qquad v = f_y \frac{Y'}{Z'} + c_y.$$
Consider the derivative of $e$ with respect to the perturbation $\delta\xi$:
$$\frac{\partial e}{\partial \delta\xi} = \lim_{\delta\xi \to 0} \frac{e(\delta\xi \oplus \xi) - e(\xi)}{\delta\xi} = \frac{\partial e}{\partial P'} \frac{\partial P'}{\partial \delta\xi},$$
where $\oplus$ denotes the left multiplicative perturbation on the Lie algebra. With the relationships between the variables obtained above, the first term is
$$\frac{\partial e}{\partial P'} = -\begin{bmatrix} \dfrac{f_x}{Z'} & 0 & -\dfrac{f_x X'}{Z'^2} \\ 0 & \dfrac{f_y}{Z'} & -\dfrac{f_y Y'}{Z'^2} \end{bmatrix}.$$
For the second term, by the definition of the derivative of the transformed point with respect to the perturbation,
$$\frac{\partial \left(\exp(\xi^{\wedge}) P\right)}{\partial \delta\xi} = \begin{bmatrix} I & -P'^{\wedge} \\ 0^{T} & 0^{T} \end{bmatrix};$$
taking its first three dimensions and multiplying the two terms together yields the 2 × 6 Jacobian matrix
$$\frac{\partial e}{\partial \delta\xi} = -\begin{bmatrix} \dfrac{f_x}{Z'} & 0 & -\dfrac{f_x X'}{Z'^2} & -\dfrac{f_x X' Y'}{Z'^2} & f_x + \dfrac{f_x X'^2}{Z'^2} & -\dfrac{f_x Y'}{Z'} \\ 0 & \dfrac{f_y}{Z'} & -\dfrac{f_y Y'}{Z'^2} & -f_y - \dfrac{f_y Y'^2}{Z'^2} & \dfrac{f_y X' Y'}{Z'^2} & \dfrac{f_y X'}{Z'} \end{bmatrix}.$$
This Jacobian matrix describes the first-order variation of the reprojection error with respect to the Lie algebra of the camera pose. The derivative of $e$ with respect to the spatial point $P$ is
$$\frac{\partial e}{\partial P} = \frac{\partial e}{\partial P'} \frac{\partial P'}{\partial P}.$$
Regarding the second term, by definition $P' = R P + t$, so $\dfrac{\partial P'}{\partial P} = R$ and
$$\frac{\partial e}{\partial P} = -\begin{bmatrix} \dfrac{f_x}{Z'} & 0 & -\dfrac{f_x X'}{Z'^2} \\ 0 & \dfrac{f_y}{Z'} & -\dfrac{f_y Y'}{Z'^2} \end{bmatrix} R.$$
Thus, the two derivative matrices of the observation equation with respect to the camera pose and the feature point are obtained.
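The two Jacobians can be written down directly in code; the sketch below follows the sign convention e = observed pixel minus projected pixel and the perturbation order (translation, rotation) used above.

```python
# Sketch of the two Jacobians derived above, with e = observed - projected
# and the se(3) perturbation ordered as (translation, rotation).
import numpy as np

def de_dPc(Pc, K):
    """2x3 derivative of e w.r.t. the point P' = (X', Y', Z') in camera frame."""
    X, Y, Z = Pc
    fx, fy = K[0, 0], K[1, 1]
    return -np.array([[fx / Z, 0.0, -fx * X / Z**2],
                      [0.0, fy / Z, -fy * Y / Z**2]])

def jacobian_pose(Pc, K):
    """2x6 Jacobian of e w.r.t. the left perturbation: de/dP' times [I | -P'^]."""
    X, Y, Z = Pc
    skew = np.array([[0.0, -Z, Y],
                     [Z, 0.0, -X],
                     [-Y, X, 0.0]])                 # P'^ (hat operator)
    dPc_dxi = np.hstack([np.eye(3), -skew])         # first 3 rows of d(TP)/d(dxi)
    return de_dPc(Pc, K) @ dPc_dxi

def jacobian_point(Pc, R, K):
    """2x3 Jacobian of e w.r.t. the world point P: de/dP' times R."""
    return de_dPc(Pc, K) @ R
```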