Our dataset collection framework focuses on exteroceptive sensors used in robotics for perception, in contrast to sensors such as GPS receivers and wheel encoders that record the status of the vehicle itself. Currently, one of the primary uses of perceptive sensor data in autonomous driving is the detection and segmentation of obstacle-type objects (cars, humans, bicycles) [43] and traffic-type objects (traffic signs and road surfaces) [44]. The mainstream research in this field fuses different sensory data so that the sensors compensate for each other's limitations. A large body of work already addresses the fusion of camera and LiDAR sensors [45], but more attention should be given to the integration of radar data. Although LiDAR sensors outperform radar sensors in terms of point cloud density and object texture, radar sensors have advantages in moving-object detection, speed estimation, and reliability in harsh environments such as fog and dust. Therefore, this framework exploits the characteristics of the radar sensors to highlight moving objects in LiDAR point clouds and to calculate their relative velocity. The radar-LiDAR fusion result is then projected onto the camera image to achieve the final radar-LiDAR-camera fusion.
Figure 1 presents the framework architecture and data flow overview. In summary, the framework is composed of three modules: sensors, processing units, and a cloud server. The radar, LiDAR, and camera sensors used in the framework's prototype are the TI mmwave AWR1843BOOST, the Velodyne VLP-32C, and the Raspberry Pi V2, respectively. Sensor drivers are implemented as ROS nodes that forward data to the connected computing unit. The main computer (ROS master) of the prototype is an Intel® NUC 11 with a Core™ i7-1165G7 processor, and the supporting computer (ROS slave) is a ROCK PI N10. The ROS master and slave computers are physically connected by an Ethernet cable; the ROS slave simply sends the sensory data coming from the camera and the radar to the ROS master for post-processing. The communication between the cloud server and the ROS master relies on the 4G network.
3.2. Sensor Synchronization
For autonomous vehicles that involve multi-sensor systems and sensor fusion applications, it is critical to address the synchronization of multiple sensors with different acquisition rates. The perceptive sensors' operating frequencies are usually limited by their own characteristics. For example, as solid-state devices, cameras can operate at high frequencies; on the contrary, LiDAR sensors usually scan at a rate of no more than 20 Hz because of their internal rotating mechanisms. Although it is possible to configure the sensors to work at the same frequency from the hardware perspective, the latency of the sensor data streams is also a problem for matching the measurements.
In practical situations, it is not recommended to set all sensor frequencies to the same value. For example, reducing the frame rate of the camera sensors to match the frequency of the LiDAR sensors means that fewer images are produced. However, it is possible to optimize the hardware and communication setup to minimize the latency caused by data transfer and pre-processing delays. The typical software solution for synchronizing sensors matches the closest timestamps in the message headers at the end-processing unit. One of the most popular open-source approaches, ROS message_filter [60], implements an adaptive algorithm that first finds the latest message among the heads of all topics (the ROS term for a stream of one sensing modality) and defines it as the pivot. Based on the pivot and a given time threshold, messages are selected from all topic queues, and the message-pairing process is shifted along the time domain. Messages that cannot be paired (because the difference of their timestamps relative to the other messages exceeds the threshold) are discarded. One characteristic of this adaptive algorithm is that the selection of the reference message is not fixed to one sensor modality stream (shown in Figure 4 (a)). For systems with multiple data streams, the number of synchronized message sets is always reconciled to the frequency of the slowest sensor.
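For reference, a minimal rospy sketch of this approach using the ApproximateTimeSynchronizer from the ROS message_filters package is shown below; the topic names and message types are illustrative placeholders, not our actual configuration.

```python
import rospy
import message_filters
from sensor_msgs.msg import Image, PointCloud2

def synced_callback(image_msg, cloud_msg):
    # Called only when an image and a point cloud fall within the time threshold.
    rospy.loginfo("Paired image %s with cloud %s",
                  image_msg.header.stamp, cloud_msg.header.stamp)

rospy.init_node("approx_sync_example")

# Placeholder topic names; replace with the actual camera and LiDAR topics.
image_sub = message_filters.Subscriber("/camera/image_raw", Image)
cloud_sub = message_filters.Subscriber("/lidar/points", PointCloud2)

# 'slop' is the maximum allowed timestamp difference (in seconds) for pairing.
sync = message_filters.ApproximateTimeSynchronizer(
    [image_sub, cloud_sub], queue_size=10, slop=0.05)
sync.registerCallback(synced_callback)

rospy.spin()
```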
For any multi-sensor perceptive system, the sensor synchronization principle should correspond to the hardware configuration and the post-processing of the sensor fusion. As discussed in Section 4.1.2 about the sensor configurations of our work, the camera sensor has the highest rate of 15 FPS, and the LiDAR sensor operates at 10 Hz. Both the camera and LiDAR sensors work at a homogeneous rate, contrary to the heterogeneous radar sensors, which only produce data when moving objects are in the detection zone. Therefore, as shown in Figure 4, depending on the practical scenario, radar data can be sparser than the camera and LiDAR data and can also be scattered unevenly along the time domain. In this case, a direct implementation of the synchronization algorithm of [60] would cause heavy data loss for the camera and LiDAR sensors. For the generic radar-LiDAR-camera sensor fusion in our work, we divide the whole process into three modules based on the frequencies of the sensors. The first module is the fusion of the LiDAR and camera data, because these two sensors have constant rates. The second module is the fusion of the radar and LiDAR sensors, as they both produce point cloud data. Finally, the last module fuses the result of the second module with the camera data, achieving the thorough fusion of all three sensory modalities.
To address the issues of the hardware setup and fulfill the requirements of the fusion principles in our work, we developed a specific algorithm to synchronize the data of all sensors. Inspired by [60], our algorithm also relies on timestamps to synchronize the messages. However, instead of the absolute timestamps used in [60], we use relative timestamps to synchronize the message sets. The two types of timestamps are defined as follows:
Absolute timestamp is the time when the data is produced in the sensor. It is usually created by the sensor's ROS driver and written in the header of each message.
Relative timestamp is the time when the data arrives at the central processing unit, which is the Intel® NUC 11 in our prototype.
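As an illustration of the difference, the following rospy sketch records both timestamps when a message is received; the topic name is a placeholder and the logic is only a minimal example, not our synchronization code.

```python
import rospy
from sensor_msgs.msg import PointCloud2

def on_cloud(msg):
    absolute_t = msg.header.stamp   # written by the sensor driver when the data was produced
    relative_t = rospy.Time.now()   # clock of the central processing unit at arrival
    rospy.loginfo("absolute: %.3f  relative: %.3f  transport delay: %.3f s",
                  absolute_t.to_sec(), relative_t.to_sec(),
                  (relative_t - absolute_t).to_sec())

rospy.init_node("timestamp_example")
rospy.Subscriber("/lidar/points", PointCloud2, on_cloud)  # placeholder topic
rospy.spin()
```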
Theoretically, the absolute timestamp should be the basis of sensor synchronization, as it represents the exact moment at which the data was created. However, absolute timestamps are not always applicable and have certain drawbacks in practical scenarios. First of all, they can be effectively implemented only if all sensors are capable of assigning a timestamp to each message on the fly, which is not always possible because of hardware computational capacity and software limitations. For cost reasons, some basic perceptive sensors do not integrate complex processing capabilities. For example, our prototype's Raspberry Pi V2 camera has no additional computing unit to capture the timestamp. However, because it is a modular Raspberry Pi camera sensor directly connected to the ROCK Pi computer through the CSI socket, the absolute timestamp is available in the header of each image message with the assistance of the ROCK Pi computer. On the other hand, the radar sensors used in the prototype communicate with the computer only through a serial connection, and there are no absolute timestamps for their point cloud messages.
The second requirement for implementing absolute timestamps is clock synchronization between all computers in the data collection framework. There are two computers in our prototype: one serves as the primary computer performing all fundamental operations, and the second is an auxiliary computer used simply for launching the sensors and forwarding data messages to the primary computer. If absolute timestamps are used for sensor synchronization, the clocks of all computers and sensor-embedded computing units need to be synchronized to millisecond precision. An important aspect to be underlined in the specific field of autonomous driving is that sensor synchronization becomes even more important as the speed of the vehicle increases, since timing errors cause distortion in the sensors' readings.
To simplify the deployment procedure of this data collection framework, our sensor synchronization algorithm trades accuracy for simplicity by using relative timestamps, i.e., the clock time of the primary computer when it receives the sensor data. Consequently, the algorithm is sensitive to the delay and bandwidth of the Local Area Network (LAN). As mentioned in Section 4.1.1, all sensors and computers of the prototype are physically connected by Ethernet cables and belong to the same Gigabit LAN. In practical tests, before any payload was applied to the communication network, the average delay between the primary computer and the LiDAR sensor was 0.662 ms, and between the primary computer and the secondary computer (camera and radar sensors) it was 0.441 ms. By contrast, the corresponding delays were 0.703 ms and 0.49 ms when data was being transferred from the sensors to the primary computer. Therefore, the additional time delay caused by transferring data over the LAN is acceptable in practical scenarios. For comparison, the camera-LiDAR time synchronization error of the Waymo dataset is mostly bounded between -6 and 8 ms [7].
Reference frame selection is another essential issue for sensor synchronization, especially for acquisition systems with various types of sensors. The essential difference between message_filter and our algorithm is that the ROS-implemented message_filter selects the nearest upcoming message as the reference, while our algorithm fixes the reference onto the same modality stream (compare the red dot locations in Figure 4 (a) and (b), (c)). Camera and LiDAR sensors have constant frame rates, but radar sensors produce data at a variable frequency, e.g., in the presence of a dynamic object. Therefore, a single reference frame is not applicable for synchronizing all sensors. To address this problem, we divide the synchronization process into two steps. The first step is the synchronization of the LiDAR and camera data, as shown in Figure 4 (b). The LiDAR sensor is chosen as the reference; thus, the frequency of the LiDAR-camera synchronized message set is the same as the LiDAR sensor's frame rate. The LiDAR-camera synchronization continues until the radar sensors capture a dynamic object; in that case, the radar-LiDAR synchronization step begins, see Figure 4 (c). The radar sensor is the reference in the second synchronization step, which means that every radar message has a corresponding matched LiDAR message. As every LiDAR message is also synchronized with a unique camera image, for every radar message there is a complete synchronized radar-LiDAR-camera message set (Figure 4 (d)). The novelty of our synchronization method is the separation of the LiDAR-camera synchronization process from the whole procedure. As a result, we fully exploit the density and consistency of the LiDAR and camera sensors, while also keeping the possibility of synchronizing the sparse and variable information coming from the radar sensors.
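A minimal Python sketch of this two-step pairing by relative timestamps is given below. It assumes that the incoming messages of each modality are buffered as (relative_timestamp, data) tuples; the function names and threshold values are illustrative assumptions, not our actual implementation.

```python
from bisect import bisect_left

def nearest(buffer, t):
    """Return the entry of a time-sorted buffer whose first element is closest to t."""
    times = [entry[0] for entry in buffer]
    i = bisect_left(times, t)
    candidates = buffer[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda entry: abs(entry[0] - t))

def sync_lidar_camera(lidar_buf, camera_buf, slop=0.05):
    """Step 1: each LiDAR frame (reference) is paired with the nearest camera image."""
    pairs = []
    for t_lidar, cloud in lidar_buf:
        t_cam, image = nearest(camera_buf, t_lidar)
        if abs(t_cam - t_lidar) <= slop:
            pairs.append((t_lidar, cloud, image))
    return pairs

def sync_radar(radar_buf, lidar_camera_pairs, slop=0.1):
    """Step 2: each radar frame (reference) is paired with the nearest LiDAR-camera set."""
    sets = []
    for t_radar, radar_points in radar_buf:
        t_lidar, cloud, image = nearest(lidar_camera_pairs, t_radar)
        if abs(t_lidar - t_radar) <= slop:
            sets.append((t_radar, radar_points, cloud, image))
    return sets
```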
3.3. Sensor Fusion
Sensor fusion is critical for most autonomous systems, as it integrates acquisition data from multiple sensors to reduce detection errors and uncertainties. Nowadays, most perceptive sensors have advantages in specific respects but also suffer from drawbacks when working individually. For example, camera sensors provide texture-dense information but are susceptible to changes in illumination; radar sensors can detect reliable relative velocities of objects but struggle to produce dense point clouds; state-of-the-art LiDAR sensors are supposed to address the limitations of camera and radar sensors but lack color and texture information. Relying on LiDAR data alone makes object segmentation more challenging to carry out. Therefore, the common solution is to combine the sensors to overcome the shortcomings of independent sensor operation.
Camera, LiDAR, and radar sensors are considered the most popular perceptive sensors for autonomous vehicles. Presently, there are three mainstream fusion strategies: camera-LiDAR, camera-radar, and camera-LiDAR-radar. The fusion of camera and radar sensors has been widely utilized in industry: car manufacturers combine cameras, radar, and ultrasonic sensors to perceive the vehicle's surroundings. Camera-LiDAR fusion has been widely used in deep learning in recent years. The reliable X-Y-Z coordinates of LiDAR data can be projected as three-channel images, and the fusion of these coordinate-projected images with the camera's RGB images can be carried out in different layers of a neural network. Finally, camera-LiDAR-radar fusion combines the characteristics of all three sensors to provide excellent resolution of color and texture, a precise 3D understanding of the environment, and velocity information.
In this work, we provide the radar-LiDAR-camera fusion as the backend of the dataset collection framework. Notably, we divide the whole fusion process into three steps. The first step is the fusion of the camera and LiDAR sensors, because they work at constant frequencies. The second step is the fusion of the LiDAR and radar point cloud data. The last step combines the fusion results of the first two steps to achieve the complete fusion of the camera, LiDAR, and radar sensors. The advantages of our fusion approach are:
In the first step, the camera-LiDAR fusion yields the maximum number of fusion results: only a few messages are discarded during sensor synchronization, because the camera and LiDAR sensors have close and homogeneous frame rates. Therefore, the projection of the LiDAR point clouds onto the camera images can be easily adapted as input data for neural networks.
The second-step fusion of the LiDAR and radar points grants the dataset the capability to filter moving objects out of the dense LiDAR point clouds and to be aware of the objects' relative velocities.
The thorough camera-LiDAR-radar fusion is the combination of the results of the first two fusion stages, which consumes little computing power and causes only minor delays.
3.3.1. LiDAR Camera Fusion
Camera sensors perceive the real world by projecting objects onto 2D image planes, while LiDAR point cloud data contains direct 3D geometric information. The study in [61] classified the fusion of 2D and 3D sensing modalities into three categories: high-level, mid-level, and low-level fusion. High-level fusion first requires independent post-processing, such as object segmentation or tracking for each modality, and then fuses the post-processing results. Low-level fusion is the integration of the basic information in the raw data, for example the 2D/3D geometric coordinates or image pixel values. Mid-level fusion is an abstraction between high-level and low-level fusion and is also known as feature-level fusion.
Our framework's low-level backend LiDAR-camera fusion focuses on the spatial coordinate matching of the two sensing modalities. Instead of deep learning sensor fusion techniques, we use traditional fusion algorithms for the LiDAR-camera fusion, which means that the input of the fusion process is the raw data, while the output is the enhanced data [62]. One of the standard solutions for low-level LiDAR-camera fusion is converting the 3D point clouds to 2D occupancy grids within the FoV of the camera sensor. There are two steps of LiDAR-camera fusion in our dataset collection framework. The first step is transforming the LiDAR data to the camera coordinate system based on the sensors' extrinsic calibration results; the process follows the equation:
$$
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
= R(\alpha, \beta, \gamma)
\left(
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
-
\begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}
\right)
$$

where $x$, $y$, and $z$ are the 3D point coordinates as seen from the original frame (before the transformation); $t_x$, $t_y$, and $t_z$ are the camera frame location coordinates; $\alpha$, $\beta$, and $\gamma$ are the Euler angles of the corresponding rotation $R$ of the camera frame; and $x_c$, $y_c$, and $z_c$ are the resulting 3D point coordinates as seen from the camera frame (after the transformation). The following step is the projection of the 3D points to 2D image pixels as seen from the camera frame; under the assumption that the camera focal length and the image resolution are known, the following equation performs the projection:

$$
u = f_x \frac{x_c}{z_c} + c_x, \qquad
v = f_y \frac{y_c}{z_c} + c_y, \qquad
c_x = \frac{W}{2}, \quad c_y = \frac{H}{2}
$$

where $x_c$, $y_c$, and $z_c$ are the 3D point coordinates as seen from the camera frame; $f_x$ and $f_y$ are the camera horizontal and vertical focal lengths (known from the camera specification or discovered during the camera calibration routine); $c_x$ and $c_y$ are the coordinates of the principal point (the image center) derived from the image resolution $W$ and $H$; and, finally, $u$ and $v$ are the resulting 2D pixel coordinates. After transforming and projecting the 3D points into the 2D image, a filtering step removes all the points falling outside the camera view.
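As a concrete illustration, the following NumPy sketch applies the extrinsic transformation, the pinhole projection, and the out-of-view filtering described above; the calibration values are placeholder assumptions, not our actual calibration results.

```python
import numpy as np

def transform_to_camera(points, R, t):
    """Transform Nx3 LiDAR points into the camera frame: p_c = R (p - t)."""
    return (points - t) @ R.T

def project_to_image(points_cam, fx, fy, W, H):
    """Pinhole projection of camera-frame points to pixel coordinates, with FoV filtering."""
    cx, cy = W / 2.0, H / 2.0
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    in_front = z > 0                       # keep only points in front of the camera
    u = fx * x[in_front] / z[in_front] + cx
    v = fy * y[in_front] / z[in_front] + cy
    uv = np.stack([u, v], axis=1)
    in_view = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv[in_view], points_cam[in_front][in_view]

# Placeholder calibration values for illustration only.
R = np.eye(3)                   # extrinsic rotation (from calibration)
t = np.array([0.0, 0.0, 0.0])   # extrinsic translation (from calibration)
fx, fy, W, H = 600.0, 600.0, 1280, 720

lidar_points = np.random.uniform(-10, 10, size=(1000, 3))  # dummy LiDAR scan
pixels, cam_points = project_to_image(transform_to_camera(lidar_points, R, t), fx, fy, W, H)
```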
The fusion results of each frame are saved as two files. The first is an RGB image with the projected point cloud, as shown in Figure 5 (a). The 2D coordinates of the LiDAR points are used to pick out the corresponding pixels in the image, and the pixel colors are assigned according to the depth of each point using an HSV colormap. This RGB image is a visualization of the projection result, which helps evaluate the alignment of the point cloud and the image pixels. The second file contains the projected 2D coordinates and the X, Y, and Z axis values of the LiDAR points within the camera view. All this information is dumped as a pickle file, which can be quickly loaded and adapted to other formats, such as arrays and tensors. The information in the second file is visually demonstrated in Figure 5 (b), (c), and (d), which represent the LiDAR footprint projections in the X, Y, and Z planes, respectively. The color of the pixels in each plane is proportionally scaled based on the numerical value of the corresponding axis of each LiDAR point.
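A minimal OpenCV/pickle sketch of how such per-frame outputs could be produced is shown below, assuming the pixel coordinates and camera-frame points from the projection step; the file names and colormap scaling are illustrative assumptions.

```python
import pickle
import cv2
import numpy as np

def save_fusion_frame(image, pixels, cam_points, frame_id):
    """Draw depth-colorized LiDAR points on the image and dump the raw values as a pickle."""
    depth = cam_points[:, 2]
    # Scale depth to 0-255 and map it through an HSV colormap.
    scaled = np.clip(255 * (depth - depth.min()) / max(np.ptp(depth), 1e-6), 0, 255).astype(np.uint8)
    colors = cv2.applyColorMap(scaled.reshape(-1, 1), cv2.COLORMAP_HSV).reshape(-1, 3)

    vis = image.copy()
    for (u, v), color in zip(pixels.astype(int), colors):
        cv2.circle(vis, (int(u), int(v)), 2, color.tolist(), -1)
    cv2.imwrite(f"frame_{frame_id:06d}_projection.png", vis)

    # Second file: projected 2D coordinates plus the 3D axis values of the in-view points.
    with open(f"frame_{frame_id:06d}_points.pkl", "wb") as f:
        pickle.dump({"uv": pixels, "xyz": cam_points}, f)
```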
The three LiDAR footprint projections are generated by, first, projecting the LiDAR points onto the camera plane and, second, assigning the value of the corresponding LiDAR axis to each projected point. The overall algorithm consists of the following steps:
- (1) The LiDAR point cloud is stored in sparse triplet format as a matrix $P \in \mathbb{R}^{N \times 3}$, where $N$ is the number of points in the LiDAR scan.
- (2) The LiDAR points are transformed to the camera reference frame by multiplying $P$ by the LiDAR-to-camera transform matrix $T_{lc}$, yielding the camera-frame-transformed matrix $P_{c}$.
- (3) The transformed LiDAR points are projected onto the camera plane, preserving the original triplet structure; in essence, the transformed LiDAR matrix $P_{c}$ is multiplied by the camera projection matrix $K$; as a result, the projected LiDAR matrix $P_{uv}$ now contains the LiDAR point coordinates on the camera plane (pixel coordinates).
- (4) The camera frame width $W$ and height $H$ are used to cut off all the LiDAR points falling outside the camera view. Considering the projected LiDAR matrix $P_{uv}$ from the previous step, we calculate the row indices whose pixel coordinates $(u, v)$ satisfy $0 \le u < W$ and $0 \le v < H$. The row indices satisfying these expressions are stored in an index array $I$; the shapes of $P_{c}$ and $P_{uv}$ are the same, therefore it is safe to apply the derived indices to both the camera-frame-transformed LiDAR matrix $P_{c}$ and the camera-projected matrix $P_{uv}$.
- (5) The resulting footprint images $F_X$, $F_Y$, and $F_Z$ are initialized following the camera frame resolution $W \times H$ and subsequently populated with black pixels (zero values).
- (6) The zero-valued footprint images are populated as follows: for every retained index $i \in I$ with pixel coordinates $(u_i, v_i)$ and camera-frame coordinates $(x_i, y_i, z_i)$, set $F_X(v_i, u_i) = x_i$, $F_Y(v_i, u_i) = y_i$, and $F_Z(v_i, u_i) = z_i$.
Algorithm 1 summarizes the procedure described above.
Algorithm 1: LiDAR transposition, projection, and population of the footprint images.
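Since only the caption of Algorithm 1 is reproduced here, the following NumPy sketch illustrates the listed steps (transform, project, filter, and populate the footprint images); the matrix names mirror the notation above, but the implementation details are assumptions, not the authors' exact pseudocode.

```python
import numpy as np

def lidar_footprint_images(P, T_lc, K, W, H):
    """Build the X, Y, Z footprint images from an Nx3 LiDAR point cloud.

    P    : Nx3 LiDAR points in the LiDAR frame.
    T_lc : 4x4 homogeneous LiDAR-to-camera transform.
    K    : 3x3 camera projection (intrinsic) matrix.
    """
    # (2) Transform to the camera frame using homogeneous coordinates.
    P_h = np.hstack([P, np.ones((P.shape[0], 1))])
    P_c = (P_h @ T_lc.T)[:, :3]

    # (3) Project onto the camera plane (pixel coordinates).
    proj = P_c @ K.T
    uv = proj[:, :2] / proj[:, 2:3]

    # (4) Keep only the rows that fall inside the image and in front of the camera.
    I = np.where((P_c[:, 2] > 0) &
                 (uv[:, 0] >= 0) & (uv[:, 0] < W) &
                 (uv[:, 1] >= 0) & (uv[:, 1] < H))[0]

    # (5) Initialize the footprint images with zeros (black pixels).
    F_x, F_y, F_z = np.zeros((H, W)), np.zeros((H, W)), np.zeros((H, W))

    # (6) Populate the images with the axis values of the retained points.
    u, v = uv[I, 0].astype(int), uv[I, 1].astype(int)
    F_x[v, u] = P_c[I, 0]
    F_y[v, u] = P_c[I, 1]
    F_z[v, u] = P_c[I, 2]
    return F_x, F_y, F_z
```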
3.3.2. Radar LiDAR and Camera Fusion
This study uses millimeter-wave (mmwave) radar sensors installed on the prototype mount. The motivations for equipping autonomous vehicles with mmwave radar sensors are to make perception robust against adverse weather, to guard against individual sensor failures, and, most importantly, to measure the targets' relative velocities based on the Doppler effect. Currently, the fusion of mmwave radar and vision can be seen as a promising approach to improve object detection [63]. However, most research relies on advanced image processing methods to extract features from the data, so an extra process is needed to convert the radar points into an image-like data format. Moreover, data conversion and deep-learning-based feature extraction consume a great amount of computing power and require noise-free sensing streams. As radar and LiDAR data are both represented as 3D Cartesian coordinates, the most common solution for data fusion is simply applying a Kalman filter [64]. Another example work [65] first converted the 3D LiDAR point clouds to virtual 2D scans and then converted the 2D radar scans to 2D obstacle maps. However, their radar sensor is a mechanical pivoting radar, which differs from our mmwave radar sensors.
In our work, the entire radar-LiDAR-camera fusion operation is divided into two steps. The first step is the fusion of the radar and LiDAR sensors. The second step uses the algorithms proposed in Section 3.3.1 to fuse the first step's results with the camera images. As discussed in Section 3.1, we calibrate the radar sensors to be primarily reactive to dynamic objects. As a result, the principle of the radar-LiDAR fusion in our work is to select the LiDAR points of the moving objects based on the radar detection results.
Figure 6 illustrates the four sequential procedures of the radar-LiDAR fusion. First, the radar points are transformed from the radar frame coordinates to the LiDAR frame coordinates; the corresponding transformation matrices are obtained from the extrinsic sensor calibration. Second, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to the LiDAR point cloud to cluster out the points that potentially represent objects [66]. Third, the nearest LiDAR point clusters are looked up for the radar points that were transformed into the LiDAR frame coordinates. Fourth, the selected LiDAR point clusters are marked out in the raw data (arrays containing the X, Y, and Z coordinate values), and the radar's velocity readings are appended as an extra channel for the selected LiDAR point clusters (or a default value in case a LiDAR point belongs to no cluster).
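A minimal sketch of this radar-LiDAR fusion step, using scikit-learn's DBSCAN and a SciPy KD-tree for the nearest-cluster lookup, could look as follows; the parameter values and the zero default for non-selected points are illustrative assumptions, not our exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def radar_lidar_fusion(lidar_xyz, radar_xyz_lidar_frame, radar_velocities,
                       eps=0.5, min_samples=5, max_dist=1.0):
    """Select LiDAR clusters of moving objects indicated by radar detections.

    lidar_xyz             : Nx3 LiDAR points (LiDAR frame).
    radar_xyz_lidar_frame : Mx3 radar points already transformed to the LiDAR frame.
    radar_velocities      : M relative velocities reported by the radar.
    Returns an Nx4 array: X, Y, Z plus a velocity channel (zero for points
    not belonging to a selected cluster).
    """
    # Cluster the LiDAR points; label -1 marks DBSCAN noise.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(lidar_xyz)

    # For each radar point, find the nearest LiDAR point and take its cluster label.
    tree = cKDTree(lidar_xyz)
    dists, idx = tree.query(radar_xyz_lidar_frame)

    # Mark the selected clusters and append the radar velocity as an extra channel.
    velocity_channel = np.zeros(len(lidar_xyz))
    for d, i, vel in zip(dists, idx, radar_velocities):
        label = labels[i]
        if d <= max_dist and label != -1:
            velocity_channel[labels == label] = vel
    return np.hstack([lidar_xyz, velocity_channel[:, None]])
```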
Figure 7 demonstrates the relative locations of the original and coordinate-transformed radar points and the results of the radar-LiDAR fusion in our work (the LiDAR point clusters of the moving objects). The reference frame for the scattered point clouds is positioned at the center of the LiDAR sensor. Green dots symbolize the original radar points, whereas red dots stand for the radar points transformed to the LiDAR frame coordinates, which are the result of the first step of our radar-LiDAR fusion. Blue dots are the LiDAR points of the moving objects. The selection of the LiDAR point clusters representing the detected moving objects relies on a nearest-neighbor lookup based on the Euclidean distance metric, taking the coordinate-transformed radar points as the reference. Due to their inherent characteristics and the post-intrinsic calibration, the radar sensors in our prototype only produce a handful of points for moving objects in each frame, which means that the whole radar-LiDAR fusion operation is computationally efficient and can be executed on the fly.
The second step of the radar-LiDAR-camera fusion continues from the results of the first, radar-LiDAR fusion step: the LiDAR point clusters that belong to the moving objects are projected onto the camera plane. Figure 8 (a) visualizes the final outcome of the radar-LiDAR-camera fusion in our dataset collection framework. The LiDAR points representing moving objects are filtered from the raw LiDAR data and projected onto the camera images. For each frame, the moving objects' LiDAR point clusters are dumped as a pickle file containing the 3D-space and 2D-projection coordinates of the points and the relative velocity information. Because of the sparsity of the radar data, a direct projection of the radar points onto the camera images has very little practical significance (see Figure 8 (b)): only two radar points are visible in this frame, and for this reason the significant result is the LiDAR point cluster in Figure 8 (a).