The (3) Middle Feature Extractor is responsible for extracting further features from the (2) Local Feature Encoders, providing more context about the shape of objects to the networks of the (4) Detection Head module. Various methods are used; herein, we separate them into 3D Backbones and 2D Backbones, which are described in more detail below.
4.3.1. Backbone 3D
A variety of methods resort to 3D backbones based on sparse convolutions (sparse CNN), such as SECOND (Figure 4), PV-RCNN (Figure 6), PartA² (Figure 7) and Voxel-RCNN (Figure 8). In particular, PV-RCNN uses a Voxel Set Abstraction 3D backbone to encode the multiscale semantic features obtained by the sparse CNN into keypoints. PointRCNN (Figure 5) uses PointNet++ [9] with multi-scale grouping for feature extraction, capturing more context about the shape of objects, and then passes these features to the (4) Detection Head module.
3D Sparse Convolution
The 3D Sparse Convolution method receives the voxel-wise features produced by the VFE or Mean VFE encoder. This backbone is represented as a set of blocks $B = \{b_1, \ldots, b_6\}$. Each block $b_i$ is defined by a set of Sparse Sequential operations, and each operation comprises a sparse convolution, a 1D Batch Normalisation and a ReLU, where the sparse convolution is either a Submanifold Sparse Convolution 3D (SubMConv3D) [12] or a Spatially-sparse Convolution 3D (SpConv3D) [16]. The ReLU follows the standard procedure mentioned in [17].
In our framework, the backbone comprises six such blocks: an input block followed by five further Sparse Sequential blocks.
The Batch Normalisation element follows the formulation of [18], $y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$, where the input features $x$ are the output features of the preceding Submanifold Sparse or Spatially-sparse Convolution 3D, $\epsilon$ is the eps value and the running statistics are updated with a given momentum. The eps and momentum values used are defined in Table 1.
The Spatially-sparse Convolution 3D element is parameterised by its input features, which are the output features of the preceding Submanifold Sparse Conv3D, by its output features, and by a kernel size, stride, padding, dilation and output padding, each defined as a triple over the depth, height and width dimensions, i.e., $(k_D, k_H, k_W)$, $(s_D, s_H, s_W)$, $(p_D, p_H, p_W)$, $(d_D, d_H, d_W)$ and $(op_D, op_H, op_W)$, respectively. The configurations used in our framework are presented in Table 2.
The Submanifold Sparse Convolution 3D [12] is parameterised analogously. Its input features are passed either by the (2) Local Feature Encoder or by the last Sparse Sequential block: when the Local Encoder is the Mean VFE, the input dimensionality is that of the mean voxel features; otherwise it equals $F$, the number of output features of the VFE network, or the number of output features of the preceding block. Its kernel size, stride, padding, dilation and output padding are likewise defined as triples over the depth, height and width dimensions. The configurations used in our framework are presented in Table 3.
The hyperparameters used in each block are defined in Table 7.
Finally, the output spatial features are defined by a tuple $SP = (B, C, D, H, W)$, where $B$ represents the batch size, $C$ the number of output features of the last block, $D$ the depth, $H$ the height and $W$ the width.
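For illustration, a minimal sketch of one Sparse Sequential operation is shown below, assuming the spconv library is available; the kernel size, channel sizes, eps and momentum are illustrative placeholders rather than the exact values of Tables 1-3.

```python
import torch.nn as nn
import spconv.pytorch as spconv  # spconv v2.x

def sparse_sequential_op(in_ch, out_ch, stride=1, indice_key=None):
    """One Sparse Sequential operation: sparse convolution -> BatchNorm1d -> ReLU.

    A SubMConv3D keeps the sparsity pattern (stride 1), whereas a SpConv3D is
    used when the spatial resolution changes (stride > 1)."""
    if stride == 1:
        conv = spconv.SubMConv3d(in_ch, out_ch, kernel_size=3, padding=1,
                                 bias=False, indice_key=indice_key)
    else:
        conv = spconv.SparseConv3d(in_ch, out_ch, kernel_size=3, stride=stride,
                                   padding=1, bias=False)
    return spconv.SparseSequential(
        conv,
        nn.BatchNorm1d(out_ch, eps=1e-3, momentum=0.01),  # eps/momentum: placeholders (see Table 1)
        nn.ReLU(),
    )
```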
PointNet++
We use a modified version of PointNet++ [9], based on [12], to learn features from the undiscretised raw point cloud data in a multi-scale grouping fashion. The objective is to segment the foreground points and to learn contextual information about them. For this purpose, a Set Abstraction (SA) module is used to sub-sample points at a continually increasing rate, and a Feature Proposal module is used to capture a feature map per point for point segmentation and proposal generation. The SA module is composed of a sequence of PointNet Set Abstraction operations. Each operation is represented by a Query and Grouping layer (QGL), which learns multi-scale patterns from points, and by the set of specifications of the PointNet applied before the global pooling at each scale.
The QGL consists of a ball query operation followed by a grouping operation, with two query-and-group pairs, one per scale. A ball query is parameterised by a radius $R$ within which all points are searched from the query point, an upper limit on the number of sampled points, the coordinates $P$ of the point features used to gather them, and the coordinates of the ball query centers (centroids), given by their $(x, y, z)$ positions. The ball query algorithm thus searches for point features $P$ within a radius $R$, up to the upper limit of query points, around the centroids. This operation generates a list of indices of the point features that form each query ball. Afterwards, a grouping operation is performed, which groups the point features according to these indices. In each Set Abstraction operation, the number of centroids decreases and, because the indices are tied to the centroids of the ball query search, the number of indices and the corresponding point features decrease as well. The number of centroids defined in the QGL during the SA operations is given in Table 4.
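For illustration, a brute-force sketch of the ball query and grouping operations for a single batch element is shown below; the function names are illustrative, and real implementations rely on CUDA kernels.

```python
import torch

def ball_query(radius, nsample, points_xyz, centroids_xyz):
    """Return, for each centroid, the indices of up to `nsample` points of
    `points_xyz` (N, 3) lying within `radius` of it. When fewer points are
    found, the first hit is repeated to pad the group. Assumes every centroid
    has at least one point within the radius (e.g., itself)."""
    N = points_xyz.shape[0]
    dist = torch.cdist(centroids_xyz, points_xyz)        # (M, N) pairwise distances
    idx = torch.arange(N).expand(centroids_xyz.shape[0], N).clone()
    idx[dist > radius] = N                               # push out-of-radius points to the end
    idx = idx.sort(dim=1).values[:, :nsample]            # keep the first nsample candidates
    pad = idx[:, :1].expand(-1, nsample)                 # first in-radius index per centroid
    return torch.where(idx == N, pad, idx)

def group_points(point_features, idx):
    """Grouping operation: gather the features of the indexed points.
    point_features: (N, C), idx: (M, nsample) -> (M, nsample, C)."""
    return point_features[idx]
```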
Afterwards, the PointNet specification is applied before the pooling operations. The idea is to capture point-to-point relations of the point features in each local region. The point feature coordinates are first translated to the local region relative to the centroid point, i.e., $(x - x_c, y - y_c, z - z_c)$, where $(x, y, z)$ are the coordinates of a point feature, as mentioned before, and $(x_c, y_c, z_c)$ are the coordinates of the centroid. The PointNet specification consists of two sequential methods, each represented by the set of operations (Conv2D, BN2D, ReLU), where Conv2D is a 2D convolution, BN2D a 2D Batch Normalisation and ReLU the rectified linear unit. Each Conv2D is parameterised by its input features, received either from the QGL or from the output features of the preceding operation, by its kernel size and by its stride, both defined as pairs over the two spatial dimensions. The set of specifications used in our models is summarised in Table 5.
The Set Abstraction operation can then be defined as $f = \max\{G(\mathcal{M}(S))\}$, where $\max$ denotes max pooling, $\mathcal{M}$ denotes random sampling of the grouped features $S$, and $G$ is a multi-layer perceptron network that encodes the features and their relative locations.
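A sketch of the per-scale PointNet specification (Conv2D -> BatchNorm2d -> ReLU stacks followed by max pooling over each local region) is given below; the channel sizes and the 1 x 1 kernels are illustrative, the actual specifications being those of Table 5.

```python
import torch
import torch.nn as nn

def pointnet_scale(in_ch, out_chs=(64, 64)):
    """Shared PointNet applied per scale before pooling: a stack of
    Conv2D -> BatchNorm2d -> ReLU operations (channel sizes are placeholders)."""
    layers, c = [], in_ch
    for out_c in out_chs:
        layers += [nn.Conv2d(c, out_c, kernel_size=1, stride=1, bias=False),
                   nn.BatchNorm2d(out_c), nn.ReLU()]
        c = out_c
    return nn.Sequential(*layers)

# grouped: (B, C + 3, nsample, M) centroid-relative point features from the QGL
# encoded = pointnet_scale(C + 3)(grouped)     # per-point encodings G(.)
# pooled  = torch.max(encoded, dim=2)[0]       # max pooling over each local region
```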
Finally, a Feature Proposal is applied, employing a set of feature proposal modules. Each module is defined by the same element described above, the only difference being the number of operations it comprises. The configurations used in our models are summarised in Table 6.
Voxel Set Abstraction
This method aims to generate a set of keypoints from a given point cloud using a keypoint sampling strategy based on Farthest Point Sampling (FPS). It produces a small number of keypoints, represented by an index set of shape $(B, M)$, where $M$ is the number of point features with the largest minimum distance and $B$ the batch size. Given the maximum number $M$ of features to sample and the total number $N$ of point features, FPS iteratively computes the distance between every point and the most recently selected keypoint, keeps for each point the smallest distance to the selected set (its last known minimum distance), and selects as the next keypoint the point whose minimum distance is largest. This operation generates the set of indices of shape $(B, M)$, and the keypoints $K$ are the point features gathered at these indices.
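A sketch of Farthest Point Sampling for a single batch element is shown below; it is illustrative only, as the frameworks use CUDA implementations of this operation.

```python
import torch

def farthest_point_sampling(points_xyz, m):
    """Select m keypoint indices from points_xyz (N, 3) so that each new
    keypoint is the point with the largest minimum distance to those already
    selected."""
    n = points_xyz.shape[0]
    selected = torch.zeros(m, dtype=torch.long)
    min_dist = torch.full((n,), float("inf"))   # last known minimum distance per point
    farthest = 0                                # start from an arbitrary point
    for i in range(m):
        selected[i] = farthest
        dist = ((points_xyz - points_xyz[farthest]) ** 2).sum(dim=1)
        min_dist = torch.minimum(min_dist, dist)     # update the minimum distances
        farthest = int(torch.argmax(min_dist))       # largest minimum distance wins
    return selected
```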
These keypoints $K$ are subject to an interpolation process that utilises the semantic features encoded by the 3D Sparse Convolution. In this process, the semantic features are mapped to the keypoints through the voxels in which they reside. Firstly, the local relative coordinates of the keypoints with respect to the voxel grid are computed. Then, a bilinear interpolation is carried out to map the features from the 3D Sparse Convolution onto the keypoints using these local relative coordinates. For each keypoint, the indices of the four neighbouring cells of the BEV feature map are obtained as $(x_0, y_0)$ and $(x_1, y_1) = (x_0 + 1, y_0 + 1)$, where $(x_0, y_0)$ is the floor of the keypoint's local coordinates $(x, y)$. The weights between these indices are calculated from the fractional offsets, and the bilinear expression that gives the feature from the BEV perspective is
$f = f_{00}\,(x_1 - x)(y_1 - y) + f_{10}\,(x - x_0)(y_1 - y) + f_{01}\,(x_1 - x)(y - y_0) + f_{11}\,(x - x_0)(y - y_0)$,
where $f_{ij}$ denotes the BEV feature at cell $(x_i, y_j)$.
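A sketch of this bilinear interpolation for a single batch element is shown below; the voxel_size, pc_range and bev_stride arguments are assumptions used to express the keypoints in BEV grid coordinates.

```python
import torch

def interpolate_from_bev(bev_features, keypoints_xy, voxel_size, pc_range, bev_stride):
    """Bilinearly interpolate BEV features at keypoint locations.

    bev_features: (C, H, W) feature map obtained from the 3D backbone
    keypoints_xy: (M, 2) keypoint x, y coordinates in the LiDAR frame
    Returns (M, C) interpolated keypoint features."""
    # express keypoints in (fractional) BEV grid coordinates
    x = (keypoints_xy[:, 0] - pc_range[0]) / (voxel_size[0] * bev_stride)
    y = (keypoints_xy[:, 1] - pc_range[1]) / (voxel_size[1] * bev_stride)
    x0, y0 = x.floor().long(), y.floor().long()
    x1, y1 = x0 + 1, y0 + 1
    C, H, W = bev_features.shape
    x0c, x1c = x0.clamp(0, W - 1), x1.clamp(0, W - 1)   # clamp indices to the grid
    y0c, y1c = y0.clamp(0, H - 1), y1.clamp(0, H - 1)
    w00 = (x1 - x) * (y1 - y)                           # bilinear weights from
    w10 = (x - x0) * (y1 - y)                           # the fractional offsets
    w01 = (x1 - x) * (y - y0)
    w11 = (x - x0) * (y - y0)
    f = (bev_features[:, y0c, x0c] * w00 + bev_features[:, y0c, x1c] * w10 +
         bev_features[:, y1c, x0c] * w01 + bev_features[:, y1c, x1c] * w11)
    return f.t()
```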
The local features around each keypoint are then aggregated using PointNet++ according to the specifications defined above. This generates voxel-wise features within the neighbouring voxel set of each keypoint, transformed using the PointNet++ specifications, so that each keypoint aggregates features of the 3D Sparse Convolution from the different levels according to Table 4.
4.3.2. Backbone 2D
PointPillars (Figure 3) uses only a 2D Backbone, since 2D Backbones require fewer computational resources than 3D Backbones. However, they introduce information loss, which can be mitigated by readjusting the objects back to the LiDAR's Cartesian 3D system. For this purpose, the features resulting from the PFN are used by the Backbone Scatter component, which scatters them back into a 2D pseudo-image. The next component, the Detection Head, then uses this 2D pseudo-image.
Other models, such as SECOND (Figure 4), PV-RCNN (Figure 6), PartA² (Figure 7) and Voxel-RCNN (Figure 8), compress the information into a Bird's-eye view (BEV) representation after using a 3D Backbone for feature extraction. Afterwards, they perform feature encoding and concatenation using an Encoder Conv2D, and the resulting features are passed to the Detection Head module.
Backbone Scatter
The features resulting from the PFN are used by the PointPillars Scatter component, which scatters them back into a 2D pseudo-image of spatial size $H \times W$, where $H$ and $W$ denote height and width, respectively.
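A minimal sketch of this scatter step for a single batch element is shown below, assuming each non-empty pillar carries integer (y, x) grid coordinates.

```python
import torch

def pointpillars_scatter(pillar_features, coords, H, W):
    """Scatter PFN pillar features back into a dense 2D pseudo-image.

    pillar_features: (P, C) features of the P non-empty pillars
    coords:          (P, 2) integer (y, x) grid coordinates of each pillar
    Returns a (C, H, W) pseudo-image; cells of empty pillars remain zero."""
    C = pillar_features.shape[1]
    canvas = pillar_features.new_zeros(C, H * W)      # flattened empty canvas
    flat_idx = coords[:, 0] * W + coords[:, 1]        # linear index of each pillar cell
    canvas[:, flat_idx] = pillar_features.t()         # place features at their cells
    return canvas.view(C, H, W)
```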
BEV Backbone
The BEV Backbone module receives the 3D feature maps from the 3D Sparse Convolution and reshapes them into a BEV feature map: given sparse features of shape $(B, C, D, H, W)$, the depth dimension is folded into the channel dimension, yielding a dense feature map of shape $(B, C \times D, H, W)$. The BEV Backbone is represented as a set of blocks, each described by a configuration $(n, F, U, S)$, where $n$ represents the number of convolutional layers in the block, $F$ the number of filters of each convolutional layer, $U$ the number of upsample filters, and $S$ the stride of the block. Each of the upsample filters has the same characteristics, and their outputs are combined through concatenation. If $S > 1$, the block begins with a downsampling convolutional layer, which is followed by the block's remaining convolutional layers with stride 1. BatchNorm and ReLU layers are applied after each convolutional layer.
The input to this set of blocks is the spatial features extracted by the 3D Sparse Convolution or Voxel Set Abstraction modules, reshaped into a BEV feature map.
Table 7. The different block configurations (n, F, U, S) used. N.A. - Not Applicable.

| Models | Block 1 | Block 2 | Block 3 |
|---|---|---|---|
| PointPillars | (3, 64, 128, 2) | (5, 128, 128, 2) | (5, 128, 128, 2) |
| SECOND | (5, 64, 128, 1) | (5, 128, 256, 2) | N.A. |
| PV-RCNN | (5, 64, 128, 1) | (5, 128, 256, 2) | N.A. |
| PointRCNN | N.A. | N.A. | N.A. |
| PartA² | (5, 128, 256, 2) | (5, 128, 256, 2) | N.A. |
| VoxelRCNN | (5, 128, 256, 2) | (5, 128, 256, 2) | N.A. |
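For illustration, a minimal sketch of how one such block could be assembled from a (n, F, U, S) entry is shown below, assuming PyTorch; the 3 x 3 kernels are illustrative, and it is an assumption here that n counts every convolution in the block, including the downsampling one.

```python
import torch.nn as nn

def bev_block(in_ch, n, F, S):
    """Assemble one BEV Backbone block from its (n, F, U, S) entry.

    A stride-S convolution (downsampling when S > 1) is followed by n - 1
    stride-1 convolutions; each convolution is followed by BatchNorm and ReLU.
    The U upsample filters are applied afterwards by the Encoder Conv2D."""
    layers = [nn.Conv2d(in_ch, F, kernel_size=3, stride=S, padding=1, bias=False),
              nn.BatchNorm2d(F), nn.ReLU()]
    for _ in range(n - 1):
        layers += [nn.Conv2d(F, F, kernel_size=3, padding=1, bias=False),
                   nn.BatchNorm2d(F), nn.ReLU()]
    return nn.Sequential(*layers)

# e.g., the first SECOND block of Table 7, applied to the C x D reshaped input:
# block1 = bev_block(in_ch=C * D, n=5, F=64, S=1)
```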
Encoder Conv2D
The features extracted in each block are upsampled according to the downsample factor $D$ of its convolutional layers $C$, so that all blocks yield feature maps of the same spatial resolution; the upsampled features are then concatenated, where cat denotes the concatenation operation along the channel dimension, before being passed to the Detection Head module.
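For illustration, a sketch of this upsample-and-concatenate step is given below; the use of a transposed convolution with kernel size and stride equal to the upsampling factor is an assumption about how the upsampling is realised.

```python
import torch
import torch.nn as nn

def upsample_and_concat(block_outputs, up_channels, factors):
    """Upsample each BEV block output by its downsample factor and concatenate.

    block_outputs: list of (B, C_i, H_i, W_i) feature maps, one per block
    up_channels:   number of upsample filters U shared by the blocks
    factors:       per-block upsampling factor (its cumulative downsample factor D)
    In a real module the deconvolutions would be built once in __init__."""
    ups = []
    for feat, f in zip(block_outputs, factors):
        deconv = nn.Sequential(
            nn.ConvTranspose2d(feat.shape[1], up_channels, kernel_size=f,
                               stride=f, bias=False),
            nn.BatchNorm2d(up_channels), nn.ReLU())
        ups.append(deconv(feat))              # all maps now share one resolution
    return torch.cat(ups, dim=1)              # cat: concatenation along channels
```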