1. Introduction
With the continuous advancement of sensors and image-matching technologies, three-dimensional (3D) point clouds have found widespread application across many domains. Effective classification of point clouds plays a crucial role in fields such as autonomous driving, robot navigation, augmented reality, and 3D reconstruction. However, owing to the irregularity and sparsity inherent in 3D point clouds, classifying them in complex environments is by no means a straightforward task. Furthermore, the density of a point cloud can vary with the sampling interval and range of the laser scanner, while severe occlusions between objects during scanning can leave object surfaces incompletely covered. These challenges pose significant hurdles for the classification of 3D point clouds.
As noted above, standard convolutional neural networks cannot be applied directly to 3D point clouds because of their unordered and unstructured nature. Some researchers have therefore regularized point clouds in order to draw on the experience of two-dimensional semantic segmentation networks. The authors of [1] presented the groundbreaking PointNet, which operates directly on irregular point clouds, utilizes shared Multi-Layer Perceptrons (MLPs) to learn point features, and employs symmetric pooling functions to capture global features. Building upon PointNet [1], subsequent scholars proposed a series of point-wise MLP methods such as PointNet++ [2], Frustum-PointNet [3], PCNN [4], DGCNN [5], and PointWeb [6]. However, shared MLPs may not adequately capture the local geometric characteristics of a point cloud and overlook interactions between points. Zhang [7] introduced PointHop, an interpretable point cloud classification method that primarily employs spatial partitioning to handle the unordered nature of point clouds and explores ensemble methods to enhance classification performance. Ben-Shabat [8] introduced 3DmFV, an intuitive 3D point cloud representation based on Fisher Vectors computed over a grid, and designed novel network architectures for real-time point cloud classification. 3DpointCapsNet [9] proposed a 3D point capsule network that preserves the spatial arrangement of the input data and designs a 2D latent space, bringing improvements to several common point cloud tasks.
Nonetheless, the conventional multilayer perceptron (MLP) is inherently limited in modeling global feature interactions between points, owing to the mutual independence of its neurons, and it models long-range dependencies poorly. The pioneering Transformer model, introduced by Vaswani et al. [10], first garnered remarkable success in Natural Language Processing (NLP). Subsequently, Wang [11] introduced Point-Transformer, which effectively handles variable-length data and global information, yielding improved classification accuracy and generalization; it also made significant strides in modeling point-to-point interactions. He [12] engineered PointCloudTransformer, harnessing the Transformer's self-attention mechanism to capture the global information of point cloud data while employing convolutional neural networks to handle local information, thereby achieving highly efficient classification. However, Transformers prove less effective at capturing the topological structure of point clouds.
To enable each point to capture a broader context and obtain richer local hierarchies, some scholars have proposed utilizing graph structures for point cloud analysis. Graph CNNs [5, 13, 14, 15, 16] represent point clouds as graph data based on spatial/feature similarities between points and extend 2D image convolution to 3D data. To handle unordered point sets with varying neighborhood sizes, standard graph convolution employs a shared weight function for every pair of points to extract the corresponding edge features. This results in a fixed/isotropic convolution kernel that is applied to all pairs of points, overlooking their distinct feature correspondences. Intuitively, for points from different semantic parts of a 3D point cloud (such as the adjacent points in Figure 1), the convolution kernel should be able to differentiate them and determine their varying contributions. To address this limitation, several dedicated networks have been introduced, including neighborhood feature pooling [2], attention-based aggregation [17], and local-global feature fusion methods [5, 18, 19]. By assigning appropriate attention weights to neighboring points, these approaches attempt to identify the points' varying importance during convolution. However, they still fundamentally rely on fixed-kernel convolutions, since the attention weights are applied to features obtained in the same fixed way (as indicated by the black arrows in Figure 1 part b). As illustrated in Figure 1 part a, standard graph convolution applies a fixed and isotropic kernel (black arrows) to compute features for each point. In Figure 1 part b, several attention weights are assigned to these features to determine their importance. In contrast to the previous two, AdaptConv (Figure 1 part c) generates an adaptive kernel $\hat{e}_{ij}$ that is unique to the feature learning of each point.
To address this, we propose a novel deep learning model called Att-AdaptNet (Figure 2), featuring attention-based global feature masking and channel weighting, realized by the global attention module and the adaptive graph convolution module, respectively. The end-to-end model takes point clouds of 768 points as input for classification learning. The model has two primary branches. The first branch focuses on the influence of each local point and produces, at its end, a global mask that weights the contribution of each point to the point cloud features; to capture fine-grained regions of the point cloud, the global features are multiplied by this mask to obtain the final attention-based features. The other branch employs adaptive graph convolution to generate adaptive kernels, replacing the aforementioned isotropic kernels (see Figure 1 part c). The adaptive kernels achieve adaptivity within the convolution operation itself, as opposed to merely assigning different weights to adjacent points.
Experiments demonstrate that, on the widely used ModelNet40 benchmark dataset, the proposed model outperforms many existing models. To ensure a fair comparison, following common practice in deep learning papers, the proposed approach is benchmarked against other models on ModelNet40. The key reason for its superiority lies in the innovative introduction of attention mechanisms into point cloud feature extraction: each point plays a unique role in describing the overall structure, so the model assigns an individual weight to each point during the feature integration stage, while also emphasizing the crucial feature channels that represent intrinsic geometric information in high-dimensional space. The main contributions of this paper are summarized as follows:
(1) We propose a novel 3D point cloud classification method, named Att-AdaptNet, based on attention and adaptive graph convolution. This method can directly process raw point clouds and employs attention mechanisms, through global feature masking and adaptive graph convolution, to focus on salient feature regions.
(2) We utilize adaptive graph convolution to extract global features from 3D point clouds, effectively and precisely capturing diverse relationships among points from different semantic parts.
(3) The proposed approach is trained and tested on the ModelNet40 benchmark dataset, achieving a classification accuracy of 93.3% and demonstrating significant performance improvements over other methods.
2. Related Works
Self-attention networks have garnered significant attention for their ability to extract discriminative features of interest, allowing models to identify focal points. Thus far, self-attention-based models have found wide application in tasks such as machine translation, caption generation [20], speech recognition [21], and adversarial networks [22]. The self-attention mechanism is designed to enable the network to learn context beyond the receptive field. One of the first successful incorporations of this mechanism into CNNs was the Squeeze-and-Excitation network [23].
Veličković et al. introduced the graph attention mechanism and constructed the corresponding Graph Attention Network (GAT) [24]. It primarily uses self-attention to obtain attention coefficients, normalizes them, and then linearly combines them with the corresponding feature vectors to produce the final output features. PCAN [17] proposed an attention mechanism for local feature aggregation to distinguish positively contributing local features; however, it mainly employs a point-wise structure to extract local features and does not particularly focus on local geometric structures. GAC [16] introduced an attention mechanism based on the PointNet architecture, where attention weights learned from neighboring points capture discriminative features; this method achieved good performance. Chen et al. [25] presented the GAPNet model, which aggregates attention features for each point in the neighborhood using a multi-head attention mechanism and applies stacked MLP layers to capture local geometric features from the original point cloud, achieving promising results. Yang et al. [26] developed the Point Attention Transformer (PAT) to model interactions between points, employing parameter-efficient Group Shuffle Attention (GSA) instead of expensive multi-head attention.
Influenced by attention mechanisms and pyramid pooling, several methods have been proposed to better capture local geometric information. GGM-Net [27] introduced a graph geometry moment convolutional neural network that learns local geometric features from the geometric moment representations of local point sets. AGCN [28] avoids shared spectral kernels and instead assigns a customized Laplacian graph to each sample, providing an objective description of its graph-convolution topology. Li [29] aimed to extract precise pixel-level attention from the high-level features produced by CNNs, proposing the Feature Pyramid Attention (FPA) module, which effectively enlarges the receptive field and aids the classification of small objects by embedding context features of different scales in an FCN-based pixel prediction framework. PyramNet [30] designed two new operators, the Graph Embedding Module (GEM) and the Pyramid Attention Network (PAN): GEM projects point clouds onto graphs and uses covariance matrices to explore relationships between points, enhancing the model's ability to represent local features, while PAN assigns strong semantic features to each point, preserving fine-grained geometric features as much as possible. Wang et al. [16] introduced GACNN, an end-to-end encoder-decoder network that captures multi-scale features of point clouds for more accurate point cloud classification.
3. Model Construction
In recent years, deep neural networks have emerged as a primary tool for image analysis, and deep learning, with its capacity for large-scale learning, has also gained popularity in 3D point cloud classification. Since the introduction of PointNet [1], recent works have extracted global features of point sets by grouping and aggregating the features of all individual points. However, these approaches are limited in their ability to detect structural differences between objects. This paper therefore proposes a novel deep learning model called Att-AdaptNet.
3.1. Adaptive Graph Convolution Module
The adaptive graph convolution is an extension of graph convolution; the configuration of the adaptive convolution module in this paper is the same as that in AdaptConvNet [28]. The structure of this module is illustrated in Figure 3. Let $X = \{x_i \mid i = 1, 2, \dots, N\} \in \mathbb{R}^{N \times 3}$ be the input point cloud, with corresponding features defined as $F = \{f_i \mid i = 1, 2, \dots, N\} \in \mathbb{R}^{N \times D}$. Here, $x_i$ represents the $(x, y, z)$ coordinates of the $i$-th point; in general, it can be augmented with other attributes such as normals and colors. A graph is then constructed for each point, including self-loops, by considering its k nearest neighbors (KNN), resulting in a directed graph $G(V, E)$, where $V = \{1, 2, \dots, N\}$ is the set of vertices and $E \subseteq V \times V$ is the set of edges. Given $D$-dimensional input features, the AdaptConv [24] layer aims to generate a new set of $M$-dimensional features with the same number of points, while reflecting local geometric features more accurately than previous graph convolutions.
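As a concrete illustration, the following minimal PyTorch sketch (not the authors' released code) shows one way to build such a k-NN neighborhood with self-loops; the function name and tensor shapes are our own assumptions.

```python
import torch

def knn_graph(x: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices (N, k) of each point's k nearest neighbors.

    x: (N, 3) point coordinates. Each point is its own nearest
    neighbor (distance 0), which realizes the self-loop in G(V, E).
    """
    dist = torch.cdist(x, x, p=2)                  # pairwise distances, (N, N)
    idx = dist.topk(k, largest=False).indices      # k smallest distances, (N, k)
    return idx
```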
The adaptive kernel, denoted as $\hat{e}_{ijm}$, is generated from the input features of a pair of points on an edge. It is then convolved with the corresponding spatial input to produce the edge feature component $h_{ijm}$. All $M$ components are concatenated to produce the edge feature $h_{ij}$, which is finally pooled to output the feature of the central point. What sets AdaptConv apart from other graph convolutions is that the convolution kernel is unique for each pair of points. Here, $x_i$ is the central point in the graph convolution, and $\mathcal{N}(i)$ is the set of points in its neighborhood. Owing to the irregularity of point clouds, previous methods often apply a fixed kernel function to all of $x_i$'s neighbors to capture the geometric information of the patch. However, different neighbors reflect different features of $x_i$, especially when $x_i$ lies in prominent regions such as corners or edges; a fixed kernel may therefore produce geometric representations that are not well suited for classification.
Therefore, this paper aims to capture the unique relationship between each pair of points using an adaptive kernel. For each channel of the $M$-dimensional output features, AdaptConv dynamically generates a kernel from the point features $(f_i, f_j)$, as in Equation (1):

$$\hat{e}_{ijm} = g_m([\Delta f_{ij}, f_i]), \quad j \in \mathcal{N}(i) \tag{1}$$

Here, $m = 1, 2, \dots, M$ indexes one of the $M$ output dimensions, corresponding to a single filter defined in AdaptConv. To combine the global shape structure captured in the local neighborhood [6] with feature differences, this paper defines $[\Delta f_{ij}, f_i]$, with $\Delta f_{ij} = f_j - f_i$, as the input feature of the adaptive kernel, where $[\cdot, \cdot]$ denotes the concatenation operation. $g(\cdot)$ is a feature mapping function, here implemented as a multi-layer perceptron.
Similar to 2D convolution, where $D$ input channels are combined with their respective filter weights to obtain one of the $M$ output dimensions, convolution is then applied between the adaptive kernel and the corresponding points $(x_i, x_j)$, as shown in Equation (2):

$$h_{ijm} = \sigma \langle \hat{e}_{ijm}, \Delta x_{ij} \rangle \tag{2}$$

In Equation (2), $\Delta x_{ij}$ is defined as $[x_j - x_i, x_i]$, $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, and the result is subject to a non-linear activation function $\sigma$. As shown in Figure 3, the $m$-th adaptive kernel $\hat{e}_{ijm}$ combines with the spatial relation $\Delta x_{ij}$ of the corresponding point $x_j$. The sizes must match in the dot product, meaning the feature mapping is $g: \mathbb{R}^{2D} \rightarrow \mathbb{R}^{6}$, as mentioned earlier. This allows spatial positions in the input space to be effectively incorporated into each layer and combined with the features extracted dynamically by the kernel. The outputs $h_{ijm}$ of all $M$ channels are stacked to form the edge feature $h_{ij} = [h_{ij1}, h_{ij2}, \dots, h_{ijM}]$ between points $x_i$ and $x_j$. Finally, the output feature of the central point is obtained by applying an aggregation function to all edge features in its neighborhood, as in Equation (3):

$$f_i' = \max_{j \in \mathcal{N}(i)} h_{ij} \tag{3}$$

In Equation (3), max denotes a channel-wise maximum pooling function. To summarize, the convolutional weights of AdaptConv are defined by Equation (4):

$$\Theta = (g_1, g_2, \dots, g_M) \tag{4}$$
In this experiment, AdaptConv generates an adaptive kernel for each pair of points based on their respective features $(f_i, f_j)$. This kernel, $\hat{e}_{ijm}$, is then applied to the point pair $(x_i, x_j)$ to describe their spatial relationship in the input space. In other cases, the input may include additional dimensions representing other valuable point attributes, such as point normals and colors, and the adaptive kernel can be modified accordingly so that AdaptConv captures relationships between feature dimensions and spatial coordinates from different domains. In the experiments of this paper, spatial positions are used as the default input of the convolution. Rather than spatial offsets alone, the feature difference $\Delta f_{ij}$ is employed, and the adaptive kernel of each pair of points is designed to establish relationships between their current features at each layer. This allows the kernel to adapt to the features of the previous layer and extract feature relationships. It is a direct solution, similar to other convolutional operators, in that it generates a new set of learned features from the features of the previous layer of the network.
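To make Equations (1) to (3) concrete, the following is a minimal PyTorch sketch of an AdaptConv-style layer under the definitions above. It is an illustrative reconstruction, not the authors' implementation: the single linear layer standing in for the MLP $g$ and the ReLU standing in for $\sigma$ are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AdaptConvSketch(nn.Module):
    """Sketch of adaptive graph convolution (Equations (1)-(4))."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # g: R^{2D} -> R^{6M}; one 6-dim adaptive kernel per output channel,
        # matching the dot product with delta_x = [x_j - x_i, x_i] in R^6.
        # A single linear layer stands in for the MLP g.
        self.g = nn.Linear(2 * in_dim, out_dim * 6)
        self.out_dim = out_dim

    def forward(self, x, f, idx):
        # x: (N, 3) coordinates, f: (N, D) features, idx: (N, k) neighbor ids.
        N, k = idx.shape
        xj, fj = x[idx], f[idx]                         # (N, k, 3), (N, k, D)
        delta_f = fj - f.unsqueeze(1)                   # f_j - f_i
        kernel_in = torch.cat(
            [delta_f, f.unsqueeze(1).expand_as(fj)], dim=-1)     # [df_ij, f_i]
        kernels = self.g(kernel_in).view(N, k, self.out_dim, 6)  # Eq. (1)
        delta_x = torch.cat(
            [xj - x.unsqueeze(1), x.unsqueeze(1).expand_as(xj)], dim=-1)
        # Eq. (2): inner product of each adaptive kernel with delta_x.
        h = torch.relu((kernels * delta_x.unsqueeze(2)).sum(-1))  # (N, k, M)
        # Eq. (3): channel-wise max pooling over the neighborhood.
        return h.max(dim=1).values                      # (N, M)
```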
After two AdaptConv layers and two graph convolution layers, i.e., following the output of the final layer, the model further applies a shared MLP and an SE-1d block to obtain the global feature representation, as illustrated in Equation (5).
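The paper does not detail the SE-1d block, but assuming it follows the standard Squeeze-and-Excitation design applied to per-point 1-D features, a sketch could look like this (the reduction ratio r is an assumed hyperparameter):

```python
import torch
import torch.nn as nn

class SE1d(nn.Module):
    """Squeeze-and-Excitation over channels of per-point features."""

    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, C, N) per-point features.
        w = self.fc(x.mean(dim=-1))      # squeeze over points -> (B, C)
        return x * w.unsqueeze(-1)       # excite: channel-wise reweighting
```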
3.2. Global Attention
For each point $x_i \in X$, a subset is defined with $x_i$ as its center, selecting the $k-1$ points closest to $x_i$ (excluding the center itself). The KNN query for $x_i$ can thus be written as Equation (6):

$$\mathcal{N}(x_i) = \{x_i, x_i^1, x_i^2, \dots, x_i^{k-1}\} \tag{6}$$

where $x_i^k$ denotes the $k$-th closest point to $x_i$, obtained by the kNN query. The grouped input can then be represented as in Equation (7):

$$\tilde{X} = \{\mathcal{N}(x_i) \mid i = 1, 2, \dots, N\} \in \mathbb{R}^{N \times k \times 3} \tag{7}$$
The input to this module differs from that of the AdaptConv module: the Global Attention Module additionally receives geometric features, and this augmented input takes the form shown in Equation (8), where $x_i^j - x_i$ is the relative position of a neighbor, $\|\cdot\|$ denotes the Euclidean distance, and $k$ is the number of points in each group. The structure of the Global Attention Module is depicted in Figure 4.
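Below is a sketch of how the grouped, geometry-augmented input of Equations (6) to (8) might be assembled, reusing the knn_graph indices from Section 3.1; the exact feature layout (center, neighbor, offset, distance) is an assumption on our part.

```python
import torch

def grouped_geometric_features(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Build grouped input augmented with geometric features.

    x: (N, 3) coordinates; idx: (N, k) neighbor indices from the kNN query.
    Returns (N, k, 10): center, neighbor, offset, and Euclidean distance.
    """
    xi = x.unsqueeze(1).expand(-1, idx.shape[1], -1)   # center x_i, (N, k, 3)
    xj = x[idx]                                        # neighbors, (N, k, 3)
    offset = xj - xi                                   # relative position
    dist = offset.norm(dim=-1, keepdim=True)           # Euclidean distance
    return torch.cat([xi, xj, offset, dist], dim=-1)
```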
In this module, similar to the channel attention in SENet [31], two 1×1 2D convolutional layers are used to reduce the dimensionality of the grouped features (the input to this module), and a sigmoid function generates a soft attention mask. For a specific point cluster $\mathcal{N}(x_i)$ centered at $x_i$, the importance of $x_i$ is computed by Equation (9), where the output channel of the second convolutional layer is 1 and the sigmoid activation is defined as $\sigma(z) = 1/(1 + e^{-z})$. Finally, the module outputs the learned soft mask $m$.
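Assuming the mask branch follows this description, a hypothetical sketch is given below. The hidden width and the max-aggregation over each neighborhood are our assumptions, since the text does not specify how per-neighbor scores are reduced to one score per point.

```python
import torch
import torch.nn as nn

class GlobalAttentionMask(nn.Module):
    """Soft-mask branch: two 1x1 2D convolutions followed by a sigmoid."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=1)   # output channel is 1

    def forward(self, grouped):
        # grouped: (B, C_in, N, k) grouped geometric features.
        h = torch.relu(self.conv1(grouped))
        h = self.conv2(h)                  # (B, 1, N, k)
        h = h.max(dim=-1).values           # aggregate over each neighborhood
        return torch.sigmoid(h)            # (B, 1, N) soft mask in (0, 1)
```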
The rationale for the global attention mechanism is straightforward. Each object class possesses distinctive feature patterns that may involve subtle parts, such as guitar strings or airplane wings, and these patterns can be overlooked during the aggregation that pools a large number of features. Hence, the importance of each group $\mathcal{N}(x_i)$ must be measured, and the global feature is weighted by the learned soft mask $m$.
Furthermore, the reason for incorporating additional crucial geometric information (namely, the Euclidean distance $\|x_i^j - x_i\|$) into the global attention module is to speed up and improve the learning of the global soft mask $m$. While MLPs can theoretically approximate any nonlinear function, including high-order information such as the squared Euclidean distance (2nd order: $\|x_i^j - x_i\|^2$), the literature suggests that models with high-order convolutional filters can achieve higher classification accuracy on several benchmarks [31]. For the same reason, the proposed model also feeds this additional geometric information to the shared MLP to help it effectively discover feature patterns and determine the importance $m_i$ of each input point $x_i$.
3.3. A 3D Point Cloud Classification Method Based on Adaptive Graph Convolution and Attention
After obtaining the mask $m$ from the Global Attention Module and the global features, this paper multiplies them element-wise and generates the new global features with a ReLU activation. Following the practice of PointNet for 3D point cloud classification, most models use max-pooling rather than average-pooling layers. Intuitively, max-pooling should be superior to avg-pooling, as the strongest activation may represent the most prominent feature of a class; however, avg-pooling results can also reflect important class features, otherwise models using average pooling would produce unreasonable results. To gather more valuable information, the experiment aggregates all points of the global features using both max-pooling and average-pooling simultaneously. The outputs of the avg-pooling and max-pooling layers are concatenated into a complete classification vector of dimension 2048. Finally, a 3-layer MLP outputs the classification scores, where $C$, $C/R$, and $C$ denote the dimensions of the three layers of the MLP, with $R$ a reduction factor that reduces parameter complexity, as illustrated in Figure 5.
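The following sketch assembles the head described above: mask-weighted features, parallel max- and avg-pooling concatenated into a 2048-dimensional vector (for C = 1024), and a 3-layer MLP. Mapping the final layer to the number of classes (40 for ModelNet40) and the values C = 1024, R = 4 are our interpretation, not a confirmed configuration.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Mask-weighted dual pooling followed by a 3-layer MLP classifier."""

    def __init__(self, C: int = 1024, R: int = 4, num_classes: int = 40):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * C, C),            # 2048 -> C
            nn.ReLU(inplace=True),
            nn.Linear(C, C // R),           # C -> C/R (reduction factor R)
            nn.ReLU(inplace=True),
            nn.Linear(C // R, num_classes), # final layer emits class scores
        )

    def forward(self, feats, mask):
        # feats: (B, C, N) per-point global features; mask: (B, 1, N) soft mask.
        x = torch.relu(feats * mask)                   # attention-weighted features
        pooled = torch.cat(
            [x.max(dim=-1).values, x.mean(dim=-1)],    # max- and avg-pooling
            dim=-1)                                    # (B, 2C), e.g. 2048-dim
        return self.mlp(pooled)
```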