3.1. 3D n-Sigmoid Channel and Spatial Attention Mechanism
Attention mechanisms, which enable a neural network to focus accurately on the most relevant elements of the input, have become an essential component for improving the performance of deep neural networks.
In this paper, the authors propose a lighter but more efficient n-shifted sigmoid channel and spatial attention (CSA) module that combines spatial and channel attention to reduce computational overhead and to enhance the selection of relevant 3D scene features (Figure 1). This module is integrated into the VoteNet architecture to improve the model's ability to handle complex 3D object detection.
In Figure 1, the multiplication is an elementwise product between channel and spatial attention. The n-shifted sigmoid CSA splits the channels to allow parallel processing of each group of sub-features. The channel attention branch uses a pair of parameters to scale and shift the channel vector, while the spatial attention branch adopts a group normalization to generate spatial-wise statistics. The two branches are then multiplied together elementwise before all sub-features are aggregated.
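As an illustrative shape trace of this split into groups and branches (the tensor sizes are assumptions taken from the default configuration used in the Python code given later in this section):

import torch

b, c, h, w = 2, 256, 32, 32           # batch size, channels, spatial height/width (illustrative values)
groups = 64                           # number of channel groups

x = torch.randn(b, c, h, w)
x = x.reshape(b * groups, -1, h, w)   # group the channels: (b*groups, c/groups, h, w) = (128, 4, 32, 32)
x_0, x_1 = x.chunk(2, dim=1)          # channel branch x_0 and spatial branch x_1, each (128, 2, 32, 32)
print(x_0.shape, x_1.shape)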
The new n-sigmoid CSA layer strategically combines channel and spatial attention mechanisms to enable the network to focus on crucial features while preserving spatial information. Specifically, the CSA layer is designed to bring significant improvements to the feature learning process, which plays a pivotal role in accurately predicting and localizing 3D objects in point clouds [24].
Contrary to common usage in image-related CNNs, here the attention module is not placed repeatedly after each encoder (Set Abstraction) and decoder (Feature Propagation) layer of the VoteNet backbone, but rather only once after the backbone that learns the features and before the voting module that estimates the object centers.
This procedure improves the accuracy score for several reasons: (1) upgraded discriminative features, since the integration of the CSA module enhances the discriminative power of the features used in Hough voting; (2) context-aware voting, where the model adapts its voting strategy based on the learned context; and (3) adaptive attention, where the model dynamically adjusts the importance of different votes based on the spatial and channel-wise context.
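A minimal sketch of this placement is given below; the module and argument names are hypothetical, and the actual VoteNet backbone, voting, and proposal interfaces differ, so this only illustrates where the CSA layer (defined later in this section) is wired in:

import torch.nn as nn

class VoteNetWithCSA(nn.Module):
    # Illustrative wiring only: the CSA layer sits once between the backbone and the voting module.
    def __init__(self, backbone, voting, proposal, feat_channels=256):
        super().__init__()
        self.backbone = backbone                        # Set Abstraction / Feature Propagation layers
        self.csa = csa_layer(feat_channels, groups=64)  # n-shifted sigmoid CSA module
        self.voting = voting                            # Hough voting module that estimates object centers
        self.proposal = proposal                        # box proposal and classification head

    def forward(self, points):
        feats = self.backbone(points)   # learned features from the backbone
        feats = self.csa(feats)         # attention applied once, before voting
        votes = self.voting(feats)      # context-aware votes from attended features
        return self.proposal(votes)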
The CSA layer operates by dividing the feature map X ∈ R^(C×H×W), where C, H, W are the channel number, spatial height, and width. The n-shifted sigmoid CSA divides X into G groups along the channel dimension, X = [X_1, ..., X_G] with X_k ∈ R^(C/G×H×W). At the start of each attention unit, the input X_k is split along its channels into two branches, X_k1, X_k2 ∈ R^(C/2G×H×W), namely the channel attention branch and the spatial attention branch (employed in [12]); at the end, combining their outputs by an elementwise product improves the accuracy results. The channel attention branch employs average pooling to capture the essence of the input features across different channels (Figure 2).
Beyond previous works, the authors propose that max pooling also be used simultaneously to gather another important clue about distinctive object features and to infer finer channel-wise attention. Thus, both average-pooled and max-pooled features are used concurrently; using both greatly improves the network's representation power compared with using either one independently.
In Figure 2, the channel sub-module utilizes both max-pooling and average-pooling outputs.
The enhanced channel attention branch (shown in Figure 3) first uses global average pooling (Eq. 1) together with global maximum pooling (Eq. 3) to generate channel-wise statistics s ∈ R^(C/2G×1×1).
This operation consists in:
- Average Pooling Branch:
Average pooling operation:
s_avg = F_gap(X_k1) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_k1(i, j) (1)
n-shifted sigmoid activation:
X'_avg = X_k1 · σ_n(W_avg · s_avg + b_avg) (2)
where σ_n(·) is the n-shifted activation function.
- Max Pooling Branch:
Max pooling operation:
s_max = F_gmp(X_k1) = max_{1≤i≤H, 1≤j≤W} X_k1(i, j) (3)
n-shifted sigmoid activation:
X'_max = X_k1 · σ_n(W_max · s_max + b_max) (4)
where σ_n(·) is the n-shifted activation function.
In both pooling branches, a compact feature is created to enable guidance for precise and adaptive selection. This is achieved by a gating mechanism using the n-shifted sigmoid activation [22] on both the average pooling (Eq. 2) and the max pooling (Eq. 4) operations.
The final output of the channel attention (Eq. 5) is obtained by multiplying the gated average-pooled and max-pooled tensors elementwise:
X'_k1 = X'_avg ⊙ X'_max (5)
where W_avg, b_avg ∈ R^(C/2G×1×1) and W_max, b_max ∈ R^(C/2G×1×1) are parameters used to scale and shift the channel-wise statistics.
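As a concrete shape trace of Eqs. 1-5 (a sketch only: a plain sigmoid stands in for the n-shifted sigmoid defined in [22], and the parameter shapes follow the C/2G × 1 × 1 form stated above):

import torch

x_k1 = torch.randn(128, 2, 32, 32)                  # channel-branch input of shape (b*G, C/2G, h, w)
s_avg = x_k1.mean(dim=(2, 3), keepdim=True)         # Eq. 1: global average pooling, (128, 2, 1, 1)
s_max = x_k1.amax(dim=(2, 3), keepdim=True)         # Eq. 3: global max pooling, (128, 2, 1, 1)
w_avg, b_avg = torch.zeros(1, 2, 1, 1), torch.ones(1, 2, 1, 1)   # scale/shift parameters (avg branch)
w_max, b_max = torch.zeros(1, 2, 1, 1), torch.ones(1, 2, 1, 1)   # scale/shift parameters (max branch)
x_avg = x_k1 * torch.sigmoid(w_avg * s_avg + b_avg)  # Eq. 2 (plain sigmoid as a stand-in)
x_max = x_k1 * torch.sigmoid(w_max * s_max + b_max)  # Eq. 4 (plain sigmoid as a stand-in)
x_channel = x_avg * x_max                            # Eq. 5: channel attention output, (128, 2, 32, 32)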
This customized non-linear activation function, the n-shifted sigmoid, is tailored to accentuate relevant features while suppressing noise and irrelevant information.
On the other hand, the spatial attention branch focuses on “where” most of the scene information lies, and is complementary to channel attention. At the outset, a group normalization (GN) [26] is applied over X_k2 to obtain spatial-wise statistics. F_c(X_k2) = W_s · GN(X_k2) + b_s is the compact feature generated from the spatial branch, and F_c(·) enhances the representation of X_k2. The final output of spatial attention is:
X'_k2 = σ_n(W_s · GN(X_k2) + b_s) (6)
where W_s and b_s are parameters of shape R^(C/2G×1×1) and σ_n is the n-sigmoid activation.
So, the spatial attention utilizes group normalization to process the features, followed by the application of the same n-sigmoid activation function to refine the spatial information. The n-shifted sigmoid activation function exhibits a distinctive behavior that allows for controlled feature enhancement based on a learned scaling factor.
It effectively emphasizes significant features while dampening the impact of less important ones, thereby enabling the model to focus on relevant information critical for accurate 3D object detection. This controlled non-linearity allows the n-shifted sigmoid CSA layer to adaptively reshape the feature space, leading to a more discriminative and informative representation for the subsequent stages of the network.
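To illustrate the gating behavior in isolation (again with a plain sigmoid as a stand-in for the n-shifted variant from [22]), large pre-activation statistics produce gate values near 1 that preserve a feature, while small or negative ones are damped toward 0:

import torch

stats = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])   # example pre-activation statistics
gates = torch.sigmoid(stats)                         # gate values in (0, 1)
print(gates)                                         # tensor([0.0474, 0.3775, 0.5000, 0.6225, 0.9526])
features = torch.ones(5)
print(features * gates)                              # strong responses are kept, weak ones are suppressed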
The last but very important step is to multiply the channel and spatial attentions (Eq. 7), which acts as a gating mechanism where each channel’s importance is modulated by its spatial relevance. Here the channel and spatial attentions are multiplied elementwise to preserve the feature representations influenced by both mechanisms. This provides a different form of modulation, where multiplication emphasizes regions in which both channel and spatial attentions are high, potentially focusing more on salient features:
X'_k = x_n ⊙ x_s (7)
where x_n = X'_k1 is the channel attention and x_s = X'_k2 is the spatial attention.
Python code of the proposed n-shifted sigmoid CSA
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter

class csa_layer(nn.Module):
    """Constructs the n-shifted sigmoid Channel Spatial Attention (CSA) module.

    Args:
        channel: number of input feature channels (the parameter shapes below
            assume channel = 4 * groups, e.g. 256 with the default groups)
        groups: number of groups the channel dimension is split into
    """
    def __init__(self, channel, groups=64):
        super(csa_layer, self).__init__()
        self.groups = groups
        # Global pooling layers used to generate compact statistics for each group
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        # Scale/shift parameters of the average- and max-pooling channel branches
        self.cweight_avg = Parameter(torch.zeros(1, channel // (4 * groups), 1, 1))
        self.cbias_avg = Parameter(torch.ones(1, channel // (4 * groups), 1, 1))
        self.cweight_max = Parameter(torch.zeros(1, channel // (4 * groups), 1, 1))
        self.cbias_max = Parameter(torch.ones(1, channel // (4 * groups), 1, 1))
        # Scale/shift parameters of the spatial branch
        self.sweight = Parameter(torch.zeros(1, channel // (2 * groups), 1, 1))
        self.sbias = Parameter(torch.ones(1, channel // (2 * groups), 1, 1))
        self.sigmoid = nn.Sigmoid()
        self.gn = nn.GroupNorm(channel // (2 * groups), channel // (2 * groups))

    def forward(self, x):
        b, c, h, w = x.shape
        # Split the channels into groups so that each group is processed in parallel
        x = x.reshape(b * self.groups, -1, h, w)
        # Each group is divided into a channel attention branch (x_0) and a spatial attention branch (x_1)
        x_0, x_1 = x.chunk(2, dim=1)

        # Channel attention
        # Average pooling branch (Eqs. 1-2)
        xn_avg = self.avg_pool(x_0)
        xn_avg = self.cweight_avg * xn_avg + self.cbias_avg
        xn_avg = x_0 * self.sigmoid(xn_avg)
        # Max pooling branch (Eqs. 3-4)
        xn_max = self.max_pool(x_0)
        xn_max = self.cweight_max * xn_max + self.cbias_max
        xn_max = x_0 * self.sigmoid(xn_max)
        # Combine the gated average- and max-pooled tensors elementwise (Eq. 5);
        # the elementwise product keeps the channel branch the same shape as the spatial branch
        xn = xn_avg * xn_max

        # Spatial attention (Eq. 6): group normalization, scale/shift, then the gating activation
        xs = self.gn(x_1)
        xs = self.sweight * xs + self.sbias
        xs = self.sigmoid(xs)

        # Multiply channel and spatial attentions (Eq. 7) and restore the batch dimension
        out = xn * xs
        out = out.reshape(b, -1, h, w)
        return out
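A minimal usage sketch of the layer above, with an illustrative feature-map size rather than the actual VoteNet backbone output:

layer = csa_layer(channel=256, groups=64)
feats = torch.randn(2, 256, 32, 32)   # (batch, channels, height, width), illustrative sizes
out = layer(feats)
print(out.shape)                      # torch.Size([2, 128, 32, 32]); the elementwise combination halves the channels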
The introduction of this new n-shifted sigmoid CSA attention module presents several innovations, including:
- Combination of Average and Max Pooling: Attention mechanisms normally use either average pooling or max pooling to aggregate channel-wise data. Merging both average and max pooling allows the model to capture different aspects of the feature maps, thereby enhancing its capacity to attend to relevant features.
- Multiplying Channel and Spatial Attention: While addition is a common operation for combining attention mechanisms, multiplying channel and spatial attentions provides a different form of modulation of the scene in question. Multiplication emphasizes regions where both channel and spatial attentions are important, thus focusing more on salient features.
- The use of the n-shifted sigmoid activation as a gating mechanism in the attention module.
- The flexibility to learn various types of interactions between spatial and channel attention. Furthermore, the use of trainable parameters in both the max and average pooling branches, as well as in the channel and spatial attention branches, empowers the model to learn adaptively.