Our architecture is designed as a hybrid model.
3.1. AYOLO Architecture
The general view of the AYOLO architecture is shown in Figure 2 below.
We perform object detection and recognition by taking the YOLOv6 and DETR architectures as references. In YOLOv6, the FPN structure is divided into two branches to improve the ability to detect objects of different sizes; these branches produce large and small feature maps, respectively.
A Residual Network is often used as the backbone to create feature maps with different resolutions. While low-resolution feature maps contain rich semantic and channel information, high-resolution feature maps contain rich spatial (location) information. FPN's top-down fusion strategy gradually mines the rich semantic information contained in the high-level feature layers. However, FPN's use of a simple 1x1 convolution to reduce dimensionality results in the loss of some channel information, which prevents full use of the channel information.
The information fusion structure of FPN essentially consists of multiple layers. In our model, information enters the first pyramid network through four stages and the second pyramid network through three stages. With this arrangement of input layers, information fusion occurs directly between adjacent layers, while information from non-adjacent layers is fused only indirectly. Information overflow can cause significant information loss during computation: in the interactions between layers, the information selected by the intermediate layers can be altered and some of it discarded. We should therefore not ignore the fact that information at a given level mainly helps its neighboring layers, while the information delivered to more distant layers is diminished. As a result, the overall effectiveness of information fusion is limited.
We update the traditional YOLOv6 architecture to overcome the issues mentioned above. To prevent information loss during layer-to-layer transmission, we propose a logical aggregation and information extraction mechanism, the Semantic Aggregation, Enhancement and Distribution Module (SAEDM), which integrates new location and channel information and is inspired by the Vision Transformer (ViT) module. This integrated module takes information from all levels, merges it, and then exploits the attention mechanism to use it at different levels. With this design, we prevent the information loss inherent in the traditional YOLO structure and increase information fusion and inference capability without significantly increasing latency.
In our application, the SAEDM module consists of the following parts:
Dilated convolutions: Using dilated convolutions, which expand the convolution filter, to capture non-neighbouring, more distant information (a minimal sketch is given after this list).
Pyramid architecture: Using the pyramid structure to better integrate features at different scales and obtain semantic information.
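As an illustration of the dilated-convolution component, the following minimal PyTorch sketch (our own example with arbitrary channel counts, not the exact AYOLO layer) shows how a 3x3 convolution with dilation 2 enlarges the receptive field without changing the number of weights or the spatial size:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation=2 covers a 5x5 area (larger receptive
# field) while keeping the same number of weights as a plain 3x3 kernel.
# padding=dilation keeps the spatial size unchanged for a 3x3 kernel.
dilated = nn.Conv2d(in_channels=256, out_channels=256,
                    kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 256, 40, 40)   # N x C x (H x W) feature map
y = dilated(x)
print(y.shape)                    # torch.Size([1, 256, 40, 40])
```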
In the YOLO architecture, the information inputs come from the layers produced by downsampling in the backbone. The feature maps from each layer have sizes $N \times C \times R$, where $N$ is the batch size, $C$ is the number of channels, and $R = H \times W$ is the spatial resolution.
3.2. Feature Alignment Module - FAM
YOLOv6 uses the Rep-PAN [
51] pyramid structure to estimate features at different scales. Rep-PAN requires repetition of pyramid subsampling, which leads to loss of boundary details [
20]. This can cause misalignment of contextual feature maps. Direct use of the fusion method will make it difficult to estimate the boundary information and will cause misclassification of adjacent objects adjacent to the boundary.
FAM is meticulously designed to improve feature fusion, reduce network complexity, and preserve important low-level information, especially for small target features. The most important change in this module is bilinear pooling, which has emerged as a powerful technique for small object detection in computer vision [23,41,51,52,53,54].
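As a reference point for the bilinear pooling mentioned above, the following is a minimal sketch of classic bilinear pooling (the outer product of two feature maps, pooled over spatial positions); the helper name and the normalization choices are ours and are not taken from the AYOLO implementation:

```python
import torch

def bilinear_pool(feat_a, feat_b):
    """Illustrative bilinear pooling: the outer product of two feature maps
    at every spatial position, averaged over all positions.
    feat_a: (N, Ca, H, W), feat_b: (N, Cb, H, W) -> (N, Ca * Cb)."""
    n, ca, h, w = feat_a.shape
    a = feat_a.flatten(2)                    # (N, Ca, H*W)
    b = feat_b.flatten(2)                    # (N, Cb, H*W)
    out = torch.bmm(a, b.transpose(1, 2))    # (N, Ca, Cb), summed over positions
    out = out.flatten(1) / (h * w)           # average over positions
    # Common practice: signed square root followed by L2 normalization.
    out = torch.sign(out) * torch.sqrt(out.abs() + 1e-12)
    return torch.nn.functional.normalize(out, dim=1)

a = torch.randn(2, 64, 20, 20)
b = torch.randn(2, 64, 20, 20)
print(bilinear_pool(a, b).shape)             # torch.Size([2, 4096])
```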
Addressing Downsampling Challenges: In traditional YOLO architectures, repeated downsampling in the backbone network leads to the reduction or even loss of small target features as the feature level increases. FAM counters this problem by efficiently combining the subsampled features to create a rich receptive field, thus strengthening the model's ability to represent and detect small objects with high resolution and low computational cost.
Optimizing Feature Map Sizes: FAM ensures appropriate resizing of the feature maps in each layer, adhering to the smallest size in the group to avoid information loss and to manage computational cost. This approach facilitates efficient information collection while keeping computational complexity to a minimum, thanks in part to the integration of transformer modules [11,55].
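A minimal sketch of this resizing idea, assuming adaptive average pooling is used to bring every map in a group down to the smallest spatial size before concatenation (the helper name and the sizes are illustrative only):

```python
import torch
import torch.nn.functional as F

def align_to_smallest(features):
    """Illustrative sketch: every feature map in a group is downsampled
    (adaptive average pooling) to the smallest spatial size in that group,
    then the results are concatenated along the channel dimension."""
    target_hw = min((f.shape[-2], f.shape[-1]) for f in features)
    aligned = [F.adaptive_avg_pool2d(f, target_hw) for f in features]
    return torch.cat(aligned, dim=1)

feats = [torch.randn(1, 128, 80, 80),
         torch.randn(1, 256, 40, 40),
         torch.randn(1, 512, 20, 20)]
print(align_to_smallest(feats).shape)   # torch.Size([1, 896, 20, 20])
```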
Balancing Feature Preservation with Computational Cost: An important aspect of aligning features in FAM involves maintaining larger feature sizes to preserve essential low-level information that is rich in location and detail. However, as the feature size increases, the computational cost and delay in subsequent blocks also increase. Therefore, it is imperative to control feature size at lower levels to effectively manage delays.
Strategic Selection of the Feature Alignment Stage: With these considerations in mind, selecting the most appropriate stage for feature alignment in FAM is critical to striking a balance between speed and accuracy. After extensive testing and processing of data from different layers, Stage 4 was selected for feature alignment in our model. This decision is crucial to ensuring efficient computation and reduced latency while maintaining the integrity of low-level information.
Continuous Optimization and Testing: Our ongoing research includes experimenting with the number of layers and processing data from various layers to further improve the performance of FAM. This continuous optimization aims to increase AYOLO's accuracy and efficiency, especially in scenarios requiring high-precision object detection with minimal computational latency. The figure below shows the architecture of our FAM module (Figure 3).
3.3. Information Fusion Module IFM
Abstracting Complex Scale Variations for Improved Object Recognition: Because the scale of objects in camera-captured images varies with their distance from the lens, object recognition systems must skillfully manage this complexity to maintain performance. The Information Fusion Module (IFM) within the AYOLO architecture is designed to overcome such scale-induced challenges, improving the model's ability to distinguish complex patterns amid noise and semantic information disparities [56].
Overcoming the Limitations of Traditional Feature Fusion: Although low-level feature maps are adept at capturing location-specific details due to their smaller receptive fields, they are limited due to their sparse semantic content and high noise sensitivity. Conversely, high-level feature maps, although semantically richer, suffer from loss in fine-grained discrimination after extensive convolution operations. Traditional FPN focuses heavily on feature scale and ignores the need to distinguish between objects with varying levels of complexity in similar size ranges.
AYOLO’s Inter-scale Fusion Approach: AYOLO enables information to be transferred in a more detailed and efficient manner by filling the semantic gap between different layers. The IFM in AYOLO uses a dynamic fusion block that is effective in synthesizing high-level semantics with low-level spatial information, which is crucial for real-time object detection.
Design Innovations in IFM:
Channel Number Variability: IFM provides flexibility and control by modulating the number of channels for features at different scales; this minimizes model inference latency with negligible accuracy changes.
RepConv Block Integration: At the heart of IFM is the RepConv Block, which assimilates feature maps from different layers and evolves them to produce outputs that are then forked for encoder processing [40].
Encoder Processing and Feature Splitting: The encoder in IFM processes and splits the combined feature map, allowing an iterative combination with features from different layers. This splitting and subsequent recombination creates a rich, versatile feature output that intertwines with the features of various layers, creating a new output cascade for subsequent fusion. The IFM module architecture is shown in Figure 4.
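For orientation, the following is a minimal sketch of a RepVGG-style re-parameterizable block in the spirit of the RepConv Block [40]; only the training-time parallel branches are shown, and the inference-time fusion into a single 3x3 convolution is omitted (this is our illustrative sketch, not the AYOLO source):

```python
import torch
import torch.nn as nn

class RepConvSketch(nn.Module):
    """Training-time form of a re-parameterizable block: parallel 3x3 and
    1x1 branches plus an identity branch, summed before the activation."""

    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.identity = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv3x3(x) + self.conv1x1(x) + self.identity(x))

x = torch.randn(1, 256, 40, 40)
print(RepConvSketch(256)(x).shape)   # torch.Size([1, 256, 40, 40])
```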
3.4. Feature Aggregation and Alignment Module Second Pyramid
The FPN has two pyramid network structures that interact with each other through fusion. At the end of the first pyramid network, optimized, separate feature maps are available for each level, and each of these is used as input to the second network. The second pyramid network takes the feature maps from three stages as input. It is similar to the first pyramid structure, but its Feature Alignment Network aligns information from three layers: important features are preserved for selection, and 1x1 convolution is used for channel reduction. This effectively improves multi-scale feature aggregation.
Second Pyramid Feature Alignment Module SPFAM
In the SPFAMFPN architecture (Figure 5), both upsampling and downsampling are used to increase accuracy and speed. The module used for the alignment processing in the second-stage FPN reduces computation time. SPFAM consists of average pooling (avgpool) and concatenation operations applied to different layers: avgpool reduces the input features to a fixed size, namely the smallest size in the group, and for the transformer block input the pooled features are combined along the channel dimension by concatenation.
Second Pyramid Information Fusion Module SPIFM
SPIFM consists of a transformer block and a partition block. It is a CNN architecture for global feature extraction, inspired by the LeViT [57] network. The block consists of a series of operations performed in stages. Because the transformer module extracts high-level information, the pooling operation simplifies information collection while reducing the computational requirements (Figure 6).
Transformer fusion module: The transformer block consists of stacked transformers, and the number of transformer blocks is denoted by L. Each transformer block contains a multi-head attention block (MHA), a feed-forward network (FFN), and residual connections. A 1x1 convolution reduces the channel dimension C of the high-level activation map to a smaller dimension d. Since the block expects a sequence as input, we collapse the spatial dimensions into a single dimension, resulting in a d × HW feature map (Figure 7) [58].
The attention mechanism of a standard transformer module uses three sets of vectors: queries (Q), keys (K), and values (V). In our model, a convolutional mapping is used to generate Q, K, and V instead of the standard linear mapping [57]. The standard approach requires the network to perform multiple normalization operations, which reduces inference speed, whereas the convolution operation is generally faster and more efficient than a linear mapping. To speed up the network, we therefore replaced the slow linear mapping with a convolution operation.
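A minimal sketch of this convolutional Q/K/V idea, assuming a LeViT-style design in which 1x1 convolutions with Batch Normalization replace the linear projections and the spatial dimensions are flattened into a sequence only for the attention product (layer names and sizes are illustrative, not the AYOLO source):

```python
import torch
import torch.nn as nn

class ConvAttentionSketch(nn.Module):
    """Illustrative attention block with convolutional Q/K/V generation."""

    def __init__(self, channels, dim, heads=4):
        super().__init__()
        self.heads, self.dim = heads, dim
        # One 1x1 conv produces Q, K and V together; BN replaces LayerNorm.
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, 3 * dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(3 * dim))
        self.proj = nn.Conv2d(dim, channels, kernel_size=1)

    def forward(self, x):                       # x: (N, C, H, W)
        n, _, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)   # each (N, dim, H, W)

        def to_seq(t):
            # (N, dim, H, W) -> (N, heads, H*W, dim // heads)
            t = t.flatten(2).reshape(n, self.heads, self.dim // self.heads, h * w)
            return t.transpose(-1, -2)

        q, k, v = map(to_seq, (q, k, v))
        attn = (q @ k.transpose(-1, -2)) / (self.dim // self.heads) ** 0.5
        out = attn.softmax(dim=-1) @ v          # (N, heads, H*W, dim // heads)
        out = out.transpose(-1, -2).reshape(n, self.dim, h, w)
        return self.proj(out)

x = torch.randn(2, 256, 20, 20)
print(ConvAttentionSketch(256, dim=128)(x).shape)   # torch.Size([2, 256, 20, 20])
```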
Self-attention leads to high computational complexity ($O(n^{2}d)$ for $n \times d$ matrices) [59], which makes it difficult to apply to high-resolution feature maps. Another problem is the reshaping of a matrix into a vector when calculating the correlation coefficients between pixels; position information is lost in this reshaping.
To solve the computational complexity problem, UT-Net [60] proposed a new self-attention mechanism. Given the structural characteristics of imaging data, most of the pixels within a local region of a high-resolution feature map represent the same object and share similar features. Therefore, pair-wise self-attention is effectively low-rank [61].
Setting the subsampling factor of feature maps of all resolutions to eight ignores the effect of receptive field size across resolutions. In a lower-resolution feature map, each pixel corresponds to more pixels of the original image and thus has a larger receptive field, so the rank of its self-attention matrix is larger than that of a high-resolution feature map. Based on this, we set different subsampling factors for feature maps of different resolutions [59].
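The following short sketch illustrates resolution-dependent subsampling of the keys and values; the specific factors shown are hypothetical and only demonstrate the idea of varying the factor with feature-map resolution instead of fixing it at eight:

```python
import torch
import torch.nn.functional as F

def subsampled_kv(feature, factor):
    """Illustrative helper: the keys/values are produced from a feature map
    that is first subsampled by a resolution-dependent factor, which
    shortens the sequence entering the attention product."""
    pooled = F.avg_pool2d(feature, kernel_size=factor, stride=factor)
    return pooled.flatten(2).transpose(1, 2)     # (N, (H*W)/factor^2, C)

# Hypothetical choice: larger subsampling factor for higher-resolution maps.
factors = {80: 8, 40: 4, 20: 2}
for res, f in factors.items():
    x = torch.randn(1, 256, res, res)
    print(res, subsampled_kv(x, f).shape)        # e.g. 80 -> 100 tokens
```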
The position encoding used in the ViT block is one-dimensional absolute position encoding [62]. Relative position encoding is much more effective than absolute position encoding in the self-attention mechanism. In order not to lose the translation equivariance property of CNNs in our model, we applied two-dimensional relative position encoding by adding height and width information [63], as formulated in Equation (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T} + S^{H}_{rel} + S^{W}_{rel}}{\sqrt{d_k}}\right) V \qquad (1)$$

Here, $Q$, $K$, and $V$ are matrices that represent the queries, keys, and values, respectively. $QK^{T}$ is the dot product of queries and keys, which measures the similarity between elements of the sequence. $S^{H}_{rel}$ and $S^{W}_{rel}$ are matrices of relative positional embeddings that encode the positional (height and width) relationships between the different elements of the sequence; adding them injects location information into the attention scores, allowing the model to take into account how far apart sequence elements are when computing attention. $\sqrt{d_k}$ is a scaling factor based on the dimensionality of the keys that helps stabilize gradients during training. Finally, the output of the softmax is multiplied by $V$ to obtain the weighted sum of the input values based on the computed attention scores [57].
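A compact, single-head sketch consistent with Equation (1), in which learned height and width relative-position terms are added to the attention logits; the parameter names and the bias-table construction are our illustrative choices:

```python
import torch
import torch.nn as nn

class RelPosAttention2D(nn.Module):
    """Illustrative attention with 2D relative position terms."""

    def __init__(self, dim, h, w):
        super().__init__()
        self.dim, self.h, self.w = dim, h, w
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # One learned bias per possible height offset and per width offset.
        self.rel_h = nn.Parameter(torch.zeros(2 * h - 1))
        self.rel_w = nn.Parameter(torch.zeros(2 * w - 1))

    def forward(self, x):                       # x: (N, H*W, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Relative offsets between every pair of positions on the HxW grid.
        ys, xs = torch.meshgrid(torch.arange(self.h), torch.arange(self.w),
                                indexing="ij")
        ys, xs = ys.flatten(), xs.flatten()                 # (H*W,)
        dh = ys[:, None] - ys[None, :] + self.h - 1         # indices >= 0
        dw = xs[:, None] - xs[None, :] + self.w - 1

        logits = q @ k.transpose(-1, -2)                    # (N, L, L)
        logits = logits + self.rel_h[dh] + self.rel_w[dw]   # S_rel^H + S_rel^W
        attn = (logits / self.dim ** 0.5).softmax(dim=-1)
        return attn @ v

x = torch.randn(2, 20 * 20, 128)
print(RelPosAttention2D(128, 20, 20)(x).shape)   # torch.Size([2, 400, 128])
```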
This is particularly useful in object detection tasks, where relative positions are decisive. It makes it easier for the model to focus on the relevant parts of the input data and to better understand spatial relationships within the data, thus improving the overall performance of the network.
We use Batch Normalization to speed up inference.
Figure 8 shows the activation functions. We use GELU as the activation function throughout. Admittedly, using ReLU could further minimize the impact on the speed of the transformer module. However:
Mathematical Considerations: ReLU returns the input value when the input is positive and zero otherwise. This leads to the well-known problem of dead neurons, where neurons may become inactive during training and contribute nothing to the model's learning process. The GELU activation function, on the other hand, is defined using the Gaussian cumulative distribution function $\Phi(x)$ as in Equation (2):

$$\mathrm{GELU}(x) = x\,\Phi(x) \qquad (2)$$
Unlike ReLU, GELU has a non-zero gradient for virtually all input values. This allows gradient-based optimization even when the input to the activation function is negative, potentially reducing the likelihood of dead neurons. The non-zero gradient around zero allows GELU to backpropagate meaningful gradients even for near-zero inputs, so the network can learn from a wider range of input values, which can be critical for learning complex patterns.
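A short, illustrative check of this claim: at a negative input, ReLU backpropagates a zero gradient, while GELU still passes a small non-zero gradient:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.5], requires_grad=True)
F.relu(x).backward()
print(x.grad)          # tensor([0.]) -- ReLU blocks the gradient

x = torch.tensor([-1.5], requires_grad=True)
F.gelu(x).backward()
print(x.grad)          # non-zero, approximately tensor([-0.1275])
```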
While ReLU has slight computational advantages due to its simplicity, the differentiable nature of the GELU function facilitates optimization using gradient descent; this leads to faster convergence and potentially better performance in practice.
As a result, the choice of GELU in the AYOLO architecture is a strategic choice that prioritizes the activation function’s ability to handle complex patterns and provide a more stable and robust learning process. While ReLU is slightly faster, the benefits of GELU, particularly its impact on the model’s learning capabilities, outweigh the minimal gains in computational speed that ReLU provides.
To build our feed-forward network (FFN), we reference the "Shuffle Transformer Block" methodologies presented in [64,65]. Our model uses the ShuffleNetV2 depth-wise convolution [66] to reduce computation. To improve the local connectivity of our network, we add a depth-wise convolution layer between two 1x1 convolution layers, and we set the expansion factor of the FFN to 2 to balance speed and computational cost. The FFN can be formulated as in Equations (3) and (4):

$$Z = \mathrm{dwconv}_{3\times3}\big(\mathrm{gelu}(\mathrm{bn}(\mathrm{conv}_{1\times1}(X)))\big) \qquad (3)$$

$$Y = \mathrm{bn}\big(\mathrm{conv}_{1\times1}(\mathrm{gelu}(\mathrm{bn}(Z)))\big) \qquad (4)$$

In Equations (3) and (4), $X$ is the input and $Y$ the output feature map, $Z$ is the output of the 3x3 depth-wise convolution layer, $\mathrm{conv}_{1\times1}$ denotes a 1x1 convolution layer, $\mathrm{dwconv}_{3\times3}$ the 3x3 depth-wise convolution, gelu the GELU activation function, and bn Batch Normalization.
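A minimal sketch consistent with this description, assuming Batch Normalization and GELU after the first two convolutions, an expansion factor of 2, and an assumed residual connection (class name and sizes are illustrative, not the AYOLO source):

```python
import torch
import torch.nn as nn

class FFNSketch(nn.Module):
    """Illustrative FFN: a 3x3 depth-wise convolution between two 1x1
    convolutions, with BatchNorm, GELU and an expansion factor of 2."""

    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU(),
            # depth-wise: groups == channels, each channel filtered separately
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return self.net(x) + x    # residual connection (assumed, not in Eqs. (3)-(4))

x = torch.randn(1, 128, 20, 20)
print(FFNSketch(128)(x).shape)    # torch.Size([1, 128, 20, 20])
```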
3.5. Information Transformer Encoder Module ITEM
The Information Transformer Encoder Module (ITEM) is designed to combine local and global information for advanced feature representation. This module is based on the principle of efficient computing and the strategic use of global information at various stages of the network.
Leveraging Segmentation Algorithms for Global Information Integration: ITEM uses advanced segmentation algorithms to segment and assimilate global information, facilitating the synthesis of multilayer features [58]. This modular approach leverages attention mechanisms, as illustrated in Figure 10, to meticulously align features at different scales and sizes [37,38].
Attention Mechanism and Feature Scaling
Considering the variation in the size of features containing global and local information, ITEM adopts average pooling (avgpool) and bilinear interpolation strategies to scale and align them to a uniform size. This alignment is crucial to ensuring the consistency of the information combined across the network.
RepConv Block for Information Extraction and Fusion: After attention fusion, the ITEM module uses the RepConv Block to further parse and combine features. This block acts as a conduit for the extraction of salient information, enabling a more detailed and nuanced combination of local and global cues.
ITEM’s Role in AYOLO Architecture: It is effective in increasing the accuracy of the detection model by ensuring that the global context and local details are not only preserved but also effectively integrated.
Our aim in this module is to compute more efficiently, to collect global information, and to use it efficiently at different stages. We aim to combine global information coming from layers at different levels, aligning it using average pooling (avgpool) and bilinear interpolation. At the end of each attention fusion, a RepConv Block is used to further extract and combine the information.
The architecture is depicted in Figure 9. Local information and global information are used as input, and the architecture can be formulated as follows, where element-wise multiplication is denoted by the symbol ⊗ and element-wise addition by the symbol ⊕.
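A minimal sketch of this fusion pattern, assuming the global feature is resized with average pooling or bilinear interpolation and turned into sigmoid attention weights before the ⊗ and ⊕ operations (the helper name and the sigmoid gating are our illustrative choices, not the exact ITEM design):

```python
import torch
import torch.nn.functional as F

def item_fusion(local_feat, global_feat):
    """Illustrative fusion of local and global features: align the global
    feature to the local spatial size, derive attention weights from it,
    then combine with element-wise multiplication (⊗) and addition (⊕)."""
    h, w = local_feat.shape[-2:]
    gh, gw = global_feat.shape[-2:]
    if (gh, gw) != (h, w):
        if gh > h:   # shrink with average pooling
            global_feat = F.adaptive_avg_pool2d(global_feat, (h, w))
        else:        # enlarge with bilinear interpolation
            global_feat = F.interpolate(global_feat, size=(h, w),
                                        mode="bilinear", align_corners=False)
    attn = torch.sigmoid(global_feat)        # simple attention weights
    return local_feat * attn + global_feat   # ⊗ then ⊕

local_x = torch.randn(1, 256, 40, 40)
global_x = torch.randn(1, 256, 20, 20)
print(item_fusion(local_x, global_x).shape)   # torch.Size([1, 256, 40, 40])
```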