I. Introduction
Image classification is one of the fundamental tasks of computer vision: given an input image, a classification algorithm determines the category to which the image belongs. Images can be classified in many ways, and different classification criteria yield different results. The main stages of image classification are image preprocessing, feature description and extraction, and classifier design. Preprocessing includes image filtering (such as median, mean, and Gaussian filtering) and normalization, whose purpose is to facilitate subsequent processing of the target image. Image features describe an image's salient attributes, and each image has its own characteristics [1,2,3,4,5,6]. Feature extraction selects and extracts features appropriate to the chosen classification method based on the characteristics of the image itself. A classifier is an algorithm that assigns target images to categories based on the selected features.
Traditional image classification methods follow the pipeline above, and their performance differences depend mainly on feature extraction and classifier selection. In traditional algorithms the features are selected by hand. Commonly used features include low-level visual cues such as shape, texture, and color, as well as local invariant descriptors such as the scale-invariant feature transform (SIFT), local binary patterns (LBP), and histograms of oriented gradients (HOG) [7,8,9]. Although these features have a certain degree of generality, they are not well tailored to specific images or specific partitioning schemes. Moreover, for images of complex scenes it is very difficult to design hand-crafted features that accurately describe the target image. Traditional classifiers include k-nearest neighbors and support vector machines [10,11]. For simple image classification tasks these classifiers are easy to implement and perform well; however, when class differences are subtle or the images are heavily corrupted, their accuracy drops significantly. Traditional classifiers are therefore not suitable for classifying complex images.
With the advent of the intelligent information age, deep learning has emerged. As a branch of machine learning, deep learning aims to emulate the human neural system: it builds deep artificial neural networks that analyze and interpret input data, combining low-level features into abstract high-level representations. The technology has played an irreplaceable role in artificial intelligence fields such as computer vision and natural language processing. As a typical representative of deep learning, the deep convolutional neural network (DCNN) performs well in computer vision tasks [33]. For example, in autonomous driving, CNNs have improved image recognition, environment perception, and path planning [34]. Similarly, methods such as Class Probability Space Regularization (CPSR) have enhanced pixel-level precision in semantic segmentation [35], and techniques such as Multiple Distributions Representation Learning (MDRL) have further refined segmentation in complex scenes [36]. In medical image recognition, deep learning systems have been used to identify image features automatically, improving diagnostic accuracy in tasks such as tumor detection [48]. In cybersecurity and defense, AI and ML have shown great potential for strengthening data security and defense capabilities through faster threat detection, predictive analysis, and strategic decision-making. Recent advances in remote sensing image segmentation with U-Net enhancements using SimAM and CBAM, along with Transformer-based multimodal approaches in healthcare that combine imaging data with clinical reports, have significantly improved segmentation accuracy and stroke treatment outcome prediction compared with single-modality models [50,51,52,53].
Compared with traditional image classification algorithms that rely on manual feature extraction, convolutional neural networks extract features from input images through convolution operations, and can effectively learn feature expressions from a large number of samples, thereby enhancing the generalization ability of the model.
Figure 1 shows a classic three-layer neural network structure. The numbers of nodes in the input and output layers are usually fixed, while the number of nodes in the hidden layer can be chosen freely. The topology and arrows in the diagram represent the flow of data during prediction. The key elements of the diagram are not the circles (the neurons) but the connecting lines (the connections between neurons): each connection carries its own weight, and these weights are what training adjusts.
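To make the structure in Figure 1 concrete, the following is a minimal sketch of a three-layer fully connected network in PyTorch; the layer sizes and class name are illustrative assumptions rather than the paper's actual implementation.

import torch
import torch.nn as nn

class ThreeLayerNet(nn.Module):
    """Input layer -> hidden layer -> output layer, as in Figure 1."""
    def __init__(self, in_features=784, hidden=128, num_classes=10):
        super().__init__()
        # Each nn.Linear stores a weight matrix: one trainable weight
        # per connecting line between two adjacent layers.
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))   # hidden-layer activations
        return self.fc2(x)            # class scores

# Example: a batch of 4 flattened 28x28 images.
net = ThreeLayerNet()
scores = net(torch.randn(4, 784))
print(scores.shape)  # torch.Size([4, 10])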
Among the many neural network models, AlexNet, GoogleNet, VGG16, and ResNet are classic representative architectures that have achieved breakthrough results in large-scale image recognition tasks [12,13,14,15]. These models have also demonstrated strong performance in real-world applications, including GPU partitioning optimization in autonomous systems for better control performance [37]. In addition, methods such as Bayesian optimization have had notable success in black-box model optimization, where Neural Processes are used to address the challenges of large parameter spaces and numerous observations [38]. The introduction of these models not only advanced deep learning research but also demonstrated strong performance in practical applications [30,31,32].
Although these classic deep learning models perform well in image classification, further improving neural network performance remains an important research direction as application requirements grow and data scales expand. In recent years, Bayesian optimization and the channel and spatial attention mechanisms have received widespread attention as effective ways to improve neural network performance. Bayesian optimization searches for optimal hyperparameters efficiently by constructing a surrogate model, while the channel and spatial attention mechanisms enhance the model's representational ability by adaptively re-weighting important information in the feature map [28,29]. For instance, in remote sensing, location-refined feature pyramid networks (LR-FPNs) have enhanced the extraction of positional information, improving object detection [39].
Bayesian optimization has been applied in real-world settings such as retail, where deep learning systems using optimized YOLOv10 models have improved product recognition accuracy and checkout speed [40]. Parameter-efficient transfer learning methods like the VMT-Adapter have also enhanced multi-task vision performance in dense scene understanding with minimal overhead [41]. Additionally, multi-modal learning frameworks like the Multi-modal Alignment Prompt (MmAP) have boosted task complementarity while reducing trainable parameters [42].
Innovative approaches such as self-training with label-feature consistency (ST-LFC) have tackled domain adaptation challenges, significantly improving benchmark performance [43]. Similarly, advanced pedestrian detection methods such as V2F-Net, which handles occluded pedestrians by separating visible-region detection from full-body estimation, have shown superior results [44]. Duality-based approaches for regret minimization in Markov decision processes have demonstrated sublinear regret in decision-making strategies [45]. Visual defect detection models leveraging confident learning techniques have addressed noisy, imbalanced data in industrial applications [46]. Multi-modal deep learning methods have enabled more accurate classification of repairable defects, especially by fusing tabular and image data [47]. Finally, K-means-clustering-enhanced support vector machines (SVMs) have been employed to improve classification performance in robotics tasks [49].
This paper aims to reproduce the classic neural network models AlexNet, GoogleNet, VGG16 and ResNet, and on this basis, design and implement an improved model that combines Bayesian optimization, channel attention and spatial attention. Through experimental comparison, this paper will analyze and verify the performance advantages of the improved model in image classification tasks.
II. Methods
In this chapter we introduce the proposed method in detail. The network mainly comprises a channel attention mechanism, a spatial attention mechanism, and a neural network based on Bayesian optimization. First, the input features are processed by the channel attention module and the spatial attention module to extract the relevant feature information. Under the Bayesian treatment, the network weights are not conventional point values but probability distributions, which greatly increases the generalization ability of the network. This design strengthens both the feature extraction capability and the classification performance on unseen data.
Figure 2 shows the classification network structure proposed in this paper.
A. Bayesian Optimization
In the practice of machine learning and deep learning, the tuning of model parameters is a key step. Common hyperparameters include learning rate, regularization coefficient, number of network layers, number of neurons in each layer, etc. Appropriate parameter configuration can significantly improve the performance of the model. In order to find these optimal parameters, researchers have developed a variety of optimization methods, such as grid search, random search, and Bayesian optimization.
Bayesian optimization is a global optimization algorithm based on Bayes' theorem, suited to situations where the objective function is expensive or difficult to evaluate. Its core idea is to guide the search by building a probabilistic model of the objective function, so as to find the parameter configuration that optimizes the objective. Specifically, Bayesian optimization first assumes that the objective function $f$ follows a Gaussian process (GP):
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big),$$
where $m(x)$ is the mean function, usually set to zero, and $k(x, x')$ is the kernel function that describes the similarity between any two points. Based on the current observation data $D_n = \{(x_i, y_i)\}_{i=1}^{n}$, the posterior distribution of the Gaussian process can then be updated:
$$f(x) \mid D_n \sim \mathcal{N}\big(\mu(x), \sigma^{2}(x)\big).$$
The posterior mean and variance functions are
$$\mu(x) = \mathbf{k}(x)^{\top} K^{-1} \mathbf{y}, \qquad \sigma^{2}(x) = k(x, x) - \mathbf{k}(x)^{\top} K^{-1} \mathbf{k}(x),$$
where $\mathbf{k}(x) = [k(x, x_1), \ldots, k(x, x_n)]^{\top}$.
In each iteration, we select the point most likely to improve performance, based on the current probabilistic model of the objective function, and evaluate it. After the evaluation, the new observation is added to the data and the probabilistic model is updated. This process repeats until a preset number of iterations or another stopping condition is reached. The advantage of Bayesian optimization is that it intelligently selects the next evaluation point from the historical observations, finding a near-optimal parameter configuration in fewer iterations. In addition, Bayesian optimization can handle complex objective functions, such as multimodal and non-convex ones, which makes it perform well in many practical applications.
Traditional hyperparameter selection methods mainly include grid search and random search, but both have drawbacks. Grid search must traverse all possible parameter combinations, and its computational cost grows rapidly with the number of parameters and the size of their value ranges, which makes it very time-consuming for large neural networks. Random search can improve efficiency in some cases, but it may still leave blind spots in high-dimensional parameter spaces and fail to find the best combination. Moreover, both methods tend to settle on locally good configurations in high-dimensional spaces, making it difficult to find the global optimum. A simple illustration of the cost gap is sketched below.
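The sketch enumerates a full grid over three hyperparameters and contrasts it with a fixed random-search budget; the parameter names and value ranges are hypothetical.

import itertools
import random

# Hypothetical search space for three hyperparameters.
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "weight_decay": [0.0, 1e-5, 1e-4, 1e-3],
    "hidden_units": [64, 128, 256, 512],
}

# Grid search: every combination must be evaluated.
grid = list(itertools.product(*space.values()))
print(f"grid search trials: {len(grid)}")  # 5 * 4 * 4 = 80

# Random search: a fixed budget of samples, regardless of grid size.
random.seed(0)
budget = 20
samples = [
    {name: random.choice(values) for name, values in space.items()}
    for _ in range(budget)
]
print(f"random search trials: {len(samples)}")  # 20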
Assume the hyperparameter to be optimized is $\lambda$ and the goal is to minimize the loss function $L(\lambda)$. The search process of a traditional method can be expressed as
$$\lambda^{*} = \arg\min_{\lambda \in \Lambda} L(\lambda),$$
where $\Lambda$ denotes the hyperparameter search space. In this setting, Bayesian optimization is a good solution. Because it uses prior knowledge and the observations collected so far to continuously update a surrogate model (such as a Gaussian process), it explores and exploits the search space efficiently. Through surrogate-model prediction and acquisition-function selection, Bayesian optimization approaches the optimal solution quickly with limited computational resources, reducing wasted evaluations. Each update revises the mean and variance of the surrogate model:
$$\mu(\lambda) = \mathbf{k}(\lambda)^{\top} K^{-1} \mathbf{y}, \qquad \sigma^{2}(\lambda) = k(\lambda, \lambda) - \mathbf{k}(\lambda)^{\top} K^{-1} \mathbf{k}(\lambda),$$
where $\mathbf{y}$ is the observation vector and $K$ is the covariance matrix. In this way, by constantly adjusting the surrogate model and balancing exploration and exploitation, Bayesian optimization is better able to escape local optima and find the global optimum. During Bayesian optimization, the selection of the next hyperparameters can be expressed as
$$\lambda_{t+1} = \arg\max_{\lambda \in \Lambda} \alpha(\lambda \mid D_t),$$
where $\alpha$ is the acquisition function, and the next set of hyperparameters is selected by maximizing $\alpha$. With Bayesian optimization, the neural network model can find a near-optimal hyperparameter combination in less time, improving both model performance and training efficiency. This optimization method has proven its effectiveness and advantages in multiple deep learning applications.
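As a concrete illustration, the following is a minimal sketch of Gaussian-process-based Bayesian optimization with an expected-improvement acquisition function, using scikit-learn's GaussianProcessRegressor; the one-dimensional objective, candidate grid, and iteration budget are illustrative assumptions rather than the configuration used in our experiments.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lam):
    """Hypothetical validation loss as a function of one hyperparameter."""
    return np.sin(3 * lam) + 0.3 * (lam - 1.5) ** 2

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for minimization: expected improvement over the best observed loss."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 3.0, 200).reshape(-1, 1)  # search space

# Start from a few random evaluations.
X = rng.uniform(0.0, 3.0, size=(3, 1))
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):                      # optimization budget
    gp.fit(X, y)                         # update the surrogate model
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y.min())
    x_next = candidates[np.argmax(ei)]   # maximize the acquisition function
    y_next = objective(x_next[0])        # evaluate the true objective
    X = np.vstack([X, x_next])           # add the new observation
    y = np.append(y, y_next)

best_lambda = X[np.argmin(y), 0]
print(f"best lambda ~= {best_lambda:.3f}, loss ~= {y.min():.3f}")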
B. Channel Attention Mechanism
The channel attention mechanism in the field of computer vision is an important technique used to improve the performance and efficiency of models. The channel attention mechanism enables the model to better understand and process the input data by focusing on specific areas or features of the image.
The core idea of the channel attention mechanism is to perform weighted processing on different channels of the input data in order to better utilize multi-channel information. In computer vision, multi-channel information usually refers to different color channels or different feature channels of an image. By learning the importance of different channels, the channel attention mechanism can automatically adjust the weight of the input data, thereby achieving better performance in tasks such as classification and detection.
Figure 3 shows the channel attention structure [16,17,18,19,20].
Assume the input feature map is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ is the height, and $W$ is the width. The traditional convolution operation applies the same convolution kernel $w$ to each channel to obtain the output feature map $Y$:
$$Y = w * X.$$
The channel attention mechanism adaptively adjusts the importance of feature maps by assigning different weights to the feature maps of each channel, thereby extracting key information more effectively.
The channel attention mechanism usually obtains the global information of each channel through global average pooling and global maximum pooling, and then computes the weight of each channel through fully connected layers. Global average pooling and global maximum pooling are applied to the input feature map to obtain the global description vector $z$; for the average-pooling branch, the $c$-th element is
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j),$$
where $H$ is the height and $W$ is the width (the max-pooling branch is defined analogously). The channel weights $s$ are then computed by two fully connected layers and activation functions:
$$s = \sigma\big(W_2\, \delta(W_1 z)\big),$$
where $W_1$ and $W_2$ are the weight matrices of the fully connected layers, $\delta$ is the ReLU activation function, and $\sigma$ is the Sigmoid activation function. The calculated weight vector $s$ is applied to each channel of the original feature map to obtain the weighted feature map $\tilde{X}$:
$$\tilde{X}_c = s_c \cdot X_c,$$
where $\tilde{X}_c$ is the weighted feature map of the $c$-th channel and $s_c$ is the weight of the $c$-th channel.
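The following is a minimal PyTorch sketch of a channel attention module along these lines, with both average- and max-pooled descriptors passed through a shared two-layer MLP; the reduction ratio and module name are illustrative assumptions, not the exact configuration of our model.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared two-layer MLP (W1, W2) applied to both pooled descriptors.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))               # global average pooling -> (B, C)
        mx = x.amax(dim=(2, 3))                # global max pooling -> (B, C)
        s = self.sigmoid(self.fc(avg) + self.fc(mx))   # channel weights in (0, 1)
        return x * s.view(b, c, 1, 1)          # re-weight each channel

# Example usage on a random feature map.
feat = torch.randn(2, 64, 32, 32)
out = ChannelAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 32, 32])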
C. Spatial Attention Mechanism
As a type of attention mechanism, the spatial attention mechanism has attracted particular attention from researchers. It is an adaptive region selection mechanism that enables the model to pay more attention to the key areas in the image and improve the performance of the model.
The core idea of the spatial attention mechanism is to enable the model to adaptively select the areas that need attention. Specifically, the spatial attention mechanism adjusts the model’s attention to different areas by assigning different weights to each pixel in the image. The size of the weight depends on the importance of the pixel to the task. For important areas, the weight is larger, and the model will pay more attention to it; for unimportant areas, the weight is smaller, and the model will pay less attention to it.
Figure 4 shows the structure of the spatial attention mechanism [21,22,23,24,25].
Assume the input feature map is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ is the height, and $W$ is the width. The traditional convolution operation applies the same convolution kernel $w$ to all spatial positions to obtain the output feature map $Y$:
$$Y(i, j) = (w * X)(i, j),$$
where $i$ and $j$ are the position coordinates in the feature map. The spatial attention mechanism adjusts the spatial distribution of the feature map by assigning a weight to each position based on both local and global information. By recalculating these weights for each input, it highlights key areas and suppresses irrelevant ones, which enhances the adaptability of the model. It typically gathers information with maximum and average pooling and then applies a convolution to compute the position weights. First, pooling along the channel dimension produces two feature maps $F_{\text{avg}}$ and $F_{\text{max}}$:
$$F_{\text{avg}}(i, j) = \frac{1}{C} \sum_{c=1}^{C} X_c(i, j), \qquad F_{\text{max}}(i, j) = \max_{c} X_c(i, j).$$
Next, the two feature maps are stacked and the weight of each spatial position is calculated through a convolutional layer:
$$M(i, j) = \sigma\big(f([F_{\text{avg}}; F_{\text{max}}])(i, j)\big),$$
where $\sigma$ is the Sigmoid activation function and $f$ is the convolution operation. Finally, the feature map is re-weighted: the calculated spatial weight map $M$ is applied to each channel of the original feature map to obtain the weighted feature map $\tilde{X}$:
$$\tilde{X}_c(i, j) = M(i, j) \cdot X_c(i, j).$$
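Below is a minimal PyTorch sketch of such a spatial attention module (CBAM-style); the 7x7 kernel size and module name are illustrative assumptions.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input channels: channel-wise average map and channel-wise max map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)    # F_avg: (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)    # F_max: (B, 1, H, W)
        m = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))  # weight map M
        return x * m                             # re-weight every spatial position

# Example usage on a random feature map.
feat = torch.randn(2, 64, 32, 32)
out = SpatialAttention()(feat)
print(out.shape)  # torch.Size([2, 64, 32, 32])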
To sum up, the spatial attention mechanism can significantly improve the feature representation ability and classification performance of the model.
D. Loss Function
The Kullback-Leibler divergence loss (KLDivLoss) measures the difference between the model's predicted probability distribution and the true probability distribution. Usually the true distribution is represented by a one-hot encoding, while the predicted distribution is the probability vector output by the model. The KL divergence loss is computed as
$$L_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)},$$
where $P$ denotes the true probability distribution (one-hot encoding), $Q$ denotes the model's predicted probability distribution, and $\sum_i$ denotes summation over the classes. The loss first takes the ratio of the true distribution to the predicted distribution, applies the logarithm, multiplies element-wise by the true distribution, and sums the result. The smaller the KL divergence loss, the closer the predicted distribution is to the true distribution [26,27].
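As a brief usage sketch, PyTorch's nn.KLDivLoss expects log-probabilities as input and probabilities as target; the batch size, class count, and the light label smoothing (used here only to keep every target probability strictly positive) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.KLDivLoss(reduction="batchmean")   # averages the summed KL over the batch

logits = torch.randn(4, 100)                      # raw model outputs: 4 samples, 100 classes
log_q = F.log_softmax(logits, dim=1)              # predicted distribution Q (log-probabilities)

labels = torch.tensor([3, 17, 42, 99])
p = F.one_hot(labels, num_classes=100).float()    # true distribution P (one-hot)
p = p * 0.9 + 0.1 / 100                           # smoothing keeps every P(i) > 0 in the KL sum

loss = criterion(log_q, p)                        # sum_i P(i) * (log P(i) - log Q(i)), batch-averaged
print(loss.item())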
IV. Conclusion
This paper conducts a detailed study of, and experiments on, improving the performance of classic neural network models in image classification tasks. We reproduce the classic models (AlexNet, GoogleNet, VGG16, and ResNet) and, on this basis, propose an improved model that integrates Bayesian optimization with channel and spatial attention mechanisms. Bayesian optimization is used for hyperparameter tuning, which significantly improves the convergence speed and final performance of the model: by intelligently selecting evaluation points, it reduces unnecessary computation and finds better hyperparameter configurations, improving accuracy and other performance indicators. The channel attention mechanism allows the model to adaptively adjust the importance of different channels; by emphasizing important features and suppressing irrelevant ones, it effectively improves the model's feature extraction capability and thus its classification performance. The spatial attention mechanism enables the model to better capture spatial relationships in the image and focus on key areas, improving classification of complex images by optimizing the spatial distribution of the feature maps. Experiments on the CIFAR-100 dataset show that the improved model is significantly better than the classic models in accuracy, loss, precision, recall, and F1 score, with the accuracy reaching 77.6%, demonstrating the effectiveness of the proposed improvements. The success of the improved model not only shows the potential of Bayesian optimization and attention mechanisms for improving model performance, but also provides new ideas for the design of future deep learning models. In the future, these techniques may be applied and extended to a wider range of image classification tasks and other computer vision problems, further advancing deep learning technology.