3.1. Quantization
Quantization reduces the bit-width of computations and tensor storage compared to floating-point precision. This leads to more compact model representation, faster operations, and lower computational costs during runtime while maintaining reasonable accuracy.
Quantization involves two main operators: Quantize (Q) and Dequantize (DQ). Let $[\alpha, \beta]$ be the range of real values and $b$ the bit-width of the lower-precision format, where $\alpha$ and $\beta$ are the smallest and largest values of the real floating-point range. A signed $b$-bit integer format can represent $2^{b}$ possible integer values, with a value range of $[-2^{b-1},\, 2^{b-1}-1]$. In this super-resolution problem, we take the real value range to be that of single-precision floating-point numbers, i.e., floating-point numbers with 32-bit precision. The quantization operation maps a real value $x$ to a value within the low-precision $b$-bit range.
The quantization operator comprises two processes: a real-value transformation and a clipping process. The transformation of a real value $x$ into a quantized value $x_q$ can be defined as Eq. (1):

$$x_q = \operatorname{round}\!\left(\frac{x}{s}\right) + z \tag{1}$$

where $z$ is an immutable parameter (the zero point) of the same type as the quantized value; it represents the quantized value corresponding to the real value zero. $s$ is a scaling factor that divides the real value range into a number of parts. In asymmetric quantization, we can calculate the scaling factor $s$ and zero point $z$ as Eq. (2):

$$s = \frac{\beta - \alpha}{2^{b} - 1}, \qquad z = -\operatorname{round}\!\left(\frac{\alpha}{s}\right) - 2^{b-1} \tag{2}$$
However, if the output $x_q$ falls outside the precision range of $b$ bits, that is, outside the interval $[-2^{b-1},\, 2^{b-1}-1]$, it is clipped so that it does not leave that range: if $x_q < -2^{b-1}$ we set $x_q = -2^{b-1}$, and if $x_q > 2^{b-1}-1$ we set $x_q = 2^{b-1}-1$. Mathematically, the quantized value $Q(x)$, with $x$ the real value and $b$ the low-precision bit-width, is defined by Eq. (3):

$$Q(x) = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; -2^{b-1},\; 2^{b-1}-1\right) \tag{3}$$

The approximation of the real value recovered from its quantized value by the dequantization operation is defined as Eq. (4):

$$\hat{x} = DQ\!\left(x_q\right) = s\,(x_q - z) \approx x \tag{4}$$
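To make Eqs. (1)-(4) concrete, the following NumPy sketch (our own illustration, with hypothetical function names and an arbitrary example range, not code from the surveyed works) quantizes a small tensor to signed 8-bit integers and dequantizes it back:

```python
# Illustrative sketch (not from the cited works): asymmetric quantization
# following Eqs. (1)-(4), with hypothetical function names.
import numpy as np

def asymmetric_qparams(alpha, beta, b):
    """Scale s and zero point z for the real range [alpha, beta], as in Eq. (2)."""
    s = (beta - alpha) / (2 ** b - 1)
    z = -round(alpha / s) - 2 ** (b - 1)
    return s, z

def quantize(x, s, z, b):
    """Eq. (3): scale, round, shift by the zero point, then clip to the b-bit range."""
    x_q = np.round(x / s) + z                                   # Eq. (1)
    return np.clip(x_q, -2 ** (b - 1), 2 ** (b - 1) - 1).astype(np.int32)

def dequantize(x_q, s, z):
    """Eq. (4): map the integers back to approximate real values."""
    return s * (x_q - z)

x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
s, z = asymmetric_qparams(alpha=-1.0, beta=1.5, b=8)
x_q = quantize(x, s, z, b=8)
x_hat = dequantize(x_q, s, z)   # x_hat differs from x by at most s/2 inside [alpha, beta]
```

Inside the clipping range, the reconstruction error $|x - \hat{x}|$ is at most $s/2$; values outside $[\alpha, \beta]$ are saturated by the clipping in Eq. (3).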
In general, the coefficients $s$ and $z$ depend heavily on the real interval $[\alpha, \beta]$, and determining this interval is called clipping-range learning. There are two approaches to determining the clipping range: determining it without re-training the model and determining it by re-training the model. Two typical methods reflecting these approaches are Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), respectively.
The approach that determines the clipping range without re-training the model (such as PTQ) forms the clipping range by feeding representative samples into the model; the model is then quantized based on the statistics collected from those samples. This method is fast, but it produces lower output accuracy than learning the clipping range by re-training the model.
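As a minimal sketch of this calibration step (assuming a layer of the model is exposed as a callable `layer_fn` and that `batches` holds the representative samples; both names are hypothetical), the observed minimum and maximum over the samples give the clipping range from which Eq. (2) yields $s$ and $z$:

```python
# Illustrative sketch of PTQ-style calibration; `layer_fn` and `batches` are
# hypothetical stand-ins for one layer of the model and the representative samples.
import numpy as np

def calibrate_range(layer_fn, batches):
    """Track the min/max values observed on representative inputs to get [alpha, beta]."""
    alpha, beta = np.inf, -np.inf
    for batch in batches:
        out = np.asarray(layer_fn(batch))
        alpha = min(alpha, float(out.min()))
        beta = max(beta, float(out.max()))
    return alpha, beta

# alpha, beta = calibrate_range(layer_fn, batches)
# s, z = asymmetric_qparams(alpha, beta, b=8)   # then quantize as in Eq. (2) above
```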
The approach that learns the clipping range by re-training the model (such as QAT) involves more steps and takes longer to tune the model, but it provides better output accuracy than the PTQ approach. In the QAT method, quantization nodes (Q nodes) and dequantization nodes (DQ nodes) are inserted into the network according to a specific set of rules [46]. The network is then trained again for several batches in a process called fine-tuning. The Q/DQ nodes simulate the quantization loss and add it to the training loss during fine-tuning, making the network more robust to quantization. However, a problem arises during the back-propagation stage of model tuning: the rounding function is non-differentiable at some points, and its derivative is equal to 0 almost everywhere.
One way to overcome this problem is to use the Straight-Through Estimator (STE) [47]. STE is a method of estimating gradients through thresholding operations in neural networks: the gradient at the input of the thresholding function is set equal to the gradient at its output, ignoring the derivative of the thresholding function itself. In particular, STE approximates the derivative of the rounding function as 1. This means that the back-propagation computation is allowed to "skip" the Q and DQ blocks during network training.
Figure 1 shows an example of a simple forward and backward computation used in QAT. The QAT quantization process consists of the following four stages:
Stage 1: Train the network with floating-point operations.
Stage 2: Insert fake quantization layers into the trained network. These layers simulate integer quantization using floating-point computations and are usually implemented as a quantization step (Q) followed by a dequantization step (DQ).
Stage 3: Fine-tune the model. Note that in this process the gradients are still computed in floating point.
Stage 4: Deploy the model for inference by removing the fake quantization layers and loading the quantized weights.
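Stage 2 and the STE-based backward pass can be sketched as follows (a minimal PyTorch illustration with our own class name, assuming fixed per-tensor parameters s, z, and b; it is not the exact implementation used in the cited works):

```python
# Illustrative PyTorch sketch of a fake quantization op with STE; the class name
# and the fixed per-tensor parameters (s, z, b) are our own simplifications.
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Q followed by DQ in the forward pass; identity gradient (STE) in the backward pass."""

    @staticmethod
    def forward(ctx, x, s, z, b):
        qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
        x_q = torch.clamp(torch.round(x / s) + z, qmin, qmax)   # Q, as in Eq. (3)
        return s * (x_q - z)                                    # DQ, as in Eq. (4)

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through, ignoring round() and clamp().
        return grad_output, None, None, None

w = torch.randn(16, requires_grad=True)
w_fq = FakeQuantSTE.apply(w, 0.05, 0, 8)   # simulated 8-bit weights used in the forward pass
w_fq.sum().backward()                      # w.grad is all ones, as if w_fq were w itself
```

Because the backward pass returns the incoming gradient unchanged, the non-differentiable rounding inside the forward pass no longer blocks fine-tuning.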
3.2. Network pruning
Network pruning is an important technique for reducing both memory footprint and memory bandwidth. This allows neural networks to be deployed in constrained environments such as embedded systems.
To effectively prune a pre-trained model, two aspects need to be explored: the pruned architecture and candidate selection. Pruned architectures can be divided into two types: human-defined and automatic. Human-defined pruning fixes the ratio of pruned channels in each layer, while automatic pruning determines the target architecture algorithmically, based on global comparisons of structural importance across layers. Unstructured pruning is a form of automatic pruning, in which the positions of the pruned weights are determined during training and the positions of the zeros cannot be predetermined.
Figure 1. Forward and backward propagation for QAT under the assumption of using STE [48].
For automatic network pruning, the most commonly used approach is to remove redundant weights that provide little information to the pre-trained model. The brute-force method was the earliest approach used, which involves setting each weight to 0 and checking the loss function’s change. However, due to the large search space, this method is not efficient. Therefore, other methods have been developed, broadly categorized into two types: Magnitude-based pruning and Penalty-based pruning. Both approaches generate values close to 0 for weights, effectively overcoming the drawback of the brute-force method.
Magnitude-based pruning removes weights based on the idea that weights trained to larger values are more important. The most basic methods prune weights that are exactly 0 or all weights whose magnitude falls below a given threshold. One method based on the Hessian matrix is LeCun et al.'s Optimal Brain Damage (OBD) [49], which uses a second-order Taylor approximation to minimize the difference between the loss of the pruned weights and the loss of the weights before pruning. This method achieves high accuracy but requires significant computation, particularly for the matrix inversion in the optimization.
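By contrast, the simplest magnitude-based variant only needs a threshold. The sketch below (our own illustrative code with hypothetical names and an arbitrary sparsity level, not the OBD procedure) removes the globally smallest weights and returns the masks that fine-tuning should keep re-applying:

```python
# Illustrative sketch of global magnitude-based pruning (not the OBD procedure);
# the function name, sparsity level, and example layers are our own choices.
import torch

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    flat = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(flat, sparsity)          # global magnitude threshold
    masks = []
    for w in weights:
        mask = (w.abs() > threshold).float()            # 1 = keep, 0 = prune
        w.data.mul_(mask)                               # zero out pruned weights in place
        masks.append(mask)
    return masks                                        # reapply after each fine-tuning step

layers = [torch.nn.Linear(64, 32), torch.nn.Linear(32, 10)]
masks = magnitude_prune([layer.weight for layer in layers], sparsity=0.5)
```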
The penalty-based pruning method aims to reduce overfitting of the model using regularization, which adds an extra term to the loss function to penalize the complexity of the model. This new loss function is called the regularization loss function and is usually defined as Eq. (7):

$$L(\mathbf{w}) = L_{0}(\mathbf{w}) + \lambda R(\mathbf{w}) \tag{7}$$

where $L_{0}(\mathbf{w})$ is the original loss function of the model and $R(\mathbf{w})$ is the regularization term, which depends only on the weights of the model. The constant $\lambda$ is usually a small positive number, also known as the regularization parameter; it is chosen to be small so that the solution of the optimization problem for $L(\mathbf{w})$ is not far from the solution of the optimization problem for $L_{0}(\mathbf{w})$. The regularization term most often used is the $L_{p}$ regularization function, defined as Eq. (8):

$$R(\mathbf{w}) = \lVert \mathbf{w} \rVert_{p} = \left(\sum_{i=1}^{n} |w_{i}|^{p}\right)^{1/p} \tag{8}$$

where $n$ is the number of elements in the vector $\mathbf{w}$. For $p = 1$ this is called LASSO pruning, and for $p = 2$ it is called weight-decay pruning. LASSO pruning has been shown to be more effective for weight selection than weight-decay pruning, because the $L_{2}$ regularization function (also known as the ridge loss) only pushes weights toward 0 rather than setting them exactly to 0, whereas the $L_{1}$ regularization function drives more weights exactly to 0. However, using the $L_{1}$ regularization function raises a trade-off between the sparsity and the performance of the model [50].
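As a concrete illustration of Eqs. (7) and (8) (a hedged sketch with our own function name and regularization strength), the penalty is simply added to the task loss during training:

```python
# Illustrative sketch of Eqs. (7) and (8): add an L_p penalty to the task loss.
# The function name and the value of `lam` are our own choices.
import torch

def regularized_loss(task_loss, parameters, lam=1e-4, p=1):
    """Eq. (7): L = L0 + lambda * R(w), with R(w) the L_p norm of Eq. (8)."""
    reg = sum(param.abs().pow(p).sum() for param in parameters) ** (1.0 / p)
    return task_loss + lam * reg

# Inside a training loop (model, criterion, inputs, targets assumed to exist):
# loss = regularized_loss(criterion(model(inputs), targets), model.parameters(), p=1)
# loss.backward()   # p=1 gives LASSO pruning, p=2 weight-decay pruning
```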
The pruning algorithms mentioned so far prune only at the weight level; higher-level pruning requires methods based on structures such as groups or sub-networks. The regularization technique can be extended to grouped regularization, expressed as the grouped regularization loss function:

$$L(\mathbf{W}) = L_{0}(\mathbf{W}) + \lambda \sum_{k=1}^{K} R\!\left(\mathbf{W}_{k}\right)$$

where $L_{0}(\mathbf{W})$ is the original loss function of the model, $\mathbf{W} = \{\mathbf{W}_{k}\}_{k=1}^{K}$ is the set of trainable weights of all $K$ layers in the model, and $R(\mathbf{W}_{k})$ is the regularization operator in layer $k$ used to prune the weight set $\mathbf{W}_{k}$. The parameter $\lambda$ is the regularization parameter, which balances the loss function and the pruning criterion. Yuan and Lin [51] proposed the grouped LASSO regularization. To remove subsets of weights such as filters or channels, the subsets must be treated as groups in the regularization criterion; the grouped LASSO constrains unnecessary subsets of parameters to become 0 simultaneously. The regularization term of the grouped LASSO is defined as Eq. (10):

$$R(\mathbf{W}) = \sum_{g \in G} \left\lVert \mathbf{W}_{g} \right\rVert_{2} \tag{10}$$
where $g$ is a group in the set of groups $G$, and $\mathbf{W}_{g}$, the weight matrix or weight vector of group $g$, is a sub-matrix or sub-vector of $\mathbf{W}$. However, simply summing regularization terms across different layers may not be meaningful, because their distributions and magnitudes differ between layers. Recognizing this, Gongfan Fang et al. [52] proposed a regularization term in which the contribution of each layer is normalized by a layer-specific constant, so that removing groups becomes safer. Specifically, the regularization term that Gongfan Fang et al. [52] defined is represented as Eq. (11),
where $R_{k}$ represents the regularization term of layer $k$ and $\gamma_{k}$ is the regularization parameter for layer $k$.
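A minimal sketch of the grouped LASSO term of Eq. (10) for a convolutional layer is given below, treating each output filter as one group; the grouping and the per-layer weighting shown in the comment are our own illustration in the spirit of Eqs. (10) and (11), not the implementation of [51] or [52]:

```python
# Illustrative sketch of the group-LASSO term of Eq. (10) for a convolutional layer,
# treating each output filter as one group; the grouping and names are our own.
import torch

def group_lasso(conv_weight):
    """Sum over groups g of the L2 norm of W_g, so whole filters are pushed to zero together."""
    groups = conv_weight.flatten(start_dim=1)    # shape (out_channels, in_channels*kH*kW)
    return groups.norm(p=2, dim=1).sum()

# Per-layer weighting in the spirit of Eq. (11), with gamma_k chosen per layer k:
# reg = sum(gamma[k] * group_lasso(conv.weight) for k, conv in enumerate(conv_layers))
```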
3.3. Knowledge distillation
Knowledge Distillation (KD) refers to the process of transferring knowledge from one or a set of large, complex models to a smaller, deployable model under real-world constraints. It was first demonstrated by Buciluǎ et al. [
53] to compress a model and transfer information to train a smaller model without sacrificing accuracy. A knowledge distillation system consists of three main components: knowledge, distillation schemes, and distillation algorithms.
In a neural network, knowledge typically refers to the learned weights and biases. There are three main types of knowledge that can be distilled from a teacher model to a student model. The first is response-based knowledge, where the student model mimics the output of the teacher model; this is the most common type. The distillation loss in this case is computed as a divergence between the logits of the teacher and student models, and the Kullback-Leibler (KL) divergence is commonly used in response-based knowledge distillation methods [54, 55], as sketched after this paragraph. In feature-based knowledge distillation, the teacher model learns to recognize features in its intermediate layers, which can be used to train the student model; the distillation loss minimizes the difference between the feature activations of the teacher and student models. Finally, in relation-based knowledge distillation, the relationships between feature maps are used to train the student model. These relationships can be modeled as correlations between feature maps, graphs, similarity matrices, feature embeddings, or probability distributions over feature representations.
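For response-based distillation, the loss is commonly implemented by softening the teacher and student logits with a temperature T and measuring their KL divergence; the sketch below is our own illustration with example values for T and the weighting factor, not the exact loss of [54] or [55]:

```python
# Illustrative PyTorch sketch of a response-based distillation loss; the temperature T
# and the weighting factor alpha are our own example choices.
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Blend the usual cross-entropy with the KL divergence between softened logits."""
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

# student_logits = student(x); with torch.no_grad(): teacher_logits = teacher(x)
# loss = response_kd_loss(student_logits, teacher_logits, y)
```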
In terms of
distillation schemes, there are three major techniques that are commonly used: Offline Distillation, Online Distillation, and Self-Distillation. Offline Distillation is the most popular method, where a pre-trained teacher model is used to guide the student model. In the Offline Distillation process, the pre-trained teacher model is usually a large deep neural network. In some use cases, the pre-trained model may not be available for Offline Distillation. To address this limitation, Online Distillation can be used where both teacher and student models are updated simultaneously in an end-to-end training process. Online Distillation can be operated using parallel computing, making it a highly efficient method. Knowledge Distillation has two issues. The
first issue is that the selection of the teacher model significantly impacts the accuracy of the student model. The
second issue is that student models cannot always achieve the same high accuracy as teacher models [
56], which may lead to unacceptable accuracy degradation during deployment. Self-Distillation addresses these issues by using the same network as both teacher and student. First, Self-Distillation attaches shallow attention-based classifiers after the intermediate layers of the network at different depths. Then, during training, the deeper classifiers are regarded as teacher models. They are used to guide the training of the student classifiers using a divergence-based loss function on the outputs and an $L_{2}$ loss function [
57] on the feature maps. In the deployment stage, all the additional shallow classifiers are removed.
Meanwhile, in terms of
distillation algorithms, there are currently nine commonly used approaches. The
first type is adversarial learning, which enhances existing training sets to improve model performance or allow teacher-student models to learn better data distributions [
58]. The
second type is multi-teacher distillation, which uses several teachers with different structures that can provide different types of knowledge. When this combined knowledge is distilled into a student model, it can produce better predictions than distillation from any individual teacher. The
third type is cross-modal distillation, which transfers knowledge between different modalities. This situation arises when data or labels are not available for specific modalities during training or testing [
59], so knowledge must be transferred between modalities. The
fourth type is graph-based distillation, which captures internal data relationships using a graph. The graph is used in two ways: as a means of transferring knowledge and as a means of controlling the transfer of the teacher's knowledge. The
fifth type is attention-based distillation, which transfers knowledge using attention mappings. The
sixth type is data-free distillation, which synthesizes training data from a pre-trained teacher model. The
seventh type is
quantization distillation, which transfers knowledge from a high-precision teacher network (e.g., 32-bit floating point) to a low-precision student network (e.g., 8 bits). The
eighth type is
lifelong distillation based on the continuous learning mechanisms of lifelong learning, continual learning, and meta-learning, where previously learned knowledge is accumulated and transferred to future learning. The
ninth type is NAS-based distillation, which uses neural architecture search to determine student model architectures that are well suited to learning from the teacher model.