We train our proposed leakage segmentation network on an NVIDIA GPU. Subsequently, we assess the robustness and generalization of the trained model by cross-validation. The trained model is deployed on NVIDIA Jetson TX2 intelligent sensors for inference, where it identifies leakage in images in real time (i.e., highlights the leakage areas with bounding boxes).
We first introduce the uneven-light image enhancement in our designed ISLS, including the localization and the ULIE method. Subsequently, we present the segmentation network, in which we propose the Multi-scale Feature Fusion Module (MFFM) and the Sequential Self-Attention Mechanisms (SSAM). Detailed explanations of these components are presented in the subsequent sections.
2.1. Uneven-Light Image Enhancement for Illumination-Aware Region Enhancement
The actual leakage segmentation of sauce-packets is often affected by uneven-light sources, which consist of insufficient illumination and overexposure. To relieve the problem of image blurring caused by uneven illumination, we propose the Uneven-Light Image Enhancement (ULIE) method, employed in the illumination-aware region enhancement stage of ISLS. The ULIE method is inspired by related image enhancement algorithms [19,20,21]. Our ULIE method enhances the illumination of sauce-packet images under insufficient illumination conditions and improves the image contrast and texture details under overexposure conditions.
The input of ISLS is a three-channel RGB image, where R, G, and B represent the red, green, and blue color space values, respectively. We utilize the mean function in OpenCV to calculate the mean value of the three RGB channels within the ROI. Through extensive experimental analysis, we define 115 and 180 as the thresholds for insufficient illumination and overexposure, respectively. The implementation details of our ULIE method are as follows:
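As a minimal sketch of this illumination check (the ROI coordinates and function names below are illustrative placeholders, not the paper's code; only the thresholds 115 and 180 come from the text):

```python
import cv2
import numpy as np

LOW_LIGHT_THRESHOLD = 115
OVEREXPOSURE_THRESHOLD = 180

def classify_illumination(image_rgb: np.ndarray, roi: tuple) -> str:
    """Classify the ROI as insufficiently illuminated, overexposed, or normal."""
    x, y, w, h = roi
    roi_pixels = image_rgb[y:y + h, x:x + w]
    # cv2.mean returns per-channel means; average the R, G, B means.
    mean_rgb = cv2.mean(roi_pixels)[:3]
    mean_value = sum(mean_rgb) / 3.0
    if mean_value < LOW_LIGHT_THRESHOLD:
        return "insufficient_illumination"
    if mean_value > OVEREXPOSURE_THRESHOLD:
        return "overexposure"
    return "normal"
```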
In the case of insufficient illumination, the ULIE method is built upon the retinex model [22,23]. The retinex model theory posits that a color image can be decomposed into two primary components: the illumination component (lighting) and the reflection component, as shown in Equation (1):

Li(x) = Re(x) ◦ Tr(x),   (1)

where Li(x) and Re(x) represent the input image and the image to be recovered, respectively, Tr(x) represents the illumination mapping image, and the ◦ operator represents element-wise multiplication.
Firstly, to simplify the computation of the ULIE method, it is commonly assumed that the three channels of the image share the same illumination map [24]. The ULIE method calculates the maximum value among the RGB channels of the image to independently estimate the illumination of each pixel x, obtaining the initial estimation:

T̂r(x) = max_{c∈{R,G,B}} Li^c(x),   (2)

where x represents an individual pixel, c represents the color channel, and max_{c∈{R,G,B}} Li^c(x) is the maximum value of the input image over the RGB channels.
Secondly, to ensure that the illumination map does not cause the enhanced image to become overly saturated, the ULIE method modifies Re(x):

Re(x) = Li(x) / (Tr(x) + ε),   (3)

where ε is a very small constant that avoids division by zero.
Thirdly, the ULIE method employs the augmented Lagrangian multiplier optimization method [25] to preserve the structural information and smooth texture details of sauce-packet images. The ULIE method introduces the following optimization problem to accelerate the processing of sauce-packet images:

min_{Tr} ‖T̂r − Tr‖_F^2 + λ‖W ◦ ∇Tr‖_1,   (4)

where ‖·‖_F and ‖·‖_1 represent the F norm and the L1 regularization, respectively, and λ is the coefficient balancing the two terms. Additionally, W is the weight matrix, and ∇ represents a first-order derivative filter, encompassing both horizontal and vertical directions.
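As a hedged illustration of this low-light branch, the following Python sketch estimates the illumination map from the per-pixel maximum of the RGB channels (Equation (2)) and recovers the reflectance as in Equation (3); a Gaussian blur stands in for the augmented-Lagrangian refinement of Equation (4), so this is a simplified approximation rather than the paper's exact solver:

```python
import cv2
import numpy as np

def enhance_low_light(image_rgb: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Simplified retinex-style enhancement following Equations (1)-(4)."""
    img = image_rgb.astype(np.float32) / 255.0
    # Initial illumination estimate: per-pixel maximum over the RGB channels.
    illumination = img.max(axis=2)
    # Structure-preserving smoothing stand-in for the exact optimization.
    illumination = cv2.GaussianBlur(illumination, (15, 15), 0)
    # Recover the reflectance, avoiding division by zero with eps.
    reflectance = img / (illumination[..., None] + eps)
    return np.clip(reflectance * 255.0, 0, 255).astype(np.uint8)
```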
Finally, the ULIE method iteratively updates the illumination map according to the retinex model and solves Equation (1) to obtain the result image Re(x). The ULIE method applies BM3D [26] for denoising optimization of the result image Re(x). To reduce the computation of the denoising process in ULIE, the method transforms the three RGB channels of the result image Re(x) into the three YUV channels [27] and performs denoising only on the Y channel:

Y = 0.299R + 0.587G + 0.114B,
U = 0.492(B − Y),
V = 0.877(R − Y),   (5)

where Y represents luminance, and U and V represent blue chrominance and red chrominance, respectively.
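A minimal sketch of this Y-channel-only denoising step is shown below; cv2.fastNlMeansDenoising is used here purely as a readily available stand-in for BM3D, which the paper applies:

```python
import cv2
import numpy as np

def denoise_luminance_only(image_rgb: np.ndarray) -> np.ndarray:
    """Denoise only the Y channel of the enhanced image, as described above."""
    yuv = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2YUV)
    y, u, v = cv2.split(yuv)
    # Stand-in denoiser; swap in a BM3D implementation if available.
    y_denoised = cv2.fastNlMeansDenoising(y, None, 10)
    yuv_denoised = cv2.merge((y_denoised, u, v))
    return cv2.cvtColor(yuv_denoised, cv2.COLOR_YUV2RGB)
```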
In the case of overexposure, the ULIE method divides the image into blocks to obtain the overexposed regions (i.e., locally overexposed areas). Firstly, to obtain the illumination information of the ROI, we convert the RGB color space into the YUV color space, as shown in Equation (5). The ULIE method divides the input image into several small blocks and performs Contrast Limited Adaptive Histogram Equalization (CLAHE) [28] on each block to enhance the clarity of the image. CLAHE clips and redistributes the histogram of each sub-image, thereby limiting the degree of contrast enhancement. CLAHE prevents the amplification of noise and excessive enhancement [29]. The ULIE method initially divides the original image into several non-overlapping sub-images, each denoted as s. We compute the frequency of pixel values P_s(i), representing the data distribution of pixel value i within each sub-image. The definition of P_s(i) is given by Equation (6):

P_s(i) = n_i / N,   (6)

where P_s(i) represents the frequency of pixel values equal to i, n_i represents the number of pixels with a pixel value of i, and N represents the total number of pixels in the sub-image.
Secondly, the ULIE method computes the Cumulative Distribution Function (CDF) CDF_s(i) for each sub-image s in Equation (7), representing the cumulative frequency of pixel values less than or equal to i:

CDF_s(i) = Σ_{j=0}^{i} P_s(j),   (7)

where CDF_s(i) represents the CDF for the pixel value i, and P_s(j) represents the frequency of pixel values equal to j.
Thirdly, the ULIE method utilizes Equation (8) to compute the transformation function T_s(i) for each sub-image s, representing the function that maps the original pixel value i to a new pixel value:

T_s(i) = ⌊(L − 1) · CDF_s(i)⌋,   (8)

where T_s(i) represents the transformed pixel value for the original pixel value i, L represents the maximum range of pixel values, and ⌊·⌋ represents the floor function.
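As a concrete illustration of Equations (6)–(8), the following sketch (assuming an 8-bit single-channel sub-image and L = 256) computes the frequencies, the CDF, and the pixel mapping for one sub-image; the clip-and-redistribute step and the blending between neighboring sub-images are omitted for brevity:

```python
import numpy as np

def equalize_subimage(sub: np.ndarray, num_levels: int = 256) -> np.ndarray:
    """Histogram equalization of one sub-image via Equations (6)-(8)."""
    n = sub.size                                              # N: total pixels
    counts = np.bincount(sub.ravel(), minlength=num_levels)   # n_i
    p = counts / n                                            # Equation (6): P_s(i)
    cdf = np.cumsum(p)                                        # Equation (7): CDF_s(i)
    mapping = np.floor((num_levels - 1) * cdf).astype(np.uint8)  # Equation (8): T_s(i)
    return mapping[sub]                                       # apply T_s to each pixel
```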
The ULIE method clips and redistributes T_s(i) for each sub-image s, limiting the degree of contrast enhancement, which prevents the amplification of noise and excessive enhancement [28]. Finally, the ULIE method consolidates all transformed sub-images into the final image and converts the image from YUV format back to RGB format.
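In practice, this block-wise procedure corresponds closely to OpenCV's CLAHE implementation; a minimal usage sketch on the Y channel (the clip limit and tile size below are illustrative values, not the paper's settings) is:

```python
import cv2
import numpy as np

def enhance_overexposed(image_rgb: np.ndarray) -> np.ndarray:
    """Apply CLAHE to the luminance channel of an overexposed image."""
    yuv = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2YUV)
    # clipLimit and tileGridSize are illustrative, not the paper's values.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    yuv[:, :, 0] = clahe.apply(yuv[:, :, 0])
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB)
```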
The results of the ULIE method are shown in Figure 2, where the left image is the non-optimized image and the right image is the optimized image. Figure 2a shows that the image has improved overall illumination, with a clearer boundary between the leakage and the background. The ULIE method effectively enhances the image contrast and clarity. Figure 2b reveals that the illumination of the optimized image is more balanced. The ULIE method alleviates the phenomenon of local overexposure, which further proves that our method effectively avoids gray jump [30]. We perform convolution and downsampling operations on the yellow and red box regions, obtaining the corresponding feature maps of the non-optimized and optimized images. The details of the optimized feature map are more obvious. In summary, through the above process, our method can effectively enhance the details and textures of sauce-packet images under insufficient illumination and overexposure.
2.2. ISLS Network Details for Leakage Segmentation
In the leakage region segmentation stage of the ISLS method, we propose our network with the EdgeNext backbone, which comprises only 1.3M parameters [31]. EdgeNext integrates the advantages of the Convolutional Neural Network (CNN) and the Vision Transformer (ViT): the CNN extracts local features of images using convolution operations [32], and the ViT [33] captures global contextual information of images. The network is end-to-end, with an input channel dimension of 3 (i.e., RGB) and an input image size of 128×512. Our overall network structure is shown in Figure 3, which includes the encoder and decoder.
The encoder fuses the local and global representations. Firstly, the n×n Conv encoder consists of three modules. The n×n Conv encoder utilizes adaptive kernels to adjust the size of the convolutional kernels according to the network layer, which aims to decrease computational complexity and enhance the receptive field [34]. Secondly, the SDTA encoder combines spatial and channel information. The SDTA encoder utilizes deep transposed convolution and adaptive attention mechanisms, which improve the capture of local and global representations. Thirdly, the information of the deep- and shallow-layer feature maps is fused by our MFFM, which improves the feature extraction performance of the encoder. The structure of our MFFM is shown in Figure 4.
Specifically, we extract four feature maps of different sizes from the encoder, denoted as F1, F2, F3, and F4. Firstly, the MFFM adjusts the channel number of the feature map F1 to 512 through a 1×1 convolution and 4× downsampling. Next, the MFFM applies operations similar to those for F1 to the feature map F2, with 2× downsampling, as illustrated in Equation (9):

F1′ = MaxPooling(Conv(F1)),  F2′ = MaxPooling(Conv(F2)),   (9)

where MaxPooling represents the downsampling process through the maximum-pooling operation, and Conv represents the 1×1 convolution operator.
Secondly, F3 has the same spatial size as the output. Therefore, the MFFM only needs to utilize a 1×1 convolution to adjust the channel number of the feature map F3 to 512, as shown in Equation (10):

F3′ = Conv(F3).   (10)
Thirdly, the channel number of F4 is the same as that of the output; therefore, the MFFM performs only an upsampling operation on the feature map F4. The upsampling is achieved using nearest-neighbor interpolation, as depicted in Equation (11):

F4′ = Up(F4),   (11)

where Up represents the 2× upsampling operation.
Through the above operations, the feature maps F1′, F2′, F3′, and F4′ are obtained. Finally, we fuse the feature information of F1′, F2′, F3′, and F4′ to output the feature map Fout, as shown in Equation (12):

Fout = F1′ + F2′ + F3′ + F4′.   (12)
The reason for the small parameter number of the MFFM is that the Conv operator employs a 1×1 convolution, the upsampling operation uses nearest-neighbor interpolation, and the downsampling is achieved through max-pooling. The 1×1 convolution adds only a small number of parameters, while the upsampling and downsampling operations do not add any parameters. Compared to the Feature Pyramid Network (FPN) [35] and AF-FPN [36], the parameter number of our MFFM is relatively small. That is, our MFFM has only 0.23M parameters, which is only 3.05% of the parameter number of FPN and AF-FPN.
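A minimal PyTorch sketch of the fusion described by Equations (9)–(12) is given below; the encoder channel widths and the use of element-wise addition for the final fusion are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFFMSketch(nn.Module):
    """Simplified multi-scale feature fusion following Equations (9)-(12)."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=512):
        super().__init__()
        # 1x1 convolutions adjust F1, F2, F3 to the output channel width.
        self.conv1 = nn.Conv2d(in_channels[0], out_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(in_channels[1], out_channels, kernel_size=1)
        self.conv3 = nn.Conv2d(in_channels[2], out_channels, kernel_size=1)

    def forward(self, f1, f2, f3, f4):
        # Equation (9): 1x1 conv, then 4x / 2x max-pooling downsampling.
        f1p = F.max_pool2d(self.conv1(f1), kernel_size=4)
        f2p = F.max_pool2d(self.conv2(f2), kernel_size=2)
        # Equation (10): 1x1 conv only, spatial size already matches.
        f3p = self.conv3(f3)
        # Equation (11): 2x nearest-neighbor upsampling only.
        f4p = F.interpolate(f4, scale_factor=2, mode="nearest")
        # Equation (12): fuse the aligned feature maps.
        return f1p + f2p + f3p + f4p
```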
The decoder includes the skip-connection and the SSAM. The SSAM in stage 1 of the decoder aims to improve the identification of salient features. The SSAM keeps high resolution in both the channel and spatial branches, which enhances the salient features of the sauce-packet ROI. Specifically, the SSAM contains two modules, the channel-only module and the spatial-only module, as shown in Figure 5. For the channel-only module of the SSAM, the output Xch_out is generated by fusing the feature maps obtained from both the skip connection and the channel attention mechanism. The specific computational process of the channel attention mechanism is shown in Equation (13), where Xch and Xch_out represent the input and output of the SSAM channel-only module, Conv represents the convolution operator, Re represents the reshape operator, and S represents the softmax operator. Additionally, LN represents layer normalization, and Sig represents the sigmoid function.
The process of the SSAM spatial-only module is similar, with one part coming from the skip-connection and the other part coming from the spatial attention mechanism, as shown in Equation (14), where Xsp and Xsp_out represent the input and output of the SSAM spatial-only module. GP, S, and LN represent the global pooling operator, the softmax operator, and layer normalization, respectively. ⊗ and × represent the tensor product and multiplication operations, respectively.
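Since the exact composition of Equations (13) and (14) depends on details shown in Figure 5, the following PyTorch sketch is only one plausible reading of the channel-only branch, combining the operators named above (convolution, reshape, softmax, layer normalization, sigmoid); it is not the paper's exact SSAM, and the spatial-only branch would analogously use global pooling and a softmax over spatial positions:

```python
import torch
import torch.nn as nn

class ChannelOnlyAttentionSketch(nn.Module):
    """One plausible reading of the SSAM channel-only branch (Equation (13))."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_v = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.conv_q = nn.Conv2d(channels, 1, kernel_size=1)
        self.conv_out = nn.Conv2d(channels // 2, channels, kernel_size=1)
        self.ln = nn.LayerNorm(channels)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        b, c, h, w = x.shape
        v = self.conv_v(x).reshape(b, c // 2, h * w)            # Conv + reshape
        q = self.softmax(self.conv_q(x).reshape(b, h * w, 1))   # Conv + reshape + softmax
        z = torch.bmm(v, q).reshape(b, c // 2, 1, 1)            # channel descriptor
        z = self.conv_out(z).reshape(b, c)                      # Conv
        weights = torch.sigmoid(self.ln(z)).reshape(b, c, 1, 1) # LN + sigmoid
        return x * weights                                      # reweight the channels
```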
Stages 2 to 5 of our proposed network decoder contain the SSAM and the Concat operator. The Concat operator concatenates the feature maps of two branches along the channel dimension, as shown in the green box of Figure 3. Specifically, one branch feature map comes from the SSAM output, which is upsampled 2×. The other branch feature map comes from the skip connections, which can avoid gradient vanishing and improve the training speed of the network [37].
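A short sketch of this decoder step (tensor sizes and channel counts are placeholders) is:

```python
import torch
import torch.nn.functional as F

# Illustrative decoder step: upsample the SSAM output 2x and concatenate it
# with the skip-connection feature map along the channel dimension.
ssam_out = torch.randn(1, 256, 16, 64)    # placeholder SSAM output
skip_feat = torch.randn(1, 128, 32, 128)  # placeholder skip-connection features

up = F.interpolate(ssam_out, scale_factor=2, mode="nearest")
fused = torch.cat([up, skip_feat], dim=1)  # shape: (1, 384, 32, 128)
```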