1. Introduction
Chip pad inspection is the basis of chip pad alignment inspection, a critical step in the semiconductor manufacturing process. Accurate and timely pad inspection determines how reliably chip alignment can be detected and corrected, and it has a significant impact on subsequent process decisions.
In practice, traditional alignment inspection relies on manual measurement under a microscope, which cannot meet the precision and efficiency demands of industrial assembly lines. In 2010, Chen et al. [1] used pattern recognition and image processing techniques to quickly locate and track patterns for automatic wafer alignment. In 2012, Xiao [2] proposed a simplified template-matching algorithm that extracts the wafer cutting channel from an edge-detected image and obtains the center line of the channel by straight-line fitting to position the wafer. In 2013, Wu et al. [3] proposed feature selection and a two-stage classifier for weld joint detection, improving the recognition rate of weld joints by extracting color features, average grey level, and template-matching features. In 2017, Xu et al. [4] proposed Fourier-transform-based direction alignment and least-squares regression for positional pre-alignment, which improves pre-alignment accuracy. In 2022, Wang et al. [5] proposed an adaptive Kalman filter with a dual-rate structure for uncalibrated visual localisation of wafer chips in LED packaging: the adaptive Kalman filter estimates the varying calibration parameters, and the dual-rate structure compensates for the visual delay and synchronises the multi-rate sensors.
With the development of computer vision in recent years, various deep-learning-based methods have been proposed for inspection. In 2019, Yu et al. [6] proposed a convolutional-neural-network-based method for pattern recognition and analysis of wafer defects, which inspects wafer defects with an 8-layer CNN model. In 2020, Chien et al. [7] proposed a deep convolutional neural network method that trains a Faster R-CNN model to provide a reliable machine vision alternative to manual inspection. In 2021, Bian et al. [8] proposed a method based on an improved YOLOv5s, which improves detection accuracy by building an infrared image database for model training and introduces an ECA module to enhance the feature extraction capability of the network. In 2022, Xu et al. [9] constructed an attention mechanism with long-range dependencies to strengthen the correlation between features and proposed a design guideline for a single attention layer, which reduces the hardware requirements in real scenarios. A target detection algorithm can quickly detect the chip pads and thus indirectly determine the alignment of the chip. Target detection uses techniques such as image processing and convolutional neural networks to classify and locate targets in images or videos.
Although previous researchers have made progress in chip pad detection, most studies consider relatively uniform settings that differ greatly from the environment of actual industrial production. In reality, chip pads tend to be more numerous, more densely arranged, and smaller in size. Although traditional CNN models can obtain relatively good accuracy by stacking layers, their large number of parameters and complex structure prevent inference and deployment on edge devices with limited computational resources. A chip pad detection network must therefore be lightweight while maintaining detection performance. In 2015, He et al. [10] proposed residual connections, which effectively alleviate the vanishing and exploding gradients caused by deepening the network. In 2017, Huang et al. [11] proposed dense connections, which address the parameter redundancy of deeper networks and further reduce the model size and the number of network parameters. Subsequently, Howard et al. [12, 13, 14] proposed depthwise separable convolution, which splits a convolution into a depthwise (channel-by-channel) convolution and a pointwise (1 × 1) convolution and reduces the computation to roughly 1/N + 1/k² of that of an ordinary convolution with a k × k kernel and N output channels; Zhang et al. [15, 16] applied channel shuffle to grouped convolutions, which lets information flow between groups that were previously isolated and enhances the expressive power of the model.
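The saving from depthwise separable convolution and the effect of channel shuffle can be checked with a few lines of arithmetic. The following is a minimal sketch, not the implementation used in the cited works: the function names and the 80 × 80 × 128 feature-map size are hypothetical, and bias terms and normalisation layers are ignored. The exact saving depends on the kernel size and the number of output channels.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k standard convolution on an h x w output map."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Cost of a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def channel_shuffle(channels, groups):
    """Interleave channels across groups: view as (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Hypothetical layer: 80 x 80 output, 128 -> 128 channels, 3 x 3 kernel.
    ratio = (depthwise_separable_macs(80, 80, 128, 128, 3)
             / standard_conv_macs(80, 80, 128, 128, 3))
    print(f"separable / standard cost: {ratio:.4f}")   # equals 1/128 + 1/9
    # Channels 0-2 belong to group 0 and channels 3-5 to group 1.
    print(channel_shuffle(list(range(6)), groups=2))
```

After the shuffle, every group of the next grouped convolution receives channels from every group of the previous one, which is what allows information to cross group boundaries.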
Target detection algorithms fall into two main categories. The first is region-based two-stage detectors, such as Faster R-CNN [17] and R-CNN [18], which first generate multiple candidate regions from an image and then extract features and perform classification and regression for each region to improve detection accuracy. Their disadvantages are a large number of network parameters, complex models, and slow detection speed, which make them unsuitable for real-time detection scenarios. The second is single-stage detectors, represented by SSD [19] and YOLO [20, 21, 22, 23], which predict and classify candidate boxes directly on the image; they are fast and structurally simple, and therefore better suited to real-time detection. The most widely used at present is YOLOv5 [24], the fifth generation of the YOLO series, whose YOLOv5s version achieves a good balance between detection accuracy and model size. Therefore, in this paper, YOLOv5s is used as the baseline network for chip pad alignment detection. However, when the YOLOv5s network is applied directly to the chip pad dataset, its detection of small targets is unsatisfactory. Zhu et al. [25] were the first to introduce the Transformer [26] into the YOLO network; its self-attention mechanism captures contextual information through global modelling and improves the detection accuracy of small targets. The works in [27, 28, 29, 30, 31] incorporate the Swin Transformer [32] into YOLO networks and use shifted windows to reduce the computational cost of global modelling. However, for both the Transformer and the Swin Transformer, the heavy computational cost and the highly complex model structure make the network impractical to deploy on embedded devices.
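The difference in cost between global and windowed self-attention can be made concrete with the complexity estimates commonly given for the Swin Transformer. The sketch below is illustrative only: the function names, the 80 × 80 × 128 feature map, and the 8 × 8 window are hypothetical choices, and the counts omit softmax and normalisation.

```python
def global_attention_flops(h, w, c):
    """Approximate cost of self-attention over all h*w tokens (quadratic in h*w)."""
    tokens = h * w
    projections = 4 * tokens * c * c      # query, key, value and output projections
    attention = 2 * tokens * tokens * c   # attention scores plus weighted sum
    return projections + attention


def window_attention_flops(h, w, c, m):
    """Approximate cost when attention is limited to m x m windows (linear in h*w)."""
    tokens = h * w
    projections = 4 * tokens * c * c
    attention = 2 * m * m * tokens * c    # each token attends to m*m tokens only
    return projections + attention


if __name__ == "__main__":
    # Hypothetical feature map: 80 x 80 tokens with 128 channels, 8 x 8 windows.
    full = global_attention_flops(80, 80, 128)
    windowed = window_attention_flops(80, 80, 128, 8)
    print(f"global: {full:,}  windowed: {windowed:,}  ratio: {full / windowed:.1f}")
```

Even with windowing, the projection term remains, which is one reason attention-based backbones stay heavy relative to the convolutional modules adopted later in this paper.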
Existing work on target detection has improved detection performance by introducing more efficient convolution and attention modules, but most of it does not consider the relationship between image resolution and the receptive field of the features in small-target detection. Stacking convolution kernels enlarges the receptive field and captures richer semantic information about the target, but an overly deep network increases the network parameters and computation, and enlarging the receptive field in this way usually involves downsampling, which lowers the resolution and weakens the perception of image detail, thus harming small-target detection. At the same time, the surroundings of a small target often provide useful contextual information that helps to detect it.
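The trade-off between receptive field and resolution can be illustrated with the standard recurrence for stacked convolutions. This is a sketch under stated assumptions: the helper `receptive_field` and the two layer stacks are hypothetical and do not describe the network proposed in this paper.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for a stack of conv layers.

    Each layer is a tuple (kernel_size, stride, dilation). The cumulative
    stride is the factor by which the feature map is downsampled.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump   # growth measured in input pixels
        jump *= stride                         # spacing between adjacent outputs
    return rf, jump


if __name__ == "__main__":
    strided = [(3, 2, 1), (3, 2, 1), (3, 2, 1)]   # three 3 x 3 convs, stride 2
    dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]   # three 3 x 3 convs, dilation 1, 2, 4
    for name, stack in (("strided", strided), ("dilated", dilated)):
        rf, jump = receptive_field(stack)
        # Feature-map width for a hypothetical 640-pixel-wide input.
        print(f"{name}: receptive field {rf}, feature map width {640 // jump}")
```

Both stacks reach the same receptive field, but the strided stack reduces the feature map by a factor of 8 while the dilated stack keeps full resolution, at the price of computing every layer on the larger map.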
To address the chip pad detection problem, this paper proposes a lightweight real-time detection network based on YOLOv5s, which not only ensures the detection accuracy of small target chip pads, but also effectively reduces the network parameters to meet the requirements of automated production. The main contributions of our work are as follows:
1. Using a lightweight convolution module (GhostNet) and an attention module (CBAM), we improve the feature extraction module of the backbone network, which effectively reduces the parameter redundancy and computational complexity of feature extraction, enhances the network’s attention to the target, and improves the detection performance of the network.
2. We propose a lightweight improvement for small-target detection of chip pads. Starting from the conflict between resolution and receptive field, we double the resolution of the customised prediction head by fusing shallower backbone feature layers and removing the last extraction layer. We enlarge the receptive field by introducing dilated (atrous) convolution into the spatial pyramid, which strengthens the perception of contextual information around the key features of small targets and improves the detection performance for small chip pad targets.
3. We design a correction method for real-time detection and deploy the improved network on embedded devices, combining deep learning target detection with image processing techniques to achieve real-time alignment detection and anomaly correction on industrial assembly lines.
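To give a sense of the parameter reduction mentioned in the first contribution, the sketch below compares a standard convolution with a Ghost module as formulated in the original GhostNet design, where part of the output channels come from an ordinary convolution and the rest from cheap depthwise operations. It is an assumption that the module used here follows that formulation; the ratio `s`, the cheap kernel size `d`, and the channel counts are hypothetical, and bias and normalisation parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a k x k standard convolution."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k, s=2, d=3):
    """Weights of a Ghost module with ratio s and d x d cheap operations.

    c_out / s intrinsic maps come from an ordinary k x k convolution; each
    intrinsic map then yields (s - 1) extra maps via a d x d depthwise filter.
    """
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by the ratio s")
    intrinsic = c_out // s
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (s - 1) * d * d
    return primary + cheap


if __name__ == "__main__":
    # Hypothetical layer: 128 -> 128 channels with a 3 x 3 kernel.
    full = standard_conv_params(128, 128, 3)
    ghost = ghost_module_params(128, 128, 3)
    print(f"standard: {full}, ghost: {ghost}, reduction: {full / ghost:.2f}x")
```

With `s = 2` the reduction approaches a factor of two, since the cheap operations add only a small number of weights on top of the halved primary convolution.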