4.5. Qualitative Evaluations
Figure 5 shows the qualitative comparison results on the Paris StreetView dataset. EC generated an edge map corresponding to the final output and identified the global arrangement and long-range features by employing dilated convolutional layers. However, minute textural details were not extracted, resulting in local deviations from the target. For example, in the first row, the straight line above the front sign was not recovered, which corrupted the distinct boundary. In the second to fourth rows, the texture of the tree, the structure of the bricks around the highest middle window, and the pattern of the building surface were distorted. RFR sequentially filled in the missing pixels through recurrent feature reasoning, which produced high-fidelity visual effects; however, serious artifacts developed. For example, in the first to third rows, wrinkled patterns were produced in the large masked regions. In the fourth row, checkerboard artifacts were found on the bricks, which is a common problem of transposed convolutional layers [69]. CTSDG coupled the texture and structure to each other; however, the boundaries were blurred owing to the implicit usage of the structure. For instance, in the first to third rows, the straight lines were obscured, which distorted the distinct boundaries of semantically different regions. In the fourth row, cross-patterned deformities developed throughout the region. SPL conducted knowledge distillation on pretext models and modified the features for image inpainting. This not only helped in understanding the global context but also provided structural supervision for the restoration of the local texture. Nonetheless, the local texture was still smoothed out, which resulted in blurring effects. For example, although solid lines were retained in all the rows, the texture of the leaves and the brick patterns were not retained in the second to fourth rows. In contrast, our proposed method restored images with a suitable balance of low- and high-level features. In all rows, the pixels were filled in with clear boundaries and a semantically plausible texture, as seen in the second row. This result is attributed to the use of ESA, whereby the model obtained hints on the texture from all areas of the corresponding feature maps.
Figure 6 shows the qualitative comparison results on the CelebA dataset. EC obtained unsatisfactory results, i.e., the facial structures were extremely distorted. For instance, in the first row, the position of the left eye was not symmetrical with the right eye. In the second row, the nose did not have the appropriate shape, while the mouth was barely visible. In the fourth row, although the eyes and nose had the proper silhouettes, the mouth was hardly visible. RFR provided better results than EC, but its final outputs were still unsatisfactory. Although the model generated eyes with a normal shape, the mouths in all the rows were distorted, which degraded the overall restoration quality. CTSDG had the least favorable results. In all rows, the facial structures were not retained owing to blurring effects, and checkerboard artifacts were found in all inpainted regions. SPL sufficiently recovered the images, but some implausible regions remained. For instance, in the first row, the left eye was noticeably smaller than the right eye. In the fourth row, the wrinkles and beard on the face disappeared owing to excessive smoothing. In contrast, our model generated images with the best quality. For example, in the first row, the left eye was similar in size to the right eye and was at a suitable location. In the third row, unlike the other models, our method generated a mouth with teeth that was very close to the ground-truth image. In the fourth row, the wrinkles and beard were retained along with a proper mouth, yielding the smallest perceived difference among the compared methods.
Figure 7 shows the qualitative comparison results on the Places2 dataset. EC restored images with an acceptable quality using an edge map; however, some areas were not appealing. For example, in the third row, the rails of the roller coaster were connected by curved lines, which is unrealistic. In the fourth row, the leaves filled with generated pixels did not have a consistent color compared with the other regions. CTSDG produced images with indistinct boundaries, i.e., unrealistic results. For instance, in the second row, the structure of the window was not fully retained owing to the blurriness of the bottom-left region. In the third row, the ride paths appeared disconnected, which is unrealistic. In the fourth row, the texture of the leaves was inconsistent with, and not harmonized with, the surrounding areas. CR-Fill trained the generator with an auxiliary contextual reconstruction task, which encouraged the generated output to remain plausible even when reconstructed from the surrounding regions. Hence, CR-Fill produced images with an acceptable quality; however, some regions still appeared inconsistent. For instance, in the first and third rows, the boundaries of the trees were not obvious, and the color of the middle-right part of the ride was inconsistent. SPL produced outputs with distinct lines connecting the masked regions; however, key textures and patterns were lost owing to excessive smoothing. For example, in the first, third, and fourth rows, the textures of the objects were blurred. The generated image in the second row contained checkerboard artifacts that distorted the texture and quality of the image. In contrast to the other methods, our proposed model achieved a balance between the apparent boundaries and textures of various objects. For instance, all the rows had straight lines separating semantically different areas. Furthermore, the textures of the objects were effectively restored, leading to plausible results.
In summary, our proposed method effectively balanced low- and high-level feature restoration, demonstrating its generalizability in the qualitative evaluations.
4.6. Quantitative Evaluations
To analyze the inpainting results of our proposed method and those of other models, we applied four different metrics: Fréchet inception distance (FID) [70], learned perceptual image patch similarity (LPIPS) [71], structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). The FID is a widely used quantitative metric in the field of image generation that measures the Wasserstein-2 distance between the feature distributions of the generated and target images, extracted by a pretrained Inception-V3 model [46]. Except for the FID, the other metrics are full-reference image quality assessments, in which restored images are compared with their corresponding ground-truth images. The LPIPS evaluates the restoration effect by computing the similarity between the deep features of two images extracted by AlexNet [72]. The SSIM measures the difference between two images in terms of their luminance, contrast, and structure. Finally, the PSNR quantifies the restoration performance by measuring the pixel-wise distance between two images. The quantitative comparison results on the Paris StreetView, CelebA, and Places2 datasets are listed in Table 2, Table 3, and Table 4, respectively. For all the results, the best and second-best values are labeled in bold and underlined, respectively (↓ lower is better; ↑ higher is better).
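For reference, these metrics admit the following standard formulations (the notation is generic rather than specific to our implementation). With $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the mean and covariance of the Inception-V3 features of the real and generated images,
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$
and, for a ground-truth image $x$ and its restored counterpart $y$,
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \qquad \mathrm{PSNR}(x, y) = 10 \log_{10}\!\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)},$$
where $\mu$, $\sigma^2$, and $\sigma_{xy}$ are the (local) means, variances, and covariance, $c_1$ and $c_2$ are small stabilizing constants, and $\mathrm{MAX}$ is the maximum possible pixel value.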
On the Paris StreetView dataset, our PEIPNet method ranked first or second for all metrics. For the two lower mask rates, PEIPNet achieved the best results in terms of the FID and LPIPS. However, for the two higher mask rates, RFR had the best FID and LPIPS results, while PEIPNet had the second-best. For the SSIM and PSNR, SPL and PEIPNet had the best and second-best results, respectively, for all mask rates. The excellent performance of PEIPNet is attributed to the few artifacts in its generated images; the textures of different objects were also retained, to which the FID and LPIPS are highly sensitive. Hence, PEIPNet can fill in small masked regions well, but its strength decreased in the large-hole inpainting task. This is because the DDCM and ESA encourage PEIPNet to obtain various meaningful hints from different regions of the feature maps when the masked areas are small, by identifying global long-range and local features with dilated convolution and nonlocal attention. If there are insufficient valid regions from which to obtain information, this mechanism reduces the performance of PEIPNet.
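To make the long-range mechanism concrete, the following is a minimal PyTorch sketch of stacked dilated convolutions; the dilation rates (1, 2, 4) and the channel width of 64 are illustrative assumptions, not the actual configuration of our DDCM.

import torch
import torch.nn as nn

# Three 3x3 convolutions with dilation rates 1, 2, and 4. Stacking them
# grows the receptive field to 15x15 pixels at no extra parameter cost,
# which is how dilated layers capture long-range context.
dilated_stack = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 64, 32, 32)  # dummy feature map
y = dilated_stack(x)            # spatial size preserved: (1, 64, 32, 32)

Setting the padding equal to the dilation rate keeps the spatial resolution unchanged, so such a stack can be inserted into an encoder without altering the feature-map size.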
On the CelebA dataset, PEIPNet ranked first or second for the LPIPS, SSIM, and PSNR. EC had the best FID for all mask rates, followed by SPL. However, the FID difference between SPL and PEIPNet was very small, except at the highest mask rate. For the LPIPS, PEIPNet had the best results at the first three mask rates and the second-best at the highest mask rate; RFR showed the opposite pattern, ranking second-best at the first three mask rates and best at the highest. For the SSIM and PSNR, SPL had the best values, followed by PEIPNet. As on the Paris StreetView dataset, the gap in inpainting performance relative to the best method widened as the mask rate increased, for the reason stated above.
On the Places2 dataset, PEIPNet again had either the best or second-best LPIPS, SSIM, and PSNR. Unlike on the CelebA dataset, PEIPNet also had the second-best FID at two of the mask rates. For the LPIPS, PEIPNet had the lowest values at the first three mask rates and the second lowest at the highest mask rate; the opposite is true for SPL. For the SSIM and PSNR, PEIPNet had the second-highest values for all mask rates, while SPL had the best outcomes. The same widening gap relative to the best result with increasing mask rate was also observed on the Places2 dataset.
The proposed PEIPNet method showed strong performance on all metrics: FID, LPIPS, SSIM, and PSNR. In most cases, PEIPNet had the best or second-best outcome, a consistency not observed in the other methods. In particular, PEIPNet achieved at least the second-best results on the Paris StreetView dataset, indicating the advantage of having a small number of model parameters when training with a limited number of samples. Thus, the quantitative evaluations confirmed the generalizability of the proposed method.
4.7. Ablation Studies
To verify the effects of the introduced DDCM and ESA, we conducted ablation studies on the Paris StreetView dataset. Specifically, we divided the DDCM into two parts for analysis: dilated convolution and the dense block. To reduce the training time, we set the batch size to eight for all combinations.
The quantitative results with different combinations of the DDCM and ESA on the Paris StreetView dataset are listed in Table 5. For the DDCM, eliminating the entire module degraded the model performance: compared with the original model, the average FID and LPIPS increased by 5.3607 and 0.0102, while the SSIM and PSNR decreased by 0.0083 and 0.4658, respectively. A comparison of the two parts of the DDCM showed that applying dilated convolutional layers yielded better results for all metrics, which indicates the importance of long-range feature extraction in the image-inpainting task. ESA also plays a crucial role: removing it increased the average FID and LPIPS by 2.32 and 0.0034 and decreased the SSIM and PSNR by 0.0025 and 0.0676, respectively. However, the performance decline from removing ESA was smaller than that from removing the DDCM, indicating the dominance of the DDCM in the proposed method.
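For intuition on the attention component being ablated, the snippet below sketches a generic non-local self-attention block of the kind ESA builds on. It is a textbook formulation rather than the efficient variant used in ESA, and the channel sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalAttention(nn.Module):
    """Generic non-local (self-attention) block over spatial positions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inter = channels // reduction
        self.query = nn.Conv2d(channels, inter, kernel_size=1)
        self.key = nn.Conv2d(channels, inter, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c')
        k = self.key(x).flatten(2)                    # (b, c', hw)
        v = self.value(x).flatten(2).transpose(1, 2)  # (b, hw, c)
        # Every spatial position attends to every other position, which is
        # what lets the model borrow texture hints from distant regions.
        attn = F.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (b, hw, hw)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out  # residual connection

x = torch.randn(1, 64, 32, 32)
y = NonLocalAttention(64)(x)  # shape preserved: (1, 64, 32, 32)

The hw × hw attention map is also why removing this component saves noticeable computation, as shown in the complexity analysis below.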
The qualitative results with different combinations of the DDCM and ESA on the Paris StreetView dataset are shown in Figure 8. Unlike the original model, the remaining combinations did not retain the streetlight structure; specifically, the pillar was disconnected from the head of the lamp, which is unrealistic. The original model also provided the best restoration of the texture of the leaves, demonstrating the strength of the proposed modules.
Finally, we calculated the complexities of the different model combinations, as described in Section 4.4 and summarized in Table 6. The contribution of dilated convolution was minor, as there was almost no change in memory when it was eliminated. Removing the dense block had a greater impact on memory than removing dilated convolution, but the change remained insignificant. On the other hand, eliminating ESA had a significant impact, yielding a 4.51% reduction in the computational cost. Thus, adopting self-attention remains costly despite its structural efficiency.
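As a side note, the parameter counts behind such complexity comparisons can be reproduced with a few lines of PyTorch; the layer below is a stand-in example, not a PEIPNet component, and measuring FLOPs would additionally require a profiler.

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in example: a single 3x3 convolution (64 -> 64 channels).
layer = nn.Conv2d(64, 64, kernel_size=3, padding=1)
print(count_parameters(layer))  # 64*64*3*3 weights + 64 biases = 36,928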