1. Introduction
With the improvements and innovations of the last few years in the fields of image inpainting and machine learning, an increasing amount of altered and fake media content has invaded the internet. In this paper we conduct a thorough review of current inpainting mechanisms and survey the state-of-the-art techniques for detecting these alterations.
Nowadays, in our modern society, we rely greatly on technology. This can be seen as an advantage but also as a drawback. The evolution of technology has impacted the lives of each one of us: we are just a click away from an almost infinite amount of information that can be accessed at any time. Most of the time, people rely almost completely on the information they find online and form their opinions based on those facts; unfortunately, this is not always a safe approach [5]. The authenticity of the information found online can sometimes be distorted or even false, which is why its accuracy always needs to be checked. We tend to believe that false information can be transmitted only through textual content, but this is not entirely the case: images and videos are also a tool for transmitting information. We use them daily, and we have become accustomed to believing everything we see to be the truth. The most powerful example is the images and videos that we see and upload on social networks. It is therefore equally important to check the authenticity of images as it is to check the trustworthiness of a written text. All the reasons stated above imply that there is a great need to detect forgeries in images and videos [7].
The science area focusing on tampered image and video detection is called media forgery detection. The area is quite vast and attracts increasing interest, as described in recent bibliometric studies [1], [2], [3] (see Figure 1). The forgery detection methods can be divided into two categories: active and passive. For the active methods, the main focus is embedding some metadata into the images at the time of creation, which can later be used to validate the authenticity of the image. On the other hand, passive methods, sometimes also called blind methods, do not offer such specific information, so one has to rely entirely on the possible artifacts introduced in the tampering process.
Figure 1.
Trends in forgery detection during the last years.
If one looks at the tampering process according to the father of digital image forensics, Hany Farid, as mentioned in [4], forgery detection can be performed through:
Active methods: briefly, the main idea is to incorporate, at the moment of image acquisition, various information that can be validated later on.
Passive methods: this area is quite broad. Some of these methods focus on the peculiarities of image capture, camera identification, noise detection, or image inconsistencies, or on specific types of traces usually introduced by the forgery mechanism – e.g., for a copy-paste forgery (combining information from multiple images), traces like inconsistent coloring, noise, or blur might be noticed.
A more detailed schema based on the above-mentioned categories can be seen in Figure 2, built on the comprehensive work done by Pawel Korus in his research paper [5]. Based on our analysis, we structured the categorization a bit differently. The first difference is related to camera traces, where we group all steps/artifacts that might influence the outcome of the resulting image. We considered this categorization important because camera traces can be used to determine the forged area. As can be noticed in the later chapters, authors initially tried to focus on only one type of artifact; however, recent studies suggest that the best approach is an ensemble method. Compared to the initial categorization done by Korus in his review, the image Copy-Move class is here further subdivided into several sub-categories. We consider that all types of image operations performed on a source image, without blending information from other images, fall under this category. Therefore, the new Copy-Move category contains items like Object Removal, Resample, Blurring, Gamma correction, etc. This is very important, because the Copy-Move forgery is thereby redefined as an operation done on an image based solely on the statistical data of the image itself; e.g., when we inpaint an image, we fill in data based on the overall statistical analysis of that particular image.
Figure 2.
Image forgery detection overview.
Looking from the “attacking” point of view, for the passive methods we can arrive at another classification, based on the traces that forgery methods might introduce:
Copy-paste methods: the picture is altered by copying parts of the original image into the same image. Resampling and rescaling can be included here as well; at their core, they are not alteration methods themselves but are rather used as a step when applying copy-paste or splicing methods.
Splicing: the forged media is obtained by combining several images into one; e.g., taking the picture of someone and inserting it into another image.
The current paper conducts a deep evaluation of current methods for detecting inpainting/object removal in images and videos. The material is divided into the following parts. The first part gives a full review of current state-of-the-art inpainting methods, with a deep focus on object removal; we analyze the pros and cons of each method, review the datasets, and examine how the methods behave in real-world scenarios and how they compare to others in terms of quality. After this thorough review, we shift our focus to the detection of inpainting as a general problem. We then analyze the forgery detection methods, starting with older variants, their main ideas, and their performance. We continue by analyzing other relevant forgery detection mechanisms and investigating whether they can be applied successfully to object removal tasks. For each relevant method we perform a comprehensive analysis of the pros and cons and also examine how it works outside the tested datasets. Furthermore, additional analysis is performed on the available datasets used for evaluating these methods and the issues they raise. Lastly, all the relevant findings are briefly summarized and the relevant areas of improvement are diligently addressed.
2. Inpainting methods
Image inpainting is sometimes called an inverse problem, and usually these types of problems are ill-posed. In the process of inpainting, especially over large areas, all three conditions specified by Hadamard (existence, uniqueness, and stability of the solution) are violated – thus the problem is a so-called inverse problem. Generally, the problem of inpainting consists in finding the best approximation to fill in a region inside the source image and comparing it with the ground truth. All the algorithms which tackle this problem begin with the assumption that there must be some correlation between the pixels inside the image, either from a statistical or from a geometrical perspective. This work differs from the one in [6] by its deep focus on inpainting methods exclusively.
Once the mathematical concepts are formulated, the possible solutions which have been proposed can be identified. An image can be defined as a collection of points, each point having a set of associated values. Starting with a mathematical approach, the image model is defined first, and the inpainting problem follows.
The inpainting problem can be stated as follows. Let I denote the original image, U the area to be reconstructed, and K the known (undamaged) part of the image. Denote by R the reconstructed image and by r(U, K) the reconstruction of the area U based on the known data K. The aim of image inpainting is to reconstruct the area U as well as possible when compared with the original image I – in other words, to minimize the differences between the original image I and the reconstructed image R. Thus, having formulated the mathematical concepts for image inpainting, we shall use as a starting point the previous reviews done in [7], [8] and, most recently, in [9]. All of these authors categorize the inpainting methods as follows:
2.1. Diffusion based methods
The term diffusion (from a chemistry point of view) denotes the process by which particles in a region of higher concentration tend to move to a region of lower concentration. From a mathematical point of view, let Ω ⊂ R² denote the entire domain of the image f, and let D ⊂ Ω be the missing region. The basic idea is then to propagate information from the border of the missing region into it, in such a way that the border of the missing region is no longer visible to the human eye. The border of the missing region is denoted ∂D; the figure below illustrates the inpainting steps.
Figure 3.
Process of inpainting-based on PDE method.
Several authors [10], [9] have suggested a more detailed splitting of the diffusion inpainting class into sub-categories such as isotropic, anisotropic, total variation, and PDE based. For the simplicity of this paper, we intended to organize all these methods under one big umbrella, because their starting points are the main ideas observed in [11], in which the inpainting process is inspired by the “real” inpainting of canvas and consists of the following steps:
Global image properties determine how to fill in the missing area
The structure of the surrounding area is continued into the missing region (all edges are preserved)
The area D is split into regions, and each region is filled with the matching color (the color information is propagated from the bounding area into the rest of the D area)
Texture is added
The first step in almost all inpainting algorithms is to apply some sort of regularization. It can be isotropic, with rather poor results, anisotropic, or any other type of regularization. This is done to ensure that image noise is removed and thus does not interfere with the computation of the structural data needed in the next step.
In order to apply diffusion, the structural and statistical data of the low-level image must be identified. Based on this data, if we are on an edge crossing the ∂D area, we must preserve the identified edge; if the ∂D area belongs to a homogeneous region, we can simply replicate the pixel information from the border. To retrieve the image geometry, one can use isophotes – curves on the surface connecting points of the same value. For this, one first computes the gradient at each point of the border area and then takes the direction normal to the discretized gradient vector.
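As an illustration (our own sketch, not code from the cited papers), the isophote direction can be estimated with a few lines of NumPy by rotating the discretized gradient by 90 degrees:

```python
import numpy as np

def isophote_direction(image: np.ndarray):
    """Return a unit vector field along the isophotes of a grayscale image.

    Isophotes (curves of constant intensity) run perpendicular to the
    gradient, so rotating the discretized gradient by 90 degrees gives
    the direction along which information should be propagated.
    """
    gy, gx = np.gradient(image.astype(np.float64))  # axis 0 = rows, axis 1 = cols
    nx, ny = -gy, gx                                # rotate (gx, gy) by 90 degrees
    norm = np.sqrt(nx ** 2 + ny ** 2) + 1e-8        # avoid division by zero
    return nx / norm, ny / norm
```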
Having performed these steps, the initial algorithm from [
11] is just a succession of anisotropic filtering followed by inpainting, repeated several times. From a mathematical point of view, based on [
11], the intention is to achieve the following:
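(The update rule is reconstructed here from the description in [11]; the notation may differ slightly from the original paper.)

$$I^{n+1}(i,j) = I^{n}(i,j) + \alpha\, I_t^{n}(i,j), \qquad I_t^{n}(i,j) = \nabla L^{n}(i,j) \cdot \vec{N}^{n}(i,j),$$

where $L^{n} = \Delta I^{n}$ is the smoothness estimator (the Laplacian), $\vec{N}$ is the normalized isophote direction obtained by rotating the gradient by 90 degrees, and $\alpha$ is the update rate.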
In the original implementation the authors made some assumptions, choosing α to be 0.1, and as the smoothness estimator they used the Laplacian. An interesting part is the choice of the N vector. In their original paper it is suggested that this vector has to be computed each time, based on the current rotated gradient of the current block to be inpainted. The problem with this vector is that it has to be recomputed at each iteration, because with each iteration new information arises in the area to be reconstructed. Additionally, they also apply an anisotropic diffusion step every few iterations, intending not to lose too much sharpness. From a forensic point of view, this is a very important step, because it does not keep the same level of blur between the original area and the reconstructed area. Later on, the authors in [
12] proposed an improved version of their initial algorithm. The idea was inspired by the mathematical equations of fluid dynamics, specifically the Navier-Stokes equations, which describe the motion of fluids. The proposal was to use the continuity and momentum equations of fluid dynamics to propagate information from known areas of the image or video towards the missing or corrupted areas. This was essentially an improved, higher-order PDE version of the approach presented initially. As a follow-up to his original work, Bertalmio proposed in [
13] the use of 3rd-order PDEs, which provide a better continuation of edges.
The algorithm starts by defining an initial velocity field that guides the propagation of information. This velocity field is then iteratively updated via the Navier-Stokes equations, with the known image used as boundary condition. The result is obtained by advecting the original image along the final velocity field. The algorithm seems to perform slightly better than the initial paper in situations where the missing data is large or the structure of the image is complex.
At the same time, Chan & Shen developed similar algorithms [
14], [
15] in which they postulated the use of the local curvature of an image to guide the reconstruction of missing or obscured parts. Using Euler’s Elastica model, they can predict what the missing parts of the image might look like. Both Euler’s Elastica and PDE-based inpainting are effective methods for image inpainting, each with its own advantages and disadvantages. Euler’s Elastica is particularly well suited for images that contain thin, flexible objects, while PDE-based inpainting is well suited for images that are smooth and locally consistent. Depending on the specific characteristics of the image and the desired outcome, one method may be more appropriate than the other.
Based on the work described above, many methods continue in the same direction, trying to map real physical processes into the inpainting process (diffusion, fluid dynamics, osmosis). For e.g. in [
16] the authors proposed curvature-preserving PDEs. Their tensor PDE is used for regularizing images while taking into account the curvatures of specific integral curves. In this way, they better estimate the shape of the inpainted data, thus reducing the blurring effect in the resulting image. Another variant with very good results, implemented in the OpenCV computer vision library, is presented in [17]. The authors present a fast marching technique that estimates the missing pixels in one pass using weighted means of already computed pixels. The algorithm is suboptimal compared to other inpainting algorithms, but it gains strength in its speed compared, for example, to Bertalmio’s approach, in which several iterations were needed and the result was affected by the number of iterations.
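Both of these classical diffusion-style algorithms are exposed through OpenCV’s inpaint function; a minimal usage sketch looks as follows (the file names and the inpainting radius are placeholder choices):

```python
import cv2

img = cv2.imread("input.png")
# Mask: non-zero pixels mark the region to be inpainted (8-bit, single channel)
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Navier-Stokes based method of Bertalmio et al.
result_ns = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_NS)
# Fast marching method of Telea
result_telea = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```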
In recent years, the focus for diffusion-based inpainting has moved towards more and more complex PDE forms. For example, [18] suggests the use of high-order variational models, like low-curvature image simplifiers or the Cahn-Hilliard equation. Another recent paper going in the same direction is [19], which integrates the geometric features of the image, namely the Gauss curvature. Still, even these methods introduce the blurring artifact also found in the initial papers [11], [12]. To surpass the challenges of the current models – second-order diffusion-based models are prone to staircase effects and connectivity issues, while fourth-order models tend to exhibit speckle artifacts – a newer set of models had to be developed. The authors Sridevi & Srinivas Kumar proposed several robust image inpainting models that employ fractional-order nonlinear diffusion steered by difference curvature in [20], [21], [22]. In their most recent paper [23], a fractional-order variational model is added to mitigate noise and blur effectively. In essence, a variation of the DFT is used to consider pixel values from the whole image, rather than relying strictly on the neighboring pixels.
To summarize, diffusion inpainting methods usually rely on 2nd- or higher-order partial derivatives, or on the total variation of energy, in order to “guess” the missing area. One of the major drawbacks of these methods is that, either locally or globally, some sort of anisotropic diffusion with a blurring effect is introduced, which in turn affects the entire image. Due to this blurring nature, image inpainting via PDEs can, in theory, be detected via some sort of inconsistency in the blur of various regions.
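To illustrate this forensic cue (a sketch of our own, not a published detector), one can map the local variance of the Laplacian response: blocks filled by diffusion tend to be suspiciously smooth compared with their surroundings:

```python
import cv2
import numpy as np

def local_blur_map(gray: np.ndarray, block: int = 32) -> np.ndarray:
    """Per-block variance of the Laplacian; low values flag overly smooth regions."""
    lap = cv2.Laplacian(gray.astype(np.float64), cv2.CV_64F)
    h, w = gray.shape
    rows, cols = h // block, w // block
    blur_map = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = lap[r * block:(r + 1) * block, c * block:(c + 1) * block]
            blur_map[r, c] = patch.var()  # low variance = suspiciously smooth
    return blur_map
```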
2.2. Exemplar based methods
At approximately the same time, a newer approach based on texture synthesis started to gain momentum. The main inspiration came from [24], where A. A. Efros and T. K. Leung introduced a non-parametric method for texture synthesis: the algorithm generates new texture images by sampling and matching pixels from a given input texture, based on their neighborhood pixels, thus effectively synthesizing textures that closely resemble the input sample. This approach was notable for its simplicity and ability to produce high-quality results, making it a foundational work in the area of texture synthesis. The primary goal of these approaches was to enhance the reconstruction of the missing image area. However, the challenges brought by texture synthesis are slightly different from those presented by classical image inpainting. The fundamental objective of texture synthesis is to generate a larger texture that closely resembles a given sample in terms of visual appearance, a challenge commonly referred to as sample-based texture synthesis. A considerable amount of research has been conducted in this field, employing strategies such as local region growing or holistic optimization. Probably the paper that gained the most attention was the work of [
25]. In this paper, Criminisi presented a novel algorithm for the removal of large objects from digital images. The technique is known as Exemplar-Based Image Inpainting. The method is based on the idea of priority computation for the fill front, and best exemplar selection for texture synthesis. Given a target region
Ω to be inpainted, the algorithm determines the fill order based on the priority function
P(p), defined for each pixel p on the fill front
∂Ω: P(p) = C(p) · D(p), where
C(p) is the confidence term, an indication of the amount of reliable information around pixel p.
D(p) is the data term, a measure of the strength of isophotes hitting the front at p. The algorithm proceeds in a greedy manner, filling in the region of highest priority first with the best match from the source region
Φ. This is identified using the Sum of Squared Differences (
SSD) between patches. The novel aspect of the method is that it combines the structure propagation and texture synthesis into one framework, aiming to preserve the structure of the image, while simultaneously considering the texture. It’s been demonstrated to outperform traditional texture synthesis methods in many complex scenes and it has been influential in the field of image processing.
Figure 4.
Criminisi algorithm [25].
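To make the priority computation concrete, a minimal sketch of P(p) = C(p) · D(p) for a single fill-front pixel might look as follows (the patch half-size, the simplified confidence term, and all names are our own illustrative choices):

```python
import numpy as np

def priority(C, grad_x, grad_y, normal, p, half=4, alpha=255.0):
    """P(p) = C(p) * D(p) for a fill-front pixel p = (row, col).

    C      : confidence map (1 on known pixels, 0 inside the hole,
             updated as the fill proceeds)
    grad_* : image gradients; normal: unit normal to the fill front at p
    """
    r, c = p
    patch = C[r - half:r + half + 1, c - half:c + half + 1]
    C_p = patch.mean()                            # confidence term (simplified mean)
    iso = np.array([-grad_y[r, c], grad_x[r, c]])  # isophote = rotated gradient
    D_p = abs(iso @ normal) / alpha                # data term (alpha = 255 for 8-bit)
    return C_p * D_p
```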
Based on these seminal works of Leung and Criminisi, the area started to be researched more and more, and various directions were investigated for further improvements: the order of patch processing, faster ways of computing the best patch, applying various operations to the best patch found so as not to disturb higher-order statistics of the image, multiscale approaches and global constraints, and even finding the best way of dealing with distances between patches.
Several methods are available for calculating the resemblance between patches of images. The most commonly employed metrics can be grouped into two categories: pixel-based metrics, which gauge similarity based on the difference or cross-correlation among pixel color values, and statistics-based metrics, which estimate the similarity between the probability distributions of pixel color values in patches. The first category includes metrics such as the sum of squared differences (SSD), the Lp norm, and normalized cross-correlation. The second category features statistics-based metrics like the Bhattacharyya distance, normalized mutual information (NMI), and Kullback-Leibler divergence. The SSD is the most frequently used metric when searching for similar patches. One key aspect in the design of an inpainting forensic tool is that SSD tends to favor uniform regions, meaning it prefers copying pixels from those areas. To address this bias, a weighted Bhattacharyya distance, denoted d(SSD,BC), has been proposed. Nonetheless, when two patches have identical distributions, their Bhattacharyya distance (dBC) is zero, implying that the weighted Bhattacharyya distance is also zero, even if one patch is a geometrically modified version of the other.
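The two families of metrics can be sketched as follows (our own illustration; the weighted variant is implemented here as the product of the two distances, which reproduces the pitfall described above):

```python
import numpy as np

def ssd(p, q):
    """Pixel-based: sum of squared differences between two patches."""
    return float(np.sum((p.astype(np.float64) - q.astype(np.float64)) ** 2))

def d_bc(p, q, bins=32):
    """Statistics-based: Bhattacharyya distance between patch histograms."""
    hp, _ = np.histogram(p, bins=bins, range=(0, 256))
    hq, _ = np.histogram(q, bins=bins, range=(0, 256))
    hp = hp / (hp.sum() + 1e-12)
    hq = hq / (hq.sum() + 1e-12)
    bc = np.sum(np.sqrt(hp * hq))   # Bhattacharyya coefficient, 1 for identical
    return -np.log(bc + 1e-12)      # 0 when the distributions match exactly

def d_ssd_bc(p, q):
    """Weighted variant: zero whenever the distributions match exactly,
    even if one patch is a geometric transform of the other."""
    return d_bc(p, q) * ssd(p, q)
```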
Another area of improvement was how fast and how optimally the best patch is found. For this reason, some methods start by identifying the K-nearest neighbors (K-NNs) within the known sections of the image. One simple approach to the nearest neighbor search problem is to calculate the distance from the target patch to all potential patches, regarding each patch as a point in a multi-dimensional space. More efficient, approximate nearest neighbor search strategies are available, which organize the candidates using space-partitioning data structures such as k-dimensional trees (kd-trees) or vantage point trees (vp-trees), guided by their spread in the search space. The nearest neighbor search can then be performed effectively by using the properties of these trees to quickly discard vast sections of the search space, leaving only a minor portion of candidates for verification. Matching based on kd-trees is one of the most used methods for identifying the nearest patch; however, the number of nodes examined expands exponentially with the dimension of the space, so when the dimension is large the search slows down. A variety of nearest neighbor search algorithms are evaluated in a separate study [
26] to determine their effectiveness in locating similar patches within images. A big improvement in this area was incorporated into the Photoshop tool in recent years; the idea is based on the PatchMatch algorithm [
27]. Tree-based approximate nearest neighbor (ANN) search methods treat each query individually. PatchMatch, a randomized patch search algorithm, instead takes advantage of the relationships between queries to facilitate collaborative searching. The method operates on the presumption that images maintain coherency: once a similar pair of patches in two images is identified, their adjacent patches (those offset by a few pixels) are likely also similar. Consequently, the match result of a specific patch can be transferred to nearby queries, providing an advantageous initial guess that can then be updated with randomly selected candidates. PatchMatch is a fast algorithm for computing dense, approximate nearest neighbor correspondences between patches of two image areas, with these correspondences collectively referred to as the nearest neighbor field (NNF). The NNF is initially assigned either random values or prior information, with random guesses likely offering only a few beneficial matches. The NNF is then continually refined by alternating between two operations, propagation and random search, carried out at the patch level. The propagation step updates a patch offset using known offsets from its causal neighborhood, leveraging image coherency: during even iterations, offsets are propagated from the top and left patches, while during odd iterations they are propagated from the right and bottom patches. The random search step samples candidates around the current best match within windows of exponentially shrinking radius. Although the algorithm is significantly faster than kd-trees, it offers less accuracy and can get stuck in a local optimum due to the limited distance of propagation.
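A highly condensed sketch of one PatchMatch sweep, following the description above (the reference implementation differs in many details; names and simplifications are ours, and A and B are assumed to be same-sized float grayscale arrays):

```python
import numpy as np

def patch_dist(A, B, ay, ax, by, bx, half):
    """SSD between the patch centered at (ay, ax) in A and (by, bx) in B."""
    pa = A[ay - half:ay + half + 1, ax - half:ax + half + 1]
    pb = B[by - half:by + half + 1, bx - half:bx + half + 1]
    return float(np.sum((pa - pb) ** 2))

def patchmatch_iteration(A, B, nnf, dist, half, odd=False):
    """One propagation + random-search sweep over the NNF.

    nnf[y, x] = (by, bx) is the current best match in B for the patch
    centered at (y, x) in A; dist holds the corresponding SSD values.
    Initialize nnf with random in-bounds coordinates and dist with np.inf.
    """
    h, w = A.shape
    step = -1 if odd else 1  # scan order and neighbor direction flip on odd sweeps
    ys = range(h - half - 1, half - 1, -1) if odd else range(half, h - half)
    xs = range(w - half - 1, half - 1, -1) if odd else range(half, w - half)
    for y in ys:
        for x in xs:
            # Propagation: reuse the (shifted) offsets of the causal neighbors
            for dy, dx in ((0, -step), (-step, 0)):
                ny, nx = y + dy, x + dx
                if half <= ny < h - half and half <= nx < w - half:
                    by, bx = nnf[ny, nx, 0] - dy, nnf[ny, nx, 1] - dx
                    if half <= by < h - half and half <= bx < w - half:
                        d = patch_dist(A, B, y, x, by, bx, half)
                        if d < dist[y, x]:
                            nnf[y, x], dist[y, x] = (by, bx), d
            # Random search in exponentially shrinking windows
            radius = max(h, w)
            while radius >= 1:
                by = int(np.clip(nnf[y, x, 0] + np.random.randint(-radius, radius + 1),
                                 half, h - half - 1))
                bx = int(np.clip(nnf[y, x, 1] + np.random.randint(-radius, radius + 1),
                                 half, w - half - 1))
                d = patch_dist(A, B, y, x, by, bx, half)
                if d < dist[y, x]:
                    nnf[y, x], dist[y, x] = (by, bx), d
                radius //= 2
```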
In recent years, the methods have become more and more complex, trying to exploit various artifacts inside the image and analyzing in more depth the structure near the area to be inpainted. Other approaches, like [
28], utilize a patch-based approach that searches for well-matched patches in the texture component using a Markov random field (MRF). Jin and Ye [
29] proposed an alternative patch-based method that incorporates an annihilation property filter and a low rank structured matrix. Their approach aims to remove an object from an image by selecting the target object and restricting the search process to the surrounding background. Additionally, Kawai [
30] presented an approach for object removal in images by employing a target object selection technique and confining the search area to the background. Authors have also explored patch-based methods for recovering corrupted blocks in images using two-stage low rank approximation [
31] and gradient-based low rank approximation [
32]. Another sub-area of focus for some authors was to represent the information by first “translating” the image into another domain, a so-called sparse representation like DCT, DFT, DWT, etc. Here we just want to mention a few interesting research papers [33], [34]. They obtained good quality while the area to be inpainted is rather uniform, but if the area lies at the edge of several different textures, the methods introduce some rather ugly artifacts that make them unusable.
Various authors [7], [8]–[10], [35], [36] have suggested that another classification of the inpainting procedure can be made. Either they suggest adding a sub-division based on sparse representations of images (as suggested in [33], [37]–[39]), on which existing algorithms are then applied, or a so-called mixed/fusion mode, in which ideas from both worlds are incorporated: diffusion based and texture synthesis (patch copying). In this latter category we can name a few interesting ideas, like the one Bertalmio explored in his study [40], in which a PDE-based solution is combined with patch synthesis and a coherence map; the resulting energy function is a combination of the three metrics. Similar to Bertalmio’s above-mentioned work is the research of Aujol, Ladjal and Masnou in their article [41], in which exemplar-based methods are used to reconstruct local features like edges. Another investigation along the same idea was the work of Wallace Casaca, Maurílio Boaventura, Marcos Proença de Almeida and Luis Gustavo Nonato in [42], in which anisotropic diffusion is combined with a transport equation to produce better results. Their suggested approach of using a cartoon-driven filling sequence has proven to be highly effective for image inpainting, using both PSNR and speed as metrics.
If we were to summarize the exemplar (or patch based) inpainting methods, they perform three steps:
find the best order in which to fill the missing area
find the best patch that approximates that area
apply, if needed, some processing on the copied patch to ensure that both local and global characteristics are maintained
Now, if we look at the artifacts these methods introduce, we can easily categorize them into two groups: methods that simply “copy-paste” unconnected/unrelated regions (patches) into the missing area, and methods that perform some enhancement/adaptation of the patch values. The first category is straightforward to detect via a forensic algorithm: we rely solely on the fact that a given region (usually several times larger than the patch used for inpainting) is composed of patches that have “references” in other areas. The main problems are how to determine the correct patch size for detecting “copied” content, speed, and, last but not least, how to correctly eliminate false positives (especially when the patch size is small and the window step is small as well). The second category of patch-based inpainting methods – those that do not simply copy the source patch to the destination – is a little harder to detect. The above algorithm, where we search for similar (identical) patches, can no longer be applied; we must also introduce some heuristic and probably evaluate parameters for how much the patches resemble each other. Lastly, the false-positive problem with these approaches increases exponentially, because we are no longer finding identical patches but nearly identical ones, and on images with large smooth textures we might end up with a lot of false positives.
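The first category of detectors can be sketched very simply (our own naive illustration): hash every block and flag contents that appear at several positions; near-identical matching, as discussed, would require a tolerance and is exactly where false positives explode:

```python
import numpy as np
from collections import defaultdict

def find_duplicated_blocks(gray: np.ndarray, block: int = 8, step: int = 4):
    """Group block positions by exact content; multiple positions = suspicious."""
    seen = defaultdict(list)
    h, w = gray.shape
    for y in range(0, h - block + 1, step):
        for x in range(0, w - block + 1, step):
            key = gray[y:y + block, x:x + block].tobytes()  # exact-match hash
            seen[key].append((y, x))
    # Keep only blocks whose content appears at several distinct positions
    return [locs for locs in seen.values() if len(locs) > 1]
```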
2.3. Machine learning based methods
Starting around 2013, Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) emerged as state-of-the-art methods for image inpainting. Usually, these methods are employed as feature extraction tools through convolution, enabling the capture of abstract representations. The combination of CNNs with adversarial training, as proposed by Goodfellow in his 2014 paper, has generated some impressive results in inpainting tasks, achieving perceptual similarity to the original image. The integration of CNNs with GANs has proven advantageous, as CNNs provide an encoder for high-dimensional abstraction extraction, while GANs enhance image sharpness and color synthesis. In the area of machine learning based methods, the seminal approach, attributed to Deepak Pathak in his paper [43], suggested the use of a system composed of an encoder and a decoder. The encoder focuses on extracting and retaining data information, while the decoder’s responsibility is to generate features based on the encoder’s learned data. Starting from this approach, several methods have been suggested, initially FCN-like architectures, in which two neural networks are connected by skip connections. An improvement over this FCN approach was the U-Net architecture, which resembles FCN but, where FCN uses summation in its skip connections, U-Net employs concatenation; the advantage of concatenation is that it retains more detailed data. One of the open problems in inpainting was how to generate the missing area in highly textured regions; to address this challenge, some authors proposed mechanisms exploiting both global and local texture information.

Another point of variation between inpainting methods is the convolutions used. We dwell on this point because, from the perspective of detection, the convolution applied at the decoder level (deconvolution or, as some authors call it, the transposed convolutional layer) is the one responsible for introducing various types of artifacts. From our analysis, the most used convolutions are: simple (or standard) convolution, which is good for reconstruction, especially of rectangular shapes, and gated convolution, whose main idea is to fill in irregular holes by using a mask that is updated after each convolution. For a detailed review of machine learning based techniques, we recommend [9] or [44], which give a very good overview of the current state-of-the-art methods. Another very recent and good overview of machine learning inpainting methods is the research done in [38].

Below we present an updated version (only new additions) of the summary presented in the above papers, focusing on improvements over those reviews and also analyzing the possible artifacts introduced by each method. The first category of research with relevant good results is [
45], [
46], [
47]. Mainly the methods rely on Fourier convolutions and perceptual loss. Their results are pretty impressive on the CelebA ([
48]) and Places ([49]) datasets. An improvement on the Lama model was presented by the authors in [
50] with respect to the learned perceptual image patch similarity (LPIPS) metric. Their method starts from a noisy image and applies denoising, filling in the missing data based on the known area. In [51] the authors suggested another approach: they apply the already established classical inpainting methods from OpenCV (the Telea and NS methods) and use a CNN model to learn the reconstruction of these features. As a backbone they use a VGG16 model, and as features the authors used three distinct traits: image size, RGB information, and brightness levels. The results are good when the area to be inpainted is uniform (e.g., the middle of the sky), but when the region to be generated is at the border of several highly textured areas, the method does not yield proper results. Recently, in [
52] the authors suggest that the task of inpainting should be divided into two separate stages. In the first stage, they use two separate DRGAN modules – one to roughly generate the content of the inpainted area, and one to roughly generate the edges of the missing area; in fact, they generate a label image where 1 marks an edge and 0 represents the background. This information is crucial in the second stage of the algorithm, where the authors use a fine-grained network to generate finer pixel information based on the label edges and the roughly generated data. Again, they use a DRGAN (deep residual generative adversarial network) architecture for this part of the method. Analyzing the results and the comparison with state-of-the-art methods, the proposed method seems able to reconstruct highly textured areas, but it has some limitations in what the authors call “overly complex” regions. The authors of [
53] have diverged from the mainstream usage of transformers and have incorporated a discrete wavelet transform alongside the convolutional layers. Still, for upsampling the authors use the standard transposed convolution, which generates checkerboard artifacts. Another approach with very good results is the work [
54] in which the authors combine autoencoders and transformers on a ResNet architecture. The idea of using transformers is that they are better able to represent details and thus to reconstruct the missing area; still, the authors use the same type of architecture (ResNet), which employs the same type of upsampler.
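Since gated convolution is singled out above as a decoder-level design choice, a minimal PyTorch sketch of the idea (our illustration, not any specific paper’s code) is:

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: a learned soft mask modulates the features,
    which lets the network handle irregular holes."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):
        # Sigmoid gate in [0, 1] acts as a learned, per-pixel soft mask
        return torch.sigmoid(self.gate(x)) * torch.relu(self.feature(x))
```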
From a detection point of view, these methods are becoming more and more challenging due to their ability to generate content that is indistinguishable from the rest of the image. Also, due to their capacity to complete large areas, they are able to reproduce the characteristics of the entire image. As attack vectors, various methods have been proposed, mainly focusing on the artifacts introduced by the various upsampling steps.
5. Results and Discussion
In the following section, an analysis is performed on the results obtained from various forgery detection methods applied to various image inpainting methods. The first focus is the dataset. There are some generic forgery datasets, but they lack specificity (e.g., CASIA, MFC, etc.) – they are not focused 100% on image inpainting / object removal. Also, from previous analysis (see the Dataset chapter), it was observed that, in general, each forgery detection method comes with its own dataset. With each new dataset, the authors take some generic input dataset (e.g., MS COCO), apply a different inpainting method, and then analyze their method in this context. By not being backwards compatible, each new forgery method is not properly compared with previous ones – especially if we talk about machine learning based methods. Also, the way the inpainting masks are generated – usually a randomly taken region from the input image – focuses the network model on detecting inconsistencies which the inpainting method cannot overcome. To address this, we have used Google’s Open Images Dataset V7, released in October 2022 [120]. We manually selected 400 images with additional segmentation masks – we used only one mask per image. The selected segmentation masks were chosen not to lie in a highly textured area; we made this limitation because almost all inpainting methods have problems filling highly textured areas. Additionally, because the forgery methods do not work particularly well on big images, we imposed that the selected images be at most 1024x1024 in size. Another relevant aspect is that, for the masks provided in Google’s dataset, we noticed that some were very tight around the object borders, so in order to help the inpainting methods, we added a dilation with a 5x5 kernel.
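The mask preparation step can be reproduced with OpenCV as follows (the file name is a placeholder):

```python
import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((5, 5), np.uint8)          # 5x5 structuring element
dilated = cv2.dilate(mask, kernel, iterations=1)
```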
Our next focus is generating the dataset based on various inpainting methods. Our work is inspired by [85], in which the authors select several inpainting methods and propose a forgery mechanism to learn the intricacies of all those inpainting methods. The difference is that we did not generate a random mask; rather, we took a valid object from the image, and we applied the same mask to several inpainting methods. In this way, for a given image/mask pair with several inpainted outputs, we can thoroughly evaluate how each forgery method is able to determine the traces. We used 5 different inpainting methods: Criminisi’s original method, an improved version of the PatchMatch algorithm [121], a professional editing tool – GIMP [122] – and two machine learning based methods – Lama [45] and a newer, improved version called MAT: Mask-Aware Transformer for Large Hole Image Inpainting [123].
Next, we selected 6 different forgery detection mechanisms to evaluate on the above dataset. The first pick was the CMFD method proposed in [62]. Although CMFD focuses on generic copy-move, we wanted to check how well the method is able to detect inpainted areas. As parameters, we used Zernike moments with a block size of 13, based on the authors’ suggestions. The next item on our list was a classical object removal detection; here we picked the original paper [63]. We selected this one and not a newer variant – like [65] – because the newer variants improve only speed and accuracy, eliminating a lot of false positives. The next four methods we picked were machine learning based. The first on the list was the Mantranet method [91], as it was the basis for numerous newer methods. Based on the authors’ claims, the method should also work for detecting object removal, because it was trained with images generated by the OpenCV inpainting methods, and additionally the authors used more than 300 different classes – like blur inconsistencies, Gaussian inconsistencies, etc. – to train the network. The next method, based on Mantranet, was the IID network [85], which focuses solely on inpainting detection. Additionally, we included two newer methods – Focal [124], which adds a clustering method to be able to differentiate between forged and non-forged areas in the image, and PSCC-Net [89], which encompasses a Spatio-Channel Correlation Module to be able to focus on various traces. All the machine learning based detection tests were run on a machine with an NVIDIA Quadro RTX 8000 video card.
Figure 6.
Original and mask image from dataset.
Figure 7.
Inpainted results: a – Criminisi [25], b – Gimp [122], c – NonLocalPatch [121], d – Lama [45], e – Mat [123].
As can be noticed in the above figure, the first result [25] contains some visible artifacts, while the others complete the image in a very natural way. In [122], by duplicating a section with lower luminosity and afterwards contrasting it with the overall context of the image, it becomes possible to ascertain that the area has been replicated from a neighboring location, so a simple block-lookup comparison finds the similar regions. [121] duplicated and interpolated the nearby regions more smoothly, but it introduced a blurring artifact in the region of the removed object, as can be noticed in the highlighted area of the figure below.
Figure 8.
Sample artifact introduced by [121].
Next, we analyze how each of the six algorithms – DEBI [63], CMFD [62], IID [85], Mantranet [91], PSCC-NET [89] and Focal [124] – performs.
Figure 9.
DEBI [63] results on inpainted images with: a – Criminisi [25], b – Gimp [122], c – NonLocalPatch [121], d – Lama [45], e – Mat [123].
Because DEBI [63] uses the idea of block comparison, we can easily see that the method is able to detect some clues in the images produced by Criminisi [25], Gimp [122] and NonLocalPatch [121]. The results indicate that some regions are tampered, because all three inpainting methods work rather similarly, copying various patches from different regions. An interesting observation is that, except for the Criminisi method, all methods affect the overall pixel intensity, not just the targeted masked area. That is why the [63] method is able to detect different regions – even some false positives, as can be noticed in picture c of the above figure. Somewhat expected are the results for the machine learning inpainting methods: because these methods do not copy a patch but rather synthesize content, the block-based approach is not able to detect any similar blocks. A possible solution we analyzed was to search for similar blocks within a given delta of pixel differences, but by doing so we received a lot of false positives. We obtained similar results with the CMFD framework, because the detection method is rather identical – both work by comparing blocks and applying some filtering logic on similar blocks.
The next analyzed method was Mantranet [91]. As can be noticed from the figure below, the detection works well on classical inpainting methods, but rather poorly on machine learning based inpainting methods. Another relevant item, especially in the computation of the F1 score, precision, recall, etc., is the fact that [85], [89] and [91] output an image with gray-level intensities. This means that if a region is perfectly white, there is high confidence that the pixels are tampered with, while lower pixel values indicate lower confidence. In the evaluation metric measurements, we shall analyze different thresholds for these pixel intensities and see how they affect the overall results.
Figure 10.
Mantranet [91] results on inpainted images with: a – Criminisi [25], b – Gimp [122], c – NonLocalPatch [121], d – Lama [45], e – Mat [123].
For IID [85], the results for this image on all 5 different inpainted versions were very promising, with the small observation that for [121] it detected only the surrounding area.
Figure 11.
IID [85] results on inpainted images with: a – Criminisi [25], b – Gimp [122], c – NonLocalPatch [121], d – Lama [45], e – Mat [123].
For PSCC-NET [89], the results on the above image were rather poor; on the other hand, we observed interesting results with the Focal [124] method. Focal is able to successfully detect forged areas for the block-based inpainting methods. It behaves rather strangely on the machine learning based methods, where it incorrectly detected artifacts related to an object that was not removed but was altered by the inpainting method.
Figure 12.
Focal [124] results on inpainted images with: a – Criminisi [25], b – Gimp [122], c – NonLocalPatch [121], d – Lama [45], e – Mat [123].
Following this, a comprehensive study has been undertaken to evaluate the results quantitatively. Initially, an evaluation was conducted to assess the performance of the two block-based detection techniques. Upon analyzing the F1 score, precision, recall, and intersection over union (IoU), it becomes evident that these approaches yield unsatisfactory outcomes. The CMFD method yielded superior outcomes in comparison to the DEBI method; it is likely that implementing the enhancements discussed in [65] would improve the overall performance of DEBI. It is important to acknowledge that the existing methodology (DEBI+CMFD) lacks the capability to detect regions that have experienced indirect replication; this is the reason why we have not shown any results for either DEBI or CMFD in relation to machine learning based inpainting methods. In summary, the evaluation of the NonLocalPatch inpainting approach using both detection methods suggests a modest level of performance on the metrics employed. Although the current system demonstrates some accurate detections, there is considerable scope for enhancement, particularly in relation to the bounding-box overlap (IoU) and the reduction of erroneous detections (precision). The poor results on images which underwent an inpainting process via the Criminisi method might be explained by the fact that the areas to be inpainted contained some shadow elements, and also by the resolution of the test images being quite large compared with the original paper (1024x1024 vs 256x256). Based on the measurements acquired, it is evident that the DEBI and CMFD detection algorithms encounter significant challenges when attempting to detect forgeries produced by the Criminisi and GIMP inpainting approaches. The methods demonstrate a notable inadequacy in forgery detection, failing to accurately identify a considerable proportion of the forged region (low recall). Moreover, despite their ability to identify potentially problematic areas, localization is often compromised, leading to a diminished Intersection over Union (IoU) score. The precision metric suggests that around one-third of the detections are accurate. Nevertheless, careful examination of the subpar recall and IoU scores makes it apparent that improvements are necessary across all facets of these methodologies.
Figure 13.
Evaluation metrics for the results of the DEBI detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch.
Figure 14.
Evaluation metrics for the results of the CMFD detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch.
The method proposed in [124] demonstrates superior efficacy in detecting the inpainting produced by [121] in comparison to the rest of the methods. In the case of the block-based approaches, namely Criminisi and Gimp, the detection model exhibits a reasonable level of accuracy in identifying inpainted regions, approximately 60%. However, upon closer examination of the Intersection over Union (IoU) metric, which measures the overlap between the predicted and ground-truth regions, it becomes evident that the system erroneously identifies additional sections as manipulated, resulting in an IoU of approximately 30%. The Focal method applied to the NonLocalPatch images shows a commendable level of precision in its detections and effectively captures a substantial proportion of the forged regions. The marginally reduced F1 score indicates a small imbalance, albeit without major divergence. The IoU score indicates that the model’s localization accuracy is satisfactory, though there is room for improvement. It is possible that the accuracy was affected by a poor selection of the inpainting mask; the discussion of masks in the dataset section may provide insights into this matter. However, the performance indicators for Focal reveal subpar outcomes for the Lama and Mat inpainting approaches and warrant a closer review.
Figure 15.
Evaluation metrics for the results of the Focal detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch, Lama and Mat.
The methodologies described in [85], [89], and [91] do not produce a binary mask that can be used directly to identify counterfeit regions; instead, they output a heat map. Pixels with higher values, represented as white pixels, imply that the approach has a higher degree of confidence in identifying modified pixels. Moreover, the methodology outlined in [89] applies a softmax function over the entire output to augment the response with a binary categorization of the image as either counterfeit or genuine. In order to adequately evaluate the three procedures, we analyzed the test outcomes using three separate thresholds. A set of three sample values (20, 70, 127) was chosen as thresholds for identifying forged pixels: pixels that surpass the threshold are categorized as indicative of forging, while those below are classified as genuine. For example, when the threshold for [85] is set to 127 instead of 20, precision increases by 3%, but IoU decreases by around 4%. A similar trend is observed with the Mantranet approach: with a threshold of 127 compared to 20, there is a 12% gain in precision but a decrease of approximately 5% in IoU.
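The thresholding and metric computation can be sketched as follows (illustrative; the exact evaluation pipeline involves more bookkeeping, and the demo arrays below are synthetic):

```python
import numpy as np

def metrics(heatmap, gt_mask, threshold):
    """Binarize a gray-level confidence map and compute pixel-level scores."""
    pred = heatmap >= threshold
    gt = gt_mask > 0
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    iou = tp / (tp + fp + fn + 1e-12)
    return precision, recall, f1, iou

# Demo with synthetic data; in the experiments, heatmap is the detector output
rng = np.random.default_rng(0)
heatmap = rng.integers(0, 256, (256, 256)).astype(np.uint8)
gt_mask = np.zeros((256, 256), np.uint8)
gt_mask[64:128, 64:128] = 255
for t in (20, 70, 127):  # the three thresholds evaluated above
    print(t, metrics(heatmap, gt_mask, t))
```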
Among the machine learning based detection approaches, the utilization of [85] has demonstrated superior outcomes, including on images inpainted with [123].
Figure 16.
Evaluation metrics for the results of the IID-NET detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch, Lama and Mat.
One interesting characteristic of the IID / Mantranet / PSCC-NET methods is a high recall rate accompanied by relatively lower values of precision, F1 score, and IoU (see the diagrams above and below). This pattern suggests that the models have an excessive tendency to flag detections, successfully capturing a majority of genuinely forged pixels, but also incorrectly flagging many untouched ones. Furthermore, even when the models’ identifications are accurate, their ability to precisely localize objects may be compromised. This tendency may pose challenges in several situations, particularly when accuracy is of utmost importance. Additionally, the models could likely benefit from further optimization techniques to enhance precision without substantial trade-offs in recall.
Figure 17.
Evaluation metrics for the results of the Mantranet detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch, Lama and Mat.
Figure 18.
Evaluation metrics for the results of the PSCC-NET detection method applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch, Lama and Mat.
Based on the summary of the results presented below, it is observed that Focal [124] performs well for the NonLocalPatch inpainting method, Mantranet [91] is able to detect the older variants of patch-based methods, and IID performs best on the machine learning based inpainting methods.
Figure 19.
Summary evaluation metrics for the results of the IID-NET, Focal, Mantranet and PSCC-NET detection methods applied on the inpainted Open Images Dataset V7 with the following inpainting methods: Criminisi, Gimp, NonLocalPatch, Lama and Mat.
The domain of picture and video inpainting has witnessed significant advancements in recent years. The use of exemplar-based techniques has been a particularly intriguing feature of this evolution: through these methodologies, it has become feasible to reconstruct substantial segments of impaired or absent regions within a picture or video. At its inception, the practice of inpainting was primarily limited to modest modifications, such as repairing minor scratches or concealing sensor imperfections. At present, however, it has the capacity to address significantly more intricate difficulties, including the removal of considerable objects. The applied strategies can be categorized into two primary groups: those that rely on partial differential equations, and patch-based methods, sometimes referred to as exemplar-based methods. Contemporary photo editing software, designed for both professional and novice users, frequently incorporates sophisticated inpainting techniques. Exemplar-based inpainting can be understood as a technologically advanced, automated form of copy-move manipulation: segments are extracted from different regions of the image or video and adeptly merged to yield enhanced visual outcomes. The blending operation is a vital element of this process, since it must be executed flawlessly in order to guarantee the cohesiveness of the finished product. In the field of Convolutional Neural Networks, notable progress has been made in inpainting approaches, which have demonstrated performance superior to that of skilled human editors. This is particularly evident when employing content-aware filling methodologies. The capacity of CNNs to construct a coherent visual result from limited data has showcased their substantial potential in this domain.
Nevertheless, the task of detecting these inpainted alterations poses a significant problem, as conventional copy-move detection methods frequently have limited effectiveness in this context. There are various explanations for this. Firstly, the target area under consideration may be too minuscule to be effectively detected. Secondly, the modified regions may closely resemble pre-existing areas within the original image. Lastly, the inpainted areas could potentially consist of multiple distinct regions. In response to these constraints, a number of authors have proposed automated techniques for the detection of inpainting forgeries. These methodologies, akin to the ones employed in the detection of copy-move forgery, exploit visually identical picture patches to emphasize locations that raise suspicion. In addition, heuristic criteria are utilized in order to minimize the occurrence of false alarms. The heuristic rules exhibit a range of characteristics, since different authors employ various approaches, including the utilization of fuzzy logic, to address this issue. Efforts have also been made to exclude regions that lack indications of amalgamation from several image areas. However, the conventional approaches possess inherent constraints: substantial computational resources and effort are frequently necessary to enhance detection accuracy by reducing patch sizes, and these methods encounter challenges in effectively mitigating false positives. In recent times, machine learning approaches have been introduced to analyze disparities among various patches. Noise analysis, specifically of Photo Response Non-Uniformity (PRNU), is conducted in order to quantify noise levels inside individual patches and detect any irregularities in the distribution of noise patterns across patches. Prior research has also focused on analyzing artifacts originating from the Color Filter Array, attempting to utilize methodologies similar to those employed in noise analysis. In the current epoch of deep learning, endeavors have been undertaken to devise automated methodologies capable of seamlessly amalgamating diverse artifacts and proficiently identifying anomalies. Nevertheless, the efficacy of these techniques is significantly impacted by countermeasures such as noise reduction or addition, color correction, gamma correction, and other similar operations.
One of the key challenges in the field pertains to the limited availability of datasets, which hinders progress. Although there are several well-known datasets for detecting copy-move anomalies, the availability of datasets specifically designed for inpainting tasks, such as object removal or content-aware fill, is limited. Currently, there are just three datasets that fall into this particular area. The restricted accessibility of datasets poses a significant obstacle to the examination and enhancement of inpainting detection techniques, impeding prospective advancement in this captivating field of study. Hence, it is crucial that additional resources are allocated towards the creation and upkeep of extensive, high-quality inpainting datasets in order to advance the discipline.