<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">55739</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.055739</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Image Copy-Move Forgery Detection and Localization Method Based on Sequence-to-Sequence Transformer Structure</article-title>
<alt-title alt-title-type="left-running-head">Image Copy-Move Forgery Detection and Localization Method Based on Sequence-to-Sequence Transformer Structure</alt-title>
<alt-title alt-title-type="right-running-head">Image Copy-Move Forgery Detection and Localization Method Based on Sequence-to-Sequence Transformer Structure</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Hao</surname><given-names>Gang</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Liang</surname><given-names>Peng</given-names></name><email>liangpeng@gpnu.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Ziyuan</given-names></name></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Huimin</given-names></name></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Hong</given-names></name></contrib>
<aff id="aff-1"><institution>School of Computer Science, Guangdong Polytechnic Normal University</institution>, <addr-line>Guangzhou, 510630</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Peng Liang. Email: <email>liangpeng@gpnu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>06</day><month>03</month><year>2025</year>
</pub-date>
<volume>82</volume>
<issue>3</issue>
<fpage>5221</fpage>
<lpage>5238</lpage>
<history>
<date date-type="received">
<day>29</day>
<month>4</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>12</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_55739.pdf"></self-uri>
<abstract>
<p>In recent years, image copy-move forgery detection (CMFD) has become a critical challenge in verifying the authenticity of digital images, particularly as image manipulation techniques evolve rapidly. While deep convolutional neural networks (DCNNs) have been widely employed for CMFD tasks, they are hindered by a notable limitation: the progressive reduction in spatial resolution during encoding, which discards image details that are essential for accurately detecting and localizing copy-move forgery. To overcome this limitation, this paper proposes a Transformer-based approach to CMFD and localization as an alternative to conventional DCNN-based techniques. The proposed method employs a Transformer encoder to process images in a sequence-to-sequence manner, replacing the feature correlation calculations of previous methods with self-attention computations. This allows the model to capture long-range dependencies and contextual nuances within the image, preserving finer details that are typically lost in DCNN-based approaches. Moreover, an appropriate decoder is used to reconstruct image features precisely, thereby enhancing both detection accuracy and localization precision. Experimental results demonstrate that the proposed model achieves superior performance on benchmark datasets, such as USCISI, for image copy-move forgery detection. These results show the potential of Transformer architectures in advancing image forgery detection and offer promising directions for future research.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>CMFD</kwd>
<kwd>self-attention</kwd>
<kwd>transformer</kwd>
<kwd>deep convolutional neural networks</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62072123</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Initiatives in Guangdong Province</funding-source>
<award-id>2021B0101220006</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Key Field Projects for Ordinary Colleges and Universities</funding-source>
<award-id>2020ZDZX3059</award-id>
<award-id>2022ZDZX1012</award-id>
<award-id>2023ZDZX1008</award-id>
</award-group>
<award-group id="awg4">
<funding-source>Key R&#x0026;D Projects in Jiangxi Province</funding-source>
<award-id>20212BBE53002</award-id>
</award-group>
<award-group id="awg5">
<funding-source>Key R&#x0026;D Projects in Yichun City</funding-source>
<award-id>20211YFG4270</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The pervasive presence of graphic editing software in current society has allowed for the effortless and inexpensive creation of many realistic counterfeit images. Maliciously manipulated images can have significant adverse consequences, including their use in fraudulent activities, the dissemination of misinformation, the fabrication of evidence, and the misguiding of public opinion. Consequently, it is imperative to develop effective image tampering detection methodologies to assist individuals in determining whether an image has been altered and identifying the specific areas of manipulation.</p>
<p>Among image tampering techniques, copy-move forgery [<xref ref-type="bibr" rid="ref-1">1</xref>] involves duplicating a region within an image and relocating it elsewhere within the same image. In recent years, deep learning methods have become a prominent focus of copy-move forgery detection (CMFD) research because they involve fewer hyperparameters and offer greater versatility. However, applying these models to CMFD tasks presents several challenges. First, convolutional feature encoding discards a significant amount of detailed information, particularly for small targets. Second, the limited size of the convolution kernel restricts the receptive field of convolutional neural network (CNN) models, impeding their ability to capture long-range dependencies.</p>
<p>Recent research has demonstrated that the Transformer [<xref ref-type="bibr" rid="ref-2">2</xref>] model, widely utilized in natural language processing, can be effectively applied to various downstream tasks in computer vision. This development offers an alternative approach to the CMFD task beyond convolutional methods. Specifically, it maintains spatial resolution during image feature encoding, performs direct sequence-to-sequence self-attention calculations, and identifies task-relevant information from the global image context from the outset.</p>
<p>This paper introduces an approach for detecting and localizing image copy-move tampering. The encoding process uses a conventional Transformer encoder. A one-to-one feature matching module distinguishes tampered features from similar background features, and a multi-scale context decoder achieves more precise detection of image copy-move tampering. The primary contributions of this research are as follows:
<list list-type="simple">
<list-item><label>1)</label><p>A one-to-one feature matching module has been developed to mitigate the impact of similar background features on forged elements, while maintaining resilience to variations in image scale.</p></list-item>
<list-item><label>2)</label><p>The multi-scale context decoder consolidates tampering feature information at various levels of granularity through straightforward element-wise addition. This approach incorporates a broader range of potential tampering details, thereby enhancing the precision of tampered region detection.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Contemporary approaches for detecting image copy-move tampering predominantly fall into two categories: feature-based methods and deep learning-based methods [<xref ref-type="bibr" rid="ref-3">3</xref>]. Feature extraction-based methods involve extracting characteristics that represent an image&#x2019;s content or structure, and then utilizing the similarities or differences among these features to identify tampered areas. These methods can be further classified into block-based and key-point-based approaches [<xref ref-type="bibr" rid="ref-4">4</xref>]. The block-based method divides the image into overlapping or non-overlapping small blocks for feature extraction and comparison [<xref ref-type="bibr" rid="ref-5">5</xref>], employing techniques such as Zernike [<xref ref-type="bibr" rid="ref-6">6</xref>], DCT [<xref ref-type="bibr" rid="ref-7">7</xref>], and PCA [<xref ref-type="bibr" rid="ref-8">8</xref>]. The key-point-based method involves detecting salient or invariant key-points from an image, followed by feature extraction and matching for each key-point [<xref ref-type="bibr" rid="ref-9">9</xref>], using algorithms like SIFT [<xref ref-type="bibr" rid="ref-10">10</xref>], SURF [<xref ref-type="bibr" rid="ref-11">11</xref>], and FREAK [<xref ref-type="bibr" rid="ref-12">12</xref>]. While feature extraction-based methods generally offer faster detection speeds, they have limitations including sensitivity to feature selection and parameter settings, as well as limited adaptability to complex scenes and multiple tampering instances [<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>In the domain of image copy-move tampering detection, contemporary research primarily employs deep learning techniques. Wu et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] introduced BusterNet, an end-to-end deep neural network comprising two branches: Mani-Det and Simi-Det. The Mani-Det branch identifies tampered regions, while Simi-Det detects similarities between source and target areas, thus localizing them. However, BusterNet&#x2019;s Simi-Det branch extracts only low-resolution feature information through convolutional networks, and both branches must accurately locate the target area for correct source and target classification. To mitigate these limitations, Chen et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] extended BusterNet by fusing CMSDNet with STRDNet. They utilized a single-branch dual-network architecture for detecting similarities in source/target regions and incorporated mechanisms such as spatial pyramid pooling, spatial attention, and channel-wise attention to improve the model&#x2019;s similarity detection capabilities. Hu et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] developed SPAN, which incorporates a spatial pyramid attention framework to analyze image regions across various resolutions using localized self-attention mechanisms. While SPAN leverages local correlations, it does not fully exploit spatial correlations, which restricts the model&#x2019;s generalizability. DOA-GAN [<xref ref-type="bibr" rid="ref-17">17</xref>] employs a two-stage spatial attention mechanism to enhance the capture of location information and discriminative feature information of copied and moved objects, refining localization results through a generative adversarial network. However, its detection effectiveness for small tampered regions remains suboptimal. Dong et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed MVSS-Net, comprising an edge supervision branch and a noise-sensitive branch. 
These branches aim to capture subtle differences at the boundaries between tampered and untampered regions, as well as noise inconsistencies. By extracting semantic-agnostic features through multi-view feature learning, MVSS-Net obtains more generalized features, facilitating tamper detection while reducing false positives for authentic images.</p>
<p>Presently, the majority of deep learning methods rely on Deep Convolutional Neural Network (DCNN) models. While convolutional down-sampling serves to significantly lower the computational demands of feature correlation processes, it also results in the loss of certain detailed information. This loss can lead to suboptimal detection performance, particularly for small target regions.</p>
<p>There remains substantial potential for enhancing robustness against diverse attacks and augmenting the proficiency in discriminating source regions from target regions within image forensics. Motivated by the advancements in Transformer architectures and self-attention techniques within Natural Language Processing (NLP), researchers have increasingly applied these concepts to Computer Vision (CV). This approach aims to address the limitations of traditional passive forensic techniques and DCNN models, potentially offering more effective solutions for detecting and localizing image manipulations.</p>
<p>The Vision Transformer (ViT) [<xref ref-type="bibr" rid="ref-19">19</xref>] pioneered the application of the standard Transformer model to image classification in computer vision, demonstrating that despite the self-attention mechanism&#x2019;s lack of inductive biases inherent to DCNN, it can match or surpass DCNN performance in image classification tasks with large-scale pre-training. Subsequent research has extended the application of self-attention and Transformer models to other tasks, including object detection (DETR [<xref ref-type="bibr" rid="ref-20">20</xref>] and Deformable DETR [<xref ref-type="bibr" rid="ref-21">21</xref>]) and image segmentation (SETR [<xref ref-type="bibr" rid="ref-22">22</xref>] and TransUNet [<xref ref-type="bibr" rid="ref-23">23</xref>]), while also enhancing image classification performance (DeiT [<xref ref-type="bibr" rid="ref-24">24</xref>] and Swin Transformer [<xref ref-type="bibr" rid="ref-25">25</xref>]). Wang et al. introduced ObjectFormer [<xref ref-type="bibr" rid="ref-26">26</xref>], successfully incorporating Transformer into image tampering detection. However, this approach merely concatenates CNN and Transformer sequentially without effectively integrating their strengths. Additionally, its use of frequency domain features provides minimal benefit for CMFD. Addressing these limitations, our model proposes that a standard Transformer module with a core self-attention mechanism can efficiently identify regions within an image that have identical forms but differ in edge artifacts. This module, when paired with an appropriate decoder, can be directly applied to feature encoding for CMFD tasks.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>To elucidate the model design in this paper, we first examine the CMFD methods within the DCNN framework. This framework bears resemblance to the encoder-decoder structure of the Fully Convolutional Network (FCN) [<xref ref-type="bibr" rid="ref-27">27</xref>] for semantic segmentation, comprising three primary modules: a feature extractor based on CNN, a module for feature matching, and a decoder. These components are employed to extract features, compute feature similarity, and generate tampering masks, respectively. To enhance detection efficacy, the model may incorporate additional post-processing modules or advanced designs, such as edge detection, feature pyramids, and multiple feature fusion techniques.</p>
<p>The model proposed in this paper adheres to the encoder-decoder framework, with a notable modification. To preserve the original image resolution during feature extraction, a Transformer encoder is employed. This design, centered on multi-head self-attention (MSA), effectively fulfills both feature extraction and feature matching requirements. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> presents the complete architecture of the model.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Illustration of proposed CMFD transformer (CMFDTR) model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-1.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Feature Encoder</title>
<p>We represent the input image as <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, with <italic>H</italic> and <italic>W</italic> representing its height and width, respectively. The image <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>X</mml:mi></mml:math></inline-formula> undergoes an initial preprocessing step to transform it into a one-dimensional sequence suitable for input into a Transformer encoder. For feature extraction, a convolutional layer with a kernel size of <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>p</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>p</mml:mi></mml:math></inline-formula> and a stride of <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>p</mml:mi></mml:math></inline-formula> is employed. This operation effectively partitions the image into <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>N</mml:mi></mml:math></inline-formula> blocks of size <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>p</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>p</mml:mi></mml:math></inline-formula>. The number of blocks <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>N</mml:mi></mml:math></inline-formula> is determined by the formula: <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mi>H</mml:mi><mml:mi>p</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mi>p</mml:mi></mml:mfrac></mml:math></inline-formula>. 
Following this, the image patches are flattened, and each vectorized patch is mapped to a latent d-dimensional embedding space (set to 1024 in this study) using a linear projection layer <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>E</mml:mi></mml:math></inline-formula>, resulting in a one-dimensional sequence of image patch embeddings <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
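<p>The partitioning step can be illustrated with a minimal pure-Python sketch. A convolution with kernel size <italic>p</italic> and stride <italic>p</italic> applies the same linear map to every non-overlapping patch, so the sketch simply extracts and flattens the patches. It assumes that <italic>H</italic> and <italic>W</italic> are divisible by <italic>p</italic>, represents the image as nested lists, and omits the learnable projection <italic>E</italic>; the function names <monospace>num_patches</monospace> and <monospace>patchify</monospace> are illustrative and are not taken from the original implementation.</p>
<code language="python">
```python
def num_patches(height, width, p):
    """Number of non-overlapping p x p patches, N = (H / p) * (W / p)."""
    if height % p != 0 or width % p != 0:
        raise ValueError("H and W are assumed to be divisible by p")
    return (height // p) * (width // p)


def patchify(image, p):
    """Split an H x W x C nested-list image into N flattened patches.

    Patches are returned in row-major order; each has length p * p * C.
    """
    height, width = len(image), len(image[0])
    n = num_patches(height, width, p)  # also validates divisibility
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            flat = []
            for row in range(top, top + p):
                for col in range(left, left + p):
                    flat.extend(image[row][col])
            patches.append(flat)
    assert len(patches) == n
    return patches


# Toy 4 x 4 RGB image whose pixel value encodes its (row, col) position.
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
```
</code>
<p>For example, a 256 &#x00D7; 256 input with <italic>p</italic> = 16 (illustrative values, not necessarily the setting used in this study) gives <italic>N</italic> = 256 patches.</p>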
<p>To facilitate long-range modeling, encoding the spatial information of image patches is essential. Consequently, a learnable position embedding <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of the same dimension is added to each patch embedding, yielding the embedded sequence <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. This process can be expressed by the following equation:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
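<p>Eq. (1) is an element-wise sum, which the following sketch makes explicit. The position table here is a random stand-in for the learned parameters, the sizes are toy values, and the name <monospace>add_position_embedding</monospace> is hypothetical.</p>
<code language="python">
```python
import random


def add_position_embedding(patch_embeddings, position_embeddings):
    """Eq. (1): S'_i = S_p_i + Pos_i, applied token by token."""
    if len(patch_embeddings) != len(position_embeddings):
        raise ValueError("one position embedding is needed per token")
    return [
        [s + pos for s, pos in zip(token, pos_vec)]
        for token, pos_vec in zip(patch_embeddings, position_embeddings)
    ]


rng = random.Random(0)
n_tokens, d = 4, 8  # toy sizes; the paper uses d = 1024
s_p = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]
# Random stand-in for the learnable table Pos (updated during training).
pos = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]
s_prime = add_position_embedding(s_p, pos)
```
</code>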
<p>Subsequently, the embedded sequence <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is fed into the global Transformer encoder. The Transformer encoder consists of 24 stacked Transformer encoding blocks, each comprising MSA and multi-layer perceptron (MLP) layers. The final output of the global Transformer encoder is denoted <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, while the features produced by each block during encoding are denoted {<inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>}, as depicted in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
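<p>To show how every token can attend to every other token while the per-block outputs are retained, the sketch below implements a deliberately simplified encoder. It uses single-head scaled dot-product self-attention with a residual connection and leaves out multiple heads, layer normalization, and the MLP, so it is not the encoder used in this study. All weights are random stand-ins and all names are hypothetical.</p>
<code language="python">
```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def encode(tokens, layers):
    """Apply each layer with a residual connection and keep every output Z_l."""
    outputs = []
    z = tokens
    for w_q, w_k, w_v in layers:
        attended, _ = self_attention(z, w_q, w_k, w_v)
        z = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(z, attended)]
        outputs.append(z)
    return outputs


def rand_matrix(rng, rows, cols):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n_tokens, depth = 4, 6, 3  # toy sizes; the paper uses d = 1024 and 24 blocks
tokens = rand_matrix(rng, n_tokens, d)
layers = [tuple(rand_matrix(rng, d, d) for _ in range(3)) for _ in range(depth)]
z_all = encode(tokens, layers)  # z_all[l - 1] plays the role of Z_l
_, attn = self_attention(tokens, *layers[0])
```
</code>
<p>Each row of <monospace>attn</monospace> sums to one and spans all tokens, which is what allows a copied region to be related to its source regardless of their distance in the image.</p>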
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Feature Decoder</title>
<p>The model presented in this paper adheres to the encoder-decoder framework, utilizing the Transformer architecture for feature encoding in the encoder component. However, the extracted features encompass both similar backgrounds and forged regions. To mitigate interference from features of similar backgrounds, a common approach involves dividing the Transformer features into several sets of feature blocks, computing self-correlation scores for each set, and selecting the top k feature blocks with the highest scores. This method aims to reduce interference caused by similar backgrounds [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>]. However, this empirical approach demonstrates sensitivity to variations in image size.</p>
<p>Considering the inherent nature of image copy-move tampering detection, which involves identifying nearly identical forged targets based on the original source, a one-to-one matching result is preferable to a one-to-many matching outcome. To address this, the proposed model integrates a one-to-one feature matching module and utilizes a multi-scale context decoder to validate the method&#x2019;s efficacy.</p>
<p>The feature matching module aggregates the contextual information of each channel in the Transformer features through global average pooling, thereby reducing the feature map&#x2019;s dimensionality from <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>K</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>h</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>K</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. Subsequently, the channel weights are computed with a one-dimensional convolution, and the Sigmoid activation function constrains these weights to the interval (0, 1). Lastly, the weights are multiplied channel-wise with the original Transformer features, yielding features in which each channel is scaled by its weight.</p>
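<p>The three steps of this module (pooling, one-dimensional convolution with Sigmoid, and reweighting) can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the convolution is taken to run along the channel axis with zero padding and an odd kernel length, the kernel weights are hypothetical, and the multiplication is interpreted as scaling each channel by its own weight.</p>
<code language="python">
```python
import math


def global_average_pool(feature_map):
    """Reduce a feature map of K channels (each a 2-D grid) to K mean values."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1d_same(values, kernel):
    """1-D cross-correlation along the channel axis with zero padding.

    The kernel length is assumed to be odd so the output keeps K entries.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(values) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(values))
    ]


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_reweight(feature_map, kernel):
    """Scale every channel by a weight in (0, 1) derived from its pooled context."""
    pooled = global_average_pool(feature_map)
    weights = [sigmoid(v) for v in conv1d_same(pooled, kernel)]
    scaled = [
        [[w * value for value in row] for row in channel]
        for w, channel in zip(weights, feature_map)
    ]
    return scaled, weights


# Toy input: K = 3 channels of size 2 x 2 with constant values 1, 2 and 3.
features = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(3)]
kernel = [0.0, 1.0, 0.0]  # hypothetical weights; this kernel passes values through
reweighted, weights = channel_reweight(features, kernel)
```
</code>
<p>Because the weights depend only on per-channel averages, the module has no parameter tied to the spatial size of the feature map.</p>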
<p>The stacked architecture of the Transformer encoder enables each image patch&#x2019;s feature representation to incorporate information from other patches, allowing different encoding layers to capture forgery features at various levels of granularity. To integrate these multi-granularity forgery features, we have designed a CMFDTR-MLW decoder. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the CMFDTR-MLW decoder implements a unidirectional feature fusion strategy, facilitating information integration through a top-down pathway. This approach effectively enhances the flow of information within the Transformer encoder, thereby improving the model&#x2019;s ability to perceive and synthesize multi-level features. Specifically, we divide the 24 Transformer encoder blocks evenly into four groups and extract the embedded features (<inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>) from the final block of each group as input. 
These features are then reshaped into 3D features (<inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mn>6</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>) with dimensions <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mfrac><mml:mi>H</mml:mi><mml:mi>p</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mi>p</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula>. Subsequently, these reshaped features are processed through a feature matching module before being input into the CMFDTR-MLW decoder.</p>
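<p>The selection and reshaping steps can be sketched as below. The sketch assumes that the tokens are stored in row-major patch order and that any non-patch token has already been removed, neither of which is stated explicitly in the text; the function names are hypothetical.</p>
<code language="python">
```python
def select_group_outputs(layer_outputs, groups=4):
    """Return the output of the last block in each of `groups` equal groups."""
    depth = len(layer_outputs)
    if depth % groups != 0:
        raise ValueError("depth is assumed to be divisible by the group count")
    step = depth // groups
    # Block indices are 1-based in the text, so Z_6 is layer_outputs[5].
    return {idx: layer_outputs[idx - 1] for idx in range(step, depth + 1, step)}


def sequence_to_grid(tokens, grid_h, grid_w):
    """Reshape N row-major tokens of width C into a grid_h x grid_w x C grid."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Stand-ins for Z_1 ... Z_24: each holds N = 4 tokens with C = 2 features.
layer_outputs = [
    [[float(layer), float(token)] for token in range(4)] for layer in range(1, 25)
]
selected = select_group_outputs(layer_outputs)
grids = {idx: sequence_to_grid(z, 2, 2) for idx, z in selected.items()}
```
</code>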
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Illustration of CMFDTR-MLW decoder</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-2.tif"/>
</fig>
<p>Within the MLW decoder, feature maps are fused top-down through element-wise addition. The fusion starts from a copy of the highest-level feature map <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msubsup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mn>24</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> (denoted as <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>), which is added to the next-highest-level feature map; the result is in turn added to the next lower-level feature map, and so on. The feature map obtained after each addition is preserved (i.e., <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>). Subsequently, each of the four fused feature maps is processed by two successive 3&#x00D7;3 convolutions, which halve the number of channels and reduce it to 3, together with two 4&#x00D7; up-sampling operations. 
This process yields four candidate three-class tampering masks (<inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>) that carry different hierarchical information and match the dimensions of the original image. Each of the four fused feature maps is thus decoded separately, producing four decoding results. Finally, the union of the four decoding results is taken to enhance the detection accuracy of tampered regions.</p>
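<p>The cumulative addition and the final union can be sketched as follows. The convolution and up-sampling stages are omitted, and the candidate masks are assumed to have been reduced to binary form (1 for tampered, 0 for original) before the union; both are simplifications made for illustration, and the function names are hypothetical.</p>
<code language="python">
```python
def add_maps(a, b):
    """Element-wise sum of two grids with identical h x w x C shape."""
    return [
        [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def top_down_fusion(levels):
    """Fuse maps ordered from highest to lowest level by cumulative addition.

    The first output is the highest-level map itself; each later output adds
    the next lower-level map to the previous fused result.
    """
    fused = [levels[0]]
    for lower in levels[1:]:
        fused.append(add_maps(fused[-1], lower))
    return fused


def union_masks(masks):
    """Pixel-wise union (logical OR) of binary masks of equal size."""
    return [
        [int(any(pixels)) for pixels in zip(*rows)]
        for rows in zip(*masks)
    ]


# Toy 1 x 2 maps with C = 1, standing in for Z'_24, Z'_18, Z'_12 and Z'_6.
z24 = [[[1.0], [0.0]]]
z18 = [[[0.5], [0.5]]]
z12 = [[[0.0], [2.0]]]
z6 = [[[1.0], [1.0]]]
p24, p18, p12, p6 = top_down_fusion([z24, z18, z12, z6])

# Toy binary masks standing in for the four decoded results.
candidate_masks = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 1]],
    [[1, 0], [0, 0]],
    [[0, 0], [0, 0]],
]
final_mask = union_masks(candidate_masks)
```
</code>
<p>Taking the union keeps any pixel flagged by at least one level, which favors recall of tampered pixels.</p>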
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Loss Function</title>
<p>In the context of image copy-move tampering detection and localization, the model proposed in this paper conducts binary classification at the pixel level and utilizes the binary cross-entropy loss function (<inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>B</mml:mi><mml:mi>C</mml:mi><mml:mi>E</mml:mi><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula>) for network updates. The model&#x2019;s prediction mask is guided by pixel labels from a ground-truth mask, which corresponds to the original image&#x2019;s dimensions. In this mask, pixels labeled as 0 are classified as original, and those labeled as 1 are classified as tampered. The calculation formula for <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>B</mml:mi><mml:mi>C</mml:mi><mml:mi>E</mml:mi><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula> is presented as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>B</mml:mi><mml:mi>C</mml:mi><mml:mi>E</mml:mi><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>y</mml:mi></mml:math></inline-formula> denotes the label value of the ground-truth mask of the image and <inline-formula><mml:math><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> denotes the predicted probability that the pixel is tampered (not to be confused with the weighting coefficient <inline-formula><mml:math><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> in Eq. (4)).</p>
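<p>As an illustration of Eq. (2), the following minimal Python sketch evaluates the loss for a handful of pixels. The function names, the clamping constant <monospace>eps</monospace>, and the averaging over pixels are assumptions made for this example; they are not specified in the text.</p>
<preformat>
```python
import math


def bce_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one pixel, following Eq. (2).

    y   -- ground-truth label (0 = original, 1 = tampered)
    p   -- predicted probability of the tampered class (gamma in Eq. (2))
    eps -- hypothetical clamp that keeps log() finite; not part of Eq. (2)
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_bce(labels, probs):
    """Average the per-pixel loss over a flattened mask (assumed reduction)."""
    return sum(bce_loss(y, p) for y, p in zip(labels, probs)) / len(labels)


# Four illustrative pixels: two original, two tampered.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.8, 0.6]
print(round(mean_bce(labels, probs), 4))
```
</preformat>
<p>Confident, correct predictions (for example, a probability of 0.1 for an original pixel) contribute little to the loss, whereas uncertain ones (0.4 or 0.6) dominate it.</p>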
<p>The loss for the primary task of the proposed model is denoted as <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula>. Furthermore, the model includes an auxiliary decoding head designed to extract outputs from various layers of the Transformer encoder. These extracted results are processed through a fusion module and a decoder, and the auxiliary loss is computed by comparing the predicted mask with the ground-truth mask, represented as <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>i</mml:mi></mml:math></inline-formula> denotes the <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>i</mml:mi></mml:math></inline-formula>-th encoding block layer. Previous research has shown that incorporating auxiliary losses can enhance model training convergence [<xref ref-type="bibr" rid="ref-28">28</xref>]. 
The total loss <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula> of the model combines the main task loss <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:math></inline-formula> with the auxiliary losses <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>L</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mo>&#x2211;</mml:mo><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>This study attaches auxiliary decoding heads to the <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mn>15</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mn>20</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> layers of the Transformer encoder.</p>
<p>The MLW decoder ultimately generates four candidate three-class tampering masks (<inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>), each containing hierarchical information and matching the original image in size, by integrating the <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;and&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> features from the Transformer encoder. During training, the losses of all four masks enter the loss function, where they are combined by weighted summation into a composite cross-entropy loss. This enables more effective model training through back-propagation. The weighted summation is expressed by the following formula:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>6</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>18</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> are coefficients representing the proportional contribution of the four loss functions to the total decoder loss. 
In this study, these coefficients are assigned values of 0.1, 0.2, 0.3, and 0.4, respectively.</p>
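<p>Eqs. (3) and (4) can be sketched together in a few lines of Python. The weights follow the values stated above; the individual loss values, the dictionary layout, and the function names are hypothetical and serve only to show how the terms are combined.</p>
<preformat>
```python
# Weights alpha, beta, gamma, delta from the text, keyed by encoder layer.
DECODE_WEIGHTS = {6: 0.1, 12: 0.2, 18: 0.3, 24: 0.4}


def decode_loss(mask_losses):
    """Eq. (4): weighted sum of the losses of masks R6, R12, R18, R24."""
    return sum(DECODE_WEIGHTS[k] * mask_losses[k] for k in DECODE_WEIGHTS)


def total_loss(mask_losses, aux_losses):
    """Eq. (3): main decoder loss plus the unweighted sum of auxiliary losses."""
    return decode_loss(mask_losses) + sum(aux_losses.values())


# Hypothetical per-mask and per-head loss values, for illustration only.
mask_losses = {6: 0.9, 12: 0.7, 18: 0.5, 24: 0.4}
aux_losses = {10: 0.8, 15: 0.6, 20: 0.5, 24: 0.4}
print(decode_loss(mask_losses), total_loss(mask_losses, aux_losses))
```
</preformat>
<p>Because the weights grow with depth, the mask derived from the deepest layer contributes four times as much per unit of loss as the mask derived from layer 6.</p>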
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>The experiments were conducted on Ubuntu with an Intel i7-11700K CPU @ 3.60 GHz and an NVIDIA GeForce RTX 3090 GPU. The implementation uses the PyTorch framework and the MMCV library.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Setup</title>
<p><bold>Datasets:</bold> The performance of our model is evaluated on four widely recognized datasets in the field of image tamper detection: USCISI [<xref ref-type="bibr" rid="ref-14">14</xref>], CASIA v2.0 [<xref ref-type="bibr" rid="ref-29">29</xref>], DEFACTO [<xref ref-type="bibr" rid="ref-30">30</xref>], and COVERAGE [<xref ref-type="bibr" rid="ref-31">31</xref>]. In addition to the images, these datasets provide corresponding ground-truth masks that distinguish source, target, and background areas, which are shown in green, red, and blue, respectively. <xref ref-type="table" rid="table-1">Table 1</xref> presents the specific characteristics of these datasets.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Characteristics of the datasets</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Datasets</th>
<th align="center">Number of tampered images</th>
<th align="center">Training/<break/>validation/<break/>testing</th>
<th align="center">GT mask distinguishes<break/>source/target</th>
</tr>
</thead>
<tbody>
<tr>
<td>USCISI [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>100,000</td>
<td>80,000/10,000/10,000</td>
<td>Yes</td>
</tr>
<tr>
<td>CASIA CMFD [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>1311</td>
<td>0/0/1311</td>
<td>Yes</td>
</tr>
<tr>
<td>DEFACTO CMFD [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>7057</td>
<td>0/0/7057</td>
<td>Yes</td>
</tr>
<tr>
<td>COVERAGE [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>100</td>
<td>0/0/100</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The USCISI dataset, introduced by Wu et al. [<xref ref-type="bibr" rid="ref-14">14</xref>], is a synthetic compilation of digital image tampering instances focusing on copy-move forgeries. It comprises 100,000 samples, each associated with a binary classification mask that distinguishes between untampered and tampered areas for CMFD. In this study&#x2019;s experimental phase, 80,000 samples were randomly extracted from the USCISI dataset for training purposes, while 10,000 samples were allocated for validation, and an additional 10,000 samples were reserved for testing, adhering to an 8:1:1 division ratio.</p>
<p>The CASIA v2.0 dataset [<xref ref-type="bibr" rid="ref-29">29</xref>] comprises 5123 digitally manipulated images covering two types of forgery: splicing and copy-move tampering. It is a valuable resource for research in digital image forgery detection. Wu et al. [<xref ref-type="bibr" rid="ref-2">2</xref>] manually verified 1313 copy-move forged images within the CASIA v2.0 dataset and generated corresponding binary classification masks, establishing the CASIA-CMFD dataset. In this study, all samples from the CASIA-CMFD dataset are used as the test set to evaluate the proposed model.</p>
<p>DEFACTO [<xref ref-type="bibr" rid="ref-30">30</xref>] is a comprehensive dataset that uses objects from the Microsoft Common Objects in Context (MS COCO) database to automatically generate semantically meaningful forged images. The dataset encompasses three categories of forged images: splicing, copy-move, and inpainting forgeries. From this dataset, we verified and selected 7057 accurately labeled copy-move tampered images. To improve the precision of the annotations, we reprocessed the corresponding binary masks for these images, thereby establishing the DEFACTO-CMFD dataset. In the experiments, the entire DEFACTO-CMFD dataset serves as the test set to evaluate the proposed model.</p>
<p>COVERAGE [<xref ref-type="bibr" rid="ref-31">31</xref>] is a dataset specifically designed for CMFD in digital images, consisting of 100 images. The dataset employs a technique of superimposing similar objects onto original authentic images, presenting a significant challenge to human visual recognition. The alterations are subtle and difficult to detect without meticulous examination, as the forged objects are seamlessly integrated. The tampered elements in the dataset encompass a diverse range of items, including merchandise, fruits, furniture, and signage. The intricacy of the forgery details poses a substantial challenge to the generalization capabilities of various copy-move tampering detection models. In this study&#x2019;s experimental phase, all instances from the COVERAGE dataset serve as the test set to assess the performance of the proposed model.</p>
<p><bold>Model parameters:</bold> The parameters of the Transformer encoder were configured to match those of ViT_Large, as detailed in <xref ref-type="table" rid="table-2">Table 2</xref>. Additionally, batch normalization was implemented as the normalization method for each decoding head.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Parameter settings of the Transformer encoder</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Transformer encoder</th>
<th align="center">Number of stacked layers</th>
<th align="center">Patch embedding dimension</th>
<th align="center">Number of self-attention heads</th>
<th align="center">Patch size</th>
</tr>
</thead>
<tbody>
<tr>
<td>ViT_Large</td>
<td>24</td>
<td>1024</td>
<td>16</td>
<td>16 &#x00D7; 16</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For pre-training, this study utilizes the weights of vit_large_patch16_384 pre-trained on ImageNet, provided by the MMSegmentation [<xref ref-type="bibr" rid="ref-32">32</xref>] project, to initialize the image preprocessing and Transformer encoder modules. The decoder module, in contrast, is randomly initialized.</p>
<p><bold>Training strategy:</bold> For the training data, we apply the following standard data preprocessing techniques from MMSegmentation: (1) random scaling with a ratio range of 0.5 to 2.0; (2) random cropping to 256 &#x00D7; 256 pixels; (3) random horizontal flipping; (4) photometric distortion; and (5) image normalization.</p>
<p><bold>Training parameters:</bold> The training parameters are standardized across all models. The batch size is consistently set to 8. We utilize the SGD optimizer, setting the momentum and weight decay parameters to 0.9 and 0, respectively. The initial learning rate is established at 1e-3. For learning rate adjustment, we implement a polynomial decay strategy, setting the polynomial power to 0.9 and the minimum learning rate to 1e-4.</p>
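<p>The polynomial decay strategy described above can be sketched as follows. The formula shown is one common form of polynomial decay with a floor at the minimum learning rate; the function name and the total number of iterations (<monospace>MAX_STEPS</monospace>) are assumptions, since the schedule length is not stated in the text.</p>
<preformat>
```python
def poly_lr(step, max_steps, base_lr=1e-3, min_lr=1e-4, power=0.9):
    """Polynomial decay from base_lr to min_lr over max_steps iterations."""
    progress = min(step, max_steps) / max_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


MAX_STEPS = 80000  # hypothetical schedule length; not given in the text
for step in (0, 20000, 40000, 80000):
    print(step, poly_lr(step, MAX_STEPS))
```
</preformat>
<p>With a power of 0.9, the schedule is close to linear: the learning rate starts at 1e-3 and decreases steadily until it reaches 1e-4 at the final iteration.</p>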
<p><bold>Evaluation metrics:</bold> Choosing suitable evaluation metrics is essential for accurately gauging model performance. To quantify localization and other performance aspects, this study employs the most widely used evaluation metrics in the CMFD field, as established in previous literature [<xref ref-type="bibr" rid="ref-26">26</xref>]: precision, recall, and F1-score. Precision is the ratio of correctly identified positive instances to all instances predicted as positive. Recall is the proportion of actual positive instances that the model correctly identifies. The F1-score, the harmonic mean of precision and recall, summarizes both in a single measure.</p>
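<p>The following minimal Python sketch shows how pixel-wise precision, recall, and F1-score could be computed per class from a predicted and a ground-truth label mask. The label encoding (0 for background, 1 for source, 2 for target), the function name, and the pooling of all pixels into one count are assumptions for illustration; the text does not state whether scores are pooled over pixels or averaged per image.</p>
<preformat>
```python
def per_class_scores(pred, truth, classes=(0, 1, 2)):
    """Pixel-wise precision, recall and F1 for each class label.

    pred, truth -- flattened label masks of equal length
    classes     -- assumed encoding: 0 background, 1 source, 2 target
    """
    scores = {}
    for c in classes:
        tp = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, truth) if p != c and t == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores


# Eight illustrative pixels; two of them are misclassified.
truth = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 0, 2, 2]
for label, (p, r, f1) in per_class_scores(pred, truth).items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```
</preformat>
<p>Treating each class in turn as the positive class is what allows separate scores to be reported for the background, source, and target regions.</p>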
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Variant Performance Comparison</title>
<p>To assess the proposed model&#x2019;s capability in distinguishing and locating source/target regions, we evaluate the precision, recall, and F1-score of each model for three distinct categories: background, source, and target.</p>
<p><xref ref-type="table" rid="table-3">Table 3</xref> presents the experimental results of the decoder variants of the proposed model on the USCISI test set. The variants are CMFDTR-MLW (multi-layer weighting scheme), CMFDTR-Naive (one-step up-sampling), and CMFDTR-PUP (progressive up-sampling scheme). In the table, bold values indicate the best result for each metric in the comparison.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Test results of different variants of CMFDTR on the USCISI dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Categories</th>
<th>F1-score</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Background</td>
<td>97.47</td>
<td>96.45</td>
<td>98.5</td>
</tr>
<tr>
<td>CMFDTR-MLW</td>
<td>Source</td>
<td>74.39</td>
<td>84.17</td>
<td>66.64</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>80.26</td>
<td><bold>84.72</bold></td>
<td>76.25</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>97.38</td>
<td>96.22</td>
<td>98.57</td>
</tr>
<tr>
<td>CMFDTR-Naive</td>
<td>Source</td>
<td>73.07</td>
<td>84.5</td>
<td>64.36</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>79.26</td>
<td>84.43</td>
<td>74.68</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td><bold>97.74</bold></td>
<td><bold>96.86</bold></td>
<td><bold>98.64</bold></td>
</tr>
<tr>
<td>CMFDTR-PUP</td>
<td>Source</td>
<td><bold>77.29</bold></td>
<td><bold>86.86</bold></td>
<td><bold>69.62</bold></td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td><bold>82.94</bold></td>
<td>84.39</td>
<td><bold>81.55</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The CMFDTR-Naive decoder uses a single-step up-sampling process: a 3 &#x00D7; 3 convolution is applied to the feature map <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, followed by 16-fold up-sampling and batch normalization. In contrast, the CMFDTR-PUP decoder up-samples progressively, processing the feature map <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mn>24</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> through sequential 3 &#x00D7; 3 convolutions, gradual up-sampling, and batch normalization. Each up-sampling step doubles the size of the feature map from the previous step, so four operations are required to bring the feature map from size <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>H</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>16</mml:mn></mml:math></inline-formula> to full resolution.</p>
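<p>The difference between the two up-sampling schemes can be checked with simple arithmetic. With the 16 &#x00D7; 16 patch size of Table 2, a 256 &#x00D7; 256 training crop yields a 16 &#x00D7; 16 token grid, and four doublings give a factor of 16. The sketch below only tracks spatial sizes; the function names are hypothetical, and the convolutions and normalization layers are omitted.</p>
<preformat>
```python
def naive_size(h, w, patch=16):
    """Single-step decoder: token grid (h/patch, w/patch) upsampled 16x at once."""
    grid = (h // patch, w // patch)
    return [grid, (grid[0] * patch, grid[1] * patch)]


def pup_sizes(h, w, patch=16, steps=4):
    """Progressive decoder: each step doubles the spatial size."""
    sizes = [(h // patch, w // patch)]
    for _ in range(steps):
        last_h, last_w = sizes[-1]
        sizes.append((last_h * 2, last_w * 2))
    return sizes


print(naive_size(256, 256))
print(pup_sizes(256, 256))
```
</preformat>
<p>Both schemes end at the input resolution; the progressive variant passes through the intermediate sizes 32, 64, and 128 on the way.</p>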
<p>The experimental findings demonstrate that the Transformer encoder exhibits adaptability to various designed decoders, achieving comparable performance in source/target distinction tasks.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison with State-of-the-Art Methods on Source/Target Distinction</title>
<p>Comparative experiments were conducted to evaluate the proposed model against current state-of-the-art CMFD methods. The model was compared with four specialized CMFD models: BusterNet, DOA-GAN, CMSDNet, and MVSS-Net. The experiments used four publicly available CMFD datasets (USCISI, CASIA CMFD, DEFACTO CMFD, and COVERAGE) so that performance could be compared across diverse copy-move forgery data.</p>
<p>The USCISI dataset was employed by training the model on its training set and directly evaluating it on the test set. In contrast, for the DEFACTO CMFD, CASIA CMFD, and COVERAGE datasets, all samples were utilized as the test set for evaluation without any fine-tuning. This methodology enables a more comprehensive assessment of the model&#x2019;s generalization capabilities.</p>
<p>The pixel-wise localization performance of the compared models on the USCISI test set is shown in <xref ref-type="table" rid="table-4">Table 4</xref>. In the table, bold values indicate the best result for each metric in the comparison. Because the USCISI test set and the training data follow very similar distributions, every model achieves satisfactory three-class localization performance. The proposed CMFDTR outperforms BusterNet and CMSDNet on most metrics on the USCISI dataset, although it slightly underperforms DOA-GAN and MVSS-Net. Notably, the USCISI dataset contains relatively few complex tampered samples. Most of its tampered samples involve simple manipulations in which a source region is scaled by a certain factor, subjected to a minor random rotation, and then copied to a target area. DOA-GAN, which employs adversarial training through the competition between its generator and discriminator, captures the distributional characteristics of the data more effectively. Consequently, it exhibits superior detection performance on the USCISI dataset, where the data distribution is largely consistent. However, its performance in detecting complex tampered regions may decline, indicating limited generalization capability. MVSS-Net jointly exploits tampering boundary artifacts and the noise view of the input image to extract semantic-agnostic features, which better capture low-level cues. It improves detection specificity through multi-scale supervision at the cost of reduced detection sensitivity, which it compensates for through multi-view feature learning. While MVSS-Net demonstrates excellent detection capabilities on datasets with a highly consistent data distribution, its generalization performance may be relatively poor on data that diverges from the training distribution. On the CASIA CMFD, DEFACTO CMFD, and COVERAGE datasets, the detection metrics of CMFDTR generally surpass those of DOA-GAN and MVSS-Net, indicating that the proposed CMFDTR generalizes better than these two models. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> compares the prediction results of BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and our proposed method on the USCISI dataset.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Source/target distinction results of the compared models on the USCISI dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Categories</th>
<th>F1-score</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Background</td>
<td>96.03</td>
<td>94.35</td>
<td>97.77</td>
</tr>
<tr>
<td>BusterNet [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>Source</td>
<td>60.33</td>
<td>65.86</td>
<td>55.66</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>77.76</td>
<td><bold>84.72</bold></td>
<td>71.87</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td><bold>97.94</bold></td>
<td><bold>96.89</bold></td>
<td><bold>99.0</bold></td>
</tr>
<tr>
<td>DOA-GAN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>Source</td>
<td><bold>81.83</bold></td>
<td>84.18</td>
<td><bold>79.6</bold></td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td><bold>86.27</bold></td>
<td>84.08</td>
<td><bold>88.57</bold></td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>96.44</td>
<td>95.04</td>
<td>97.89</td>
</tr>
<tr>
<td>CMSDNet [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>Source</td>
<td>63.57</td>
<td>70.97</td>
<td>57.58</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>34.82</td>
<td>59.70</td>
<td>24.75</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>97.74</td>
<td>96.74</td>
<td>98.75</td>
</tr>
<tr>
<td>MVSS-Net [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>Source</td>
<td>80.42</td>
<td>81.6</td>
<td>79.27</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>85.32</td>
<td>84.42</td>
<td>86.24</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>97.47</td>
<td>96.45</td>
<td>98.5</td>
</tr>
<tr>
<td>CMFDTR-MLW</td>
<td>Source</td>
<td>74.39</td>
<td>84.17</td>
<td>66.64</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>80.26</td>
<td><bold>84.72</bold></td>
<td>76.25</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>97.38</td>
<td>96.22</td>
<td>98.57</td>
</tr>
<tr>
<td>CMFDTR-Naive</td>
<td>Source</td>
<td>73.07</td>
<td>84.5</td>
<td>64.36</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>79.26</td>
<td>84.43</td>
<td>74.68</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>97.74</td>
<td>96.86</td>
<td>98.64</td>
</tr>
<tr>
<td>CMFDTR-PUP</td>
<td>Source</td>
<td>77.29</td>
<td><bold>86.86</bold></td>
<td>69.62</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>82.94</td>
<td>84.39</td>
<td>81.55</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Comparison of prediction results on the USCISI dataset using BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and the proposed method (blue represents the background, green represents the source, and red represents the target)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-3.tif"/>
</fig>
<p><xref ref-type="table" rid="table-5">Table 5</xref> displays the pixel-wise localization results of the compared models on the CASIA CMFD test set. In the table, bold values indicate the best result for each metric in the comparison. The dataset includes samples with copy-move forgeries in highly similar or semantically ambiguous background regions, as well as instances where the copied and pasted regions overlap. In contrast, the USCISI dataset seldom contains such intricate tampered samples. Consequently, a model trained exclusively on the USCISI training set may perform suboptimally when evaluated on the CASIA CMFD test set.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Source/target distinction results of the compared models on the CASIA CMFD dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Categories</th>
<th>F1-score</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Background</td>
<td>94.58</td>
<td>90.64</td>
<td>98.89</td>
</tr>
<tr>
<td>BusterNet [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>Source</td>
<td>13.73</td>
<td>25.96</td>
<td>9.33</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>1.32</td>
<td>20.76</td>
<td>0.68</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.17</td>
<td>91.29</td>
<td>99.05</td>
</tr>
<tr>
<td>DOA-GAN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>Source</td>
<td>20.17</td>
<td><bold>45.02</bold></td>
<td>13.0</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>13.23</td>
<td>37.45</td>
<td>8.03</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>94.72</td>
<td>90.52</td>
<td><bold>99.34</bold></td>
</tr>
<tr>
<td>CMSDNet [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>Source</td>
<td>10.74</td>
<td>43.52</td>
<td>6.12</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>1.75</td>
<td>24.52</td>
<td>0.91</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>94.65</td>
<td>90.41</td>
<td>99.32</td>
</tr>
<tr>
<td>MVSS-Net [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>Source</td>
<td>8.55</td>
<td>32.27</td>
<td>4.93</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>6.19</td>
<td>33.42</td>
<td>3.41</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td><bold>95.98</bold></td>
<td><bold>93.59</bold></td>
<td>98.49</td>
</tr>
<tr>
<td>CMFDTR-MLW</td>
<td>Source</td>
<td><bold>24.55</bold></td>
<td>32.68</td>
<td><bold>19.66</bold></td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>9.92</td>
<td>54.18</td>
<td>5.46</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.97</td>
<td>93.55</td>
<td>98.51</td>
</tr>
<tr>
<td>CMFDTR-Naive</td>
<td>Source</td>
<td>22.38</td>
<td>31.55</td>
<td>17.34</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>11.93</td>
<td>47.85</td>
<td>6.82</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.89</td>
<td>93.4</td>
<td>98.51</td>
</tr>
<tr>
<td>CMFDTR-PUP</td>
<td>Source</td>
<td>21.61</td>
<td>31.73</td>
<td>16.39</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td><bold>13.37</bold></td>
<td><bold>55.97</bold></td>
<td><bold>7.59</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The experimental results confirm that the proposed CMFDTR surpasses the compared models (BusterNet, CMSDNet, DOA-GAN, and MVSS-Net) on most performance metrics on the CASIA CMFD dataset. This indicates enhanced generalization capability and improved performance in detecting tampered regions. As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, we compare the prediction results on the CASIA CMFD test set using various methods, including BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and our proposed method.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Comparison of prediction results on the CASIA CMFD dataset using BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and the proposed method (blue represents the background, green represents the source, and red represents the target)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-4.tif"/>
</fig>
<p>The pixel-wise localization performance of each compared model on the DEFACTO CMFD test set is detailed in <xref ref-type="table" rid="table-6">Table 6</xref>. In the table, bold values indicate the best result for each metric in the comparison. Importantly, the DEFACTO CMFD dataset differs substantially in data distribution from the USCISI dataset: it includes more complex samples that are challenging for human visual perception, smaller target samples, and potentially misleading instances. It is also larger than the CASIA CMFD dataset. The experimental results indicate that the proposed CMFDTR outperforms the compared models (BusterNet, CMSDNet, DOA-GAN, and MVSS-Net) on the DEFACTO CMFD dataset for the majority of evaluation metrics. Notably, its recall is considerably higher than that of the compared methods, suggesting that the proposed method, through a one-to-one matching strategy, effectively filters out incorrectly matched features. This demonstrates that the CMFDTR possesses stronger generalization capabilities and superior performance in detecting tampered regions. As shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, we compare the prediction results on the DEFACTO CMFD test set using various methods, including BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and our proposed method.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Source/target distinguishment test results of comparison model on the DEFACTO CMFD dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Categories</th>
<th>F1-score</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Background</td>
<td>95.11</td>
<td>91.53</td>
<td>98.99</td>
</tr>
<tr>
<td>BusterNet [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>Source</td>
<td>18.37</td>
<td>26.85</td>
<td>13.96</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>1.23</td>
<td>28.72</td>
<td>0.63</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.78</td>
<td>92.38</td>
<td><bold>99.44</bold></td>
</tr>
<tr>
<td>DOA-GAN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>Source</td>
<td>28.2</td>
<td><bold>48.52</bold></td>
<td>19.88</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>16.76</td>
<td>47.21</td>
<td>10.19</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.22</td>
<td>91.51</td>
<td>99.25</td>
</tr>
<tr>
<td>CMSDNet [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>Source</td>
<td>11.29</td>
<td>32.7</td>
<td>6.82</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>3.47</td>
<td>29.24</td>
<td>1.85</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.12</td>
<td>91.27</td>
<td>99.32</td>
</tr>
<tr>
<td>MVSS-Net [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>Source</td>
<td>14.46</td>
<td>37.09</td>
<td>8.98</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>11.19</td>
<td>49.39</td>
<td>6.31</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td><bold>96.14</bold></td>
<td><bold>93.75</bold></td>
<td>98.67</td>
</tr>
<tr>
<td>CMFDTR-MLW</td>
<td>Source</td>
<td><bold>31.07</bold></td>
<td>36.67</td>
<td><bold>26.96</bold></td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>14.39</td>
<td><bold>61.97</bold></td>
<td>8.15</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>95.99</td>
<td>93.48</td>
<td>98.64</td>
</tr>
<tr>
<td>CMFDTR-Naive</td>
<td>Source</td>
<td>26.39</td>
<td>33.91</td>
<td>21.61</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>16.72</td>
<td>59.06</td>
<td>9.74</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>96.0</td>
<td>93.39</td>
<td>98.75</td>
</tr>
<tr>
<td>CMFDTR-PUP</td>
<td>Source</td>
<td>25.5</td>
<td>35.6</td>
<td>19.86</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td><bold>18.93</bold></td>
<td>58.3</td>
<td><bold>11.3</bold></td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Comparison of prediction results on the DEFACTO CMFD dataset using BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and the proposed method (Blue represents the background, green represents the source, and red represents the target)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-5.tif"/>
</fig>
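<p>For readers who wish to reproduce scores of the kind reported in the tables, the following is a minimal sketch of per-class pixel-wise precision, recall, and F1-score on three-class label maps. It is not the evaluation code used in this study: the label encoding (0 for background, 1 for source, 2 for target), the function names <monospace>per_class_scores</monospace> and <monospace>f1_from</monospace>, and the choice to count pixels over a single pair of masks are all assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
import numpy as np

# Assumed label encoding (not specified in the text):
# 0 = background, 1 = source, 2 = target.
CLASS_NAMES = ("Background", "Source", "Target")


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def per_class_scores(pred, truth, num_classes=3):
    """Return {class name: (precision, recall, f1)} in percent.

    pred and truth are integer label maps of identical shape.
    """
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    scores = {}
    for c in range(num_classes):
        tp = int(np.sum((pred == c) & (truth == c)))  # correctly labelled as c
        fp = int(np.sum((pred == c) & (truth != c)))  # wrongly labelled as c
        fn = int(np.sum((pred != c) & (truth == c)))  # pixels of c that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_from(precision, recall)
        scores[CLASS_NAMES[c]] = (100 * precision, 100 * recall, 100 * f1)
    return scores


# Toy 2x4 masks, purely illustrative.
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 2, 2]])
pred = np.array([[0, 0, 1, 0],
                 [0, 2, 2, 2]])
scores = per_class_scores(pred, truth)
```
]]></code>
<p>As a consistency check on the reported numbers, <monospace>f1_from(26.85, 13.96)</monospace> rounds to 18.37, which agrees with the F1-score listed for the BusterNet source class in <xref ref-type="table" rid="table-6">Table 6</xref>. Whether the published scores are accumulated over the whole test set or averaged per image is not stated here, so the sketch should be adapted accordingly.</p>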
<p><xref ref-type="table" rid="table-7">Table 7</xref> summarizes the pixel-wise localization performance of each compared model on the COVERAGE test set; bold values indicate the best result for each metric. The forged images in the COVERAGE dataset are created by overlaying similar objects onto the original real images. The resulting forgeries are finely detailed, which makes them difficult for human observers to recognize and particularly demanding for image tamper detection and localization models. The experimental results show that the proposed CMFDTR surpasses BusterNet, CMSDNet, and MVSS-Net on most detection metrics, but its source and target F1-scores and recall remain below those of DOA-GAN. Analysis of the visualization results suggests that this gap is attributable to the misclassification of a portion of the forged source and target regions. However, considering the detection results across the four test sets used in this study, the proposed CMFDTR exhibits stronger overall generalization and better performance in detecting tampered regions in unseen tampered images. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> compares the prediction results of BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and the proposed method on the COVERAGE test set.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Source/target distinguishment test results of comparison model on the COVERAGE dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Categories</th>
<th>F1-score</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Background</td>
<td>87.29</td>
<td>78.33</td>
<td>98.57</td>
</tr>
<tr>
<td>BusterNet [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>Source</td>
<td>21.06</td>
<td>38.12</td>
<td>14.55</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>1.44</td>
<td>34.37</td>
<td>0.73</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>88.43</td>
<td><bold>81.26</bold></td>
<td>96.99</td>
</tr>
<tr>
<td>DOA-GAN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>Source</td>
<td><bold>33.84</bold></td>
<td>49.08</td>
<td><bold>25.83</bold></td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td><bold>22.24</bold></td>
<td>49.89</td>
<td><bold>14.31</bold></td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>88.02</td>
<td>80.14</td>
<td>97.63</td>
</tr>
<tr>
<td>CMSDNet [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>Source</td>
<td>23.21</td>
<td>36.63</td>
<td>16.99</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>2.88</td>
<td>39.58</td>
<td>1.5</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>87.14</td>
<td>77.61</td>
<td>99.32</td>
</tr>
<tr>
<td>MVSS-Net [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>Source</td>
<td>17.19</td>
<td>63.28</td>
<td>9.95</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>13.15</td>
<td>62.77</td>
<td>7.35</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td><bold>89.12</bold></td>
<td>80.89</td>
<td>99.21</td>
</tr>
<tr>
<td>CMFDTR-MLW</td>
<td>Source</td>
<td>33.26</td>
<td>60.22</td>
<td>22.98</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>12.87</td>
<td><bold>71.75</bold></td>
<td>7.07</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>88.59</td>
<td>80.12</td>
<td>99.07</td>
</tr>
<tr>
<td>CMFDTR-Naive</td>
<td>Source</td>
<td>30.05</td>
<td>61.42</td>
<td>19.89</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>10.78</td>
<td>63.81</td>
<td>5.89</td>
</tr>
<tr>
<td></td>
<td>Background</td>
<td>87.91</td>
<td>78.8</td>
<td><bold>99.39</bold></td>
</tr>
<tr>
<td>CMFDTR-PUP</td>
<td>Source</td>
<td>19.31</td>
<td><bold>66.18</bold></td>
<td>11.3</td>
</tr>
<tr>
<td></td>
<td>Target</td>
<td>10.46</td>
<td>64.42</td>
<td>5.69</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparison of prediction results on the COVERAGE dataset using BusterNet, DOA-GAN, CMSDNet, MVSS-Net, and the proposed method (Blue represents the background, green represents the source, and red represents the target)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55739-fig-6.tif"/>
</fig>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>This paper introduces a Transformer structure based on sequence-to-sequence modeling and proposes an end-to-end model, CMFDTR, designed specifically for the characteristics of the CMFD task. In comparison to existing DCNN methods, CMFDTR eliminates the reliance on image down-sampling throughout the feature encoding process, mitigating the loss of detail commonly associated with DCNN approaches. Furthermore, the multi-head mechanism of MSA enhances the model&#x2019;s global context modeling capabilities, and the self-attention mechanism is better suited to copy-move forgery characteristics than traditional feature correlation matching calculations. Experimental results show that our model outperforms other advanced techniques on most metrics across the USCISI, DEFACTO CMFD, CASIA CMFD, and COVERAGE datasets, indicating the high adaptability and promising potential of the Transformer for the CMFD task. Future work will focus on extending the current approach, enhancing the Transformer encoder, and refining the model for more precise detection and localization of copy-move forgery.</p>
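<p>The affinity between self-attention and copy-move characteristics can be illustrated with a toy sketch: when two patch tokens are identical, the attention weights of each one peak at the other. The code below is not the CMFDTR encoder. It assumes identity query and key projections, L2-normalized tokens, random vectors standing in for patch embeddings, and a masked diagonal so that a token's match with itself is ignored; the function name <monospace>self_attention_weights</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def self_attention_weights(tokens, mask_self=True):
    """Row-wise softmax of scaled dot products between all token pairs.

    Assumptions: identity query/key projections and L2-normalized tokens,
    so the scores are scaled cosine similarities. A trained model would
    learn its own projections instead.
    """
    tokens = np.asarray(tokens, dtype=float)
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    if mask_self:
        # Illustrative choice: hide each token's match with itself so the
        # strongest remaining response points at another position.
        np.fill_diagonal(scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))  # six hypothetical patch embeddings
tokens[4] = tokens[1]              # patch 4 is an exact copy of patch 1

weights = self_attention_weights(tokens)
best_match = weights.argmax(axis=1)  # most-attended position for each token
```
]]></code>
<p>In this sketch <monospace>best_match[1]</monospace> is 4 and <monospace>best_match[4]</monospace> is 1, because the cosine similarity between a vector and its exact copy is the largest value a row can contain. Real forgeries involve scaling, rotation, and post-processing, so the duplicated tokens are only approximately equal and the learned projections carry the burden of matching them.</p>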
</sec>
</body>
<back>
<ack>
<p>The authors would like to express appreciation to the National Natural Science Foundation of China, Department of Science and Technology of Guangdong Province, China, and Education Department of Guangdong Province, China for their financial support.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The research received financial support from the General Program of the National Natural Science Foundation of China (Grant No. 62072123), Key R&#x0026;D Initiatives in Guangdong Province (Grant No. 2021B0101220006), the Guangdong Provincial Department of Education&#x2019;s Key Field Projects for Ordinary Colleges and Universities (Grant Nos. 2020ZDZX3059, 2022ZDZX1012, 2023ZDZX1008), Key R&#x0026;D Projects in Jiangxi Province (Grant No. 20212BBE53002), and Key R&#x0026;D Projects in Yichun City (Grant No. 20211YFG4270).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm their contributions to the paper as follows: study conception and design: Gang Hao, Peng Liang; data collection: Gang Hao, Ziyuan Li and Hong Zhang; analysis and interpretation of results: Gang Hao, Peng Liang, Huimin Zhao, Ziyuan Li and Hong Zhang; draft manuscript preparation: Gang Hao, Ziyuan Li and Hong Zhang. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are openly available at: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/defactodataset/defactocopymove">https://www.kaggle.com/datasets/defactodataset/defactocopymove</ext-link>, accessed on 10 December 2023; <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset">https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset</ext-link>, accessed on 02 January 2024; <ext-link ext-link-type="uri" xlink:href="https://drive.google.com/file/d/1gsx5c-oilsFEzX_j1zKTPP4yWEs6T385/view?usp=sharing">https://drive.google.com/file/d/1gsx5c-oilsFEzX_j1zKTPP4yWEs6T385/view?usp=sharing</ext-link>, accessed on 15 January 2024.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>No human or animal subjects were involved, and thus ethical approval was not required.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>This research involves the detection and localization of copy-move forgery in digital images using publicly available datasets. All data utilized comply with relevant privacy and data protection regulations.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fridrich</surname> <given-names>J</given-names></string-name>, <string-name><surname>Soukal</surname> <given-names>D</given-names></string-name>, <string-name><surname>Lukas</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Detection of copy-move forgery in digital images</article-title>. In: <conf-name>Proceedings of Digital Forensic Research Workshop (DFRWS)</conf-name>. <publisher-loc>Cleveland</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2003</year>. p. <fpage>67</fpage>&#x2013;<lpage>84</lpage>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>NM</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Attention is all you need</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2017</year>:<fpage>30</fpage>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Warif</surname> <given-names>NBA</given-names></string-name>, <string-name><surname>Idris</surname> <given-names>MYI</given-names></string-name>, <string-name><surname>Wahab</surname> <given-names>AWA</given-names></string-name>, <string-name><surname>Ismail</surname> <given-names>N-SN</given-names></string-name>, <string-name><surname>Salleh</surname> <given-names>R</given-names></string-name></person-group>. <article-title>A comprehensive evaluation procedure for copy-move forgery detection methods: results from a systematic review</article-title>. <source>Multimed Tools Appl</source>. <year>2022</year>;<volume>81</volume>(<issue>11</issue>):<fpage>15171</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-022-12010-2</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tan</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name></person-group>. <article-title>A survey on digital image copy-move forgery localization using passive techniques</article-title>. <source>J New Media</source>. <year>2019</year>;<volume>1</volume>(<issue>1</issue>):<fpage>11</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.32604/jnm.2019.06219</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Gurunlu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Ozturk</surname> <given-names>S</given-names></string-name></person-group>. <chapter-title>Efficient approach for block-based copy-move forgery detection</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Zhang</surname> <given-names>YD</given-names></string-name>, <string-name><surname>Senjyu</surname> <given-names>T</given-names></string-name>, <string-name><surname>So-In</surname> <given-names>C</given-names></string-name>, <string-name><surname>Joshi</surname> <given-names>A</given-names></string-name></person-group>, editors. <source>Smart trends in computing and communications. Lecture notes in networks and systems</source>. Vol. <volume>286</volume>. <publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. doi:<pub-id pub-id-type="doi">10.1007/978-981-16-4016-2_16</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ryu</surname> <given-names>SJ</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>MJ</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>HK</given-names></string-name></person-group>. <article-title>Detection of copy-rotate-move forgery using Zernike moments</article-title>. In: <conf-name>Information Hiding: 12th International Conference, IH 2010</conf-name>; <year>2010 Jun 28&#x2013;30</year>; <publisher-loc>Calgary, AB, Canada</publisher-loc>: <publisher-name>Revised Selected Papers 12. Springer Berlin Heidelberg</publisher-name>; <year>2010</year>. p. <fpage>51</fpage>&#x2013;<lpage>65</lpage>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mahmood</surname> <given-names>T</given-names></string-name>, <string-name><surname>Nawaz</surname> <given-names>T</given-names></string-name>, <string-name><surname>Irtaza</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Copy-move forgery detection technique for forensic analysis in digital images</article-title>. <source>Math Probl Eng</source>. <year>2016</year>;<volume>2016</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1155/2016/8713202</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng-Yuan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ching-Ning</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wu-chih</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Robustness of copy-move forgery detection under high JPEG compression artifacts</article-title>. <source>Multimed Tools Appl</source>. <year>2017</year>;<volume>76</volume>(<issue>1</issue>):<fpage>1509</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-015-3152-x</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Fast and effective image copy-move forgery detection via hierarchical feature point matching</article-title>. <source>IEEE Trans Inf Forensics Secur</source>. <year>2019 May</year>;<volume>14</volume>(<issue>5</issue>):<fpage>1307</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIFS.2018.2876837</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Amerini</surname> <given-names>I</given-names></string-name>, <string-name><surname>Ballan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Caldelli</surname> <given-names>R</given-names></string-name></person-group>. <article-title>A sift-based forensic method for copy-move attack detection and transformation recovery</article-title>. <source>IEEE Trans Inf Forensics Secur</source>. <year>2011</year>;<volume>6</volume>(<issue>3</issue>):<fpage>1099</fpage>&#x2013;<lpage>110</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIFS.2011.2129512</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shivakumar</surname> <given-names>BL</given-names></string-name>, <string-name><surname>Baboo</surname> <given-names>SS</given-names></string-name></person-group>. <article-title>Detection of region duplication forgery in digital images using SURF</article-title>. <source>Int J Comput Sci Issues (IJCSI)</source>. <year>2011</year>;<volume>8</volume>(<issue>4</issue>):<fpage>199</fpage>&#x2013;<lpage>205</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Diwan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>R</given-names></string-name>, <string-name><surname>Roy</surname> <given-names>AK</given-names></string-name></person-group>. <article-title>Keypoint based comprehensive copy-move forgery detection</article-title>. <source>IET Image Process</source>. <year>2021</year>;<volume>15</volume>(<issue>6</issue>):<fpage>1298</fpage>&#x2013;<lpage>309</lpage>. doi:<pub-id pub-id-type="doi">10.1049/ipr2.12105</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Barad</surname> <given-names>ZJ</given-names></string-name>, <string-name><surname>Goswami</surname> <given-names>MM</given-names></string-name></person-group>. <article-title>Image forgery detection using deep learning: a survey</article-title>. In: <conf-name>Proceedings of the 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</conf-name>. <publisher-loc>Coimbatore, India</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2020</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Abd-Almageed</surname> <given-names>W</given-names></string-name>, <string-name><surname>Natarajan</surname> <given-names>P</given-names></string-name></person-group>. <article-title>BusterNet: detecting copy-move image forgery with source/target localization</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>; <publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2018</year>. p. <fpage>168</fpage>&#x2013;<lpage>84</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>W</given-names></string-name>, <string-name><surname>Coatrieux</surname> <given-names>G</given-names></string-name></person-group>. <article-title>A serial image copy-move forgery localization scheme with source/target distinguishment</article-title>. <source>IEEE Trans Multimedia</source>. <year>2020</year>;<volume>23</volume>:<fpage>3506</fpage>&#x2013;<lpage>17</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2020.3026868</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Span: spatial pyramid attention network for image manipulation localization</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision, ECCV 2020</conf-name>. <publisher-loc>Glasgow, UK</publisher-loc>; <year>2020 Aug 23&#x2013;28</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Islam</surname> <given-names>A</given-names></string-name>, <string-name><surname>Long</surname> <given-names>G</given-names></string-name>, <string-name><surname>Basharat</surname> <given-names>A</given-names></string-name></person-group>. <article-title>DOA-GAN: dual-order attentive generative adversarial network for image copy-move forgery detection and localization</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2020</year>. p. <fpage>4676</fpage>&#x2013;<lpage>85</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name></person-group>. <article-title>MVSS-Net: multi-view multi-scale supervised networks for image manipulation detection</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2023 Mar 1</year>;<volume>45</volume>(<issue>3</issue>):<fpage>3539</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2022.3180556</pub-id>; <pub-id pub-id-type="pmid">35671312</pub-id></mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Beyer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kolesnikov</surname> <given-names>A</given-names></string-name></person-group>. <article-title>An image is worth 16x16 words: transformers for image recognition at scale</article-title>. In: <conf-name>Proceedings of the International Conference on Learning Representations</conf-name>; <year>2021 May 4&#x2013;8</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Carion</surname> <given-names>N</given-names></string-name>, <string-name><surname>Massa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Synnaeve</surname> <given-names>G</given-names></string-name></person-group>. <article-title>End-to-end object detection with transformers</article-title>. In: <conf-name>Computer Vision&#x2013;ECCV 2020: 16th European Conference; 2020 Aug 23&#x2013;28</conf-name>; <publisher-loc>Glasgow, UK</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>213</fpage>&#x2013;<lpage>29</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Su</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deformable DETR: deformable transformers for end-to-end object detection</article-title>. In: <conf-name>Proceedings of the International Conference on Learning Representations</conf-name>; <year>2021 May 4&#x2013;8</year>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>SX</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>JC</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>HS</given-names></string-name></person-group>. <article-title>Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>6881</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>JN</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>YY</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>QH</given-names></string-name></person-group>. <article-title>TransUNet: transformers make strong encoders for medical image segmentation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cord</surname> <given-names>M</given-names></string-name>, <string-name><surname>Douze</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Training data-efficient image transformers &#x0026; distillation through attention</article-title>. In: <conf-name>Proceedings of the 38th International Conference on Machine Learning</conf-name>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>PMLR</publisher-name>; <year>2021</year>. p. <fpage>10347</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>YT</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>10012</fpage>&#x2013;<lpage>22</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name></person-group>. <article-title>ObjectFormer for image manipulation detection and localization</article-title>. In: <conf-name>Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2022 Jun 18&#x2013;24</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Long</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shelhamer</surname> <given-names>E</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Fully convolutional networks for semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2015</year>. p. <fpage>3431</fpage>&#x2013;<lpage>40</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Pyramid scene parsing network</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2017 Jul 21&#x2013;26</year>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>2881</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>T</given-names></string-name></person-group>. <article-title>CASIA image tampering detection evaluation database</article-title>. In: <conf-name>Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing</conf-name>; <year>2013 Jul 6&#x2013;10</year>; <publisher-loc>Beijing, China</publisher-loc>. p. <fpage>422</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Mahfoudi</surname> <given-names>G</given-names></string-name>, <string-name><surname>Tajini</surname> <given-names>B</given-names></string-name>, <string-name><surname>Retraint</surname> <given-names>F</given-names></string-name>, <string-name><surname>Morain-Nicolier</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dugelay</surname> <given-names>JL</given-names></string-name>, <string-name><surname>Pic</surname> <given-names>M</given-names></string-name></person-group>. <article-title>DEFACTO: image and face manipulation dataset</article-title>. In: <conf-name>Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO)</conf-name>; <year>2019 Sep 2&#x2013;6</year>; <publisher-loc>A Coru&#x00F1;a, Spain</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wen</surname> <given-names>BH</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Subramanian</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ng</surname> <given-names>TT</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>XJ</given-names></string-name>, <string-name><surname>Winkler</surname> <given-names>S</given-names></string-name></person-group>. <article-title>COVERAGE&#x2014;a novel database for copy-move forgery detection</article-title>. In: <conf-name>2016 IEEE International Conference on Image Processing (ICIP)</conf-name>; <year>2016</year>; <publisher-loc>Phoenix, AZ, USA</publisher-loc>. p. <fpage>161</fpage>&#x2013;<lpage>5</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICIP.2016.7532339</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>OpenMMLab</collab></person-group>. <article-title>Welcome to mmsegmentation&#x2019;s documentation!</article-title> <year>2022 Nov 04 [cited 2023 Jun 25]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://mmsegmentation.readthedocs.io/en/latest/">https://mmsegmentation.readthedocs.io/en/latest/</ext-link>.</mixed-citation></ref>
</ref-list>
</back></article>