<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">66842</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.066842</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>An Image Inpainting Approach Based on Parallel Dual-Branch Learnable Transformer Network</article-title>
<alt-title alt-title-type="left-running-head">An Image Inpainting Approach Based on Parallel Dual-Branch Learnable Transformer Network</alt-title>
<alt-title alt-title-type="right-running-head">An Image Inpainting Approach Based on Parallel Dual-Branch Learnable Transformer Network</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Gong</surname><given-names>Rongrong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="author-notes" rid="afn1">#</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Tingxian</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><xref ref-type="author-notes" rid="afn1">#</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Wei</surname><given-names>Yawen</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Dengyong</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Li</surname><given-names>Yan</given-names></name><xref ref-type="aff" rid="aff-3">3</xref><email>leeyeon@inha.ac.kr</email></contrib>
<aff id="aff-1"><label>1</label><institution>School of Software, Changsha Social Work College</institution>, <addr-line>Changsha, 410004</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Computer Science and Technology, Changsha University of Science and Technology</institution>, <addr-line>Changsha, 410076</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Department of Computer Engineering, INHA University</institution>, <addr-line>Incheon, 22201</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yan Li. Email: <email>leeyeon@inha.ac.kr</email></corresp>
<fn id="afn1">
<p><sup>#</sup>These authors contributed equally to this work</p>
</fn>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>29</day><month>08</month><year>2025</year>
</pub-date>
<volume>85</volume>
<issue>1</issue>
<fpage>1221</fpage>
<lpage>1234</lpage>
<history>
<date date-type="received">
<day>18</day>
<month>4</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>6</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_66842.pdf"></self-uri>
<abstract>
<p>Image inpainting refers to synthesizing missing content in an image from known information in order to restore occluded or damaged regions. As inpainting tasks grow more complex and data scales increase, existing deep learning methods still exhibit limitations: they lack the ability to capture long-range dependencies, and their handling of multi-scale image structures remains suboptimal. To address these problems, this paper proposes an image inpainting method based on a parallel dual-branch learnable Transformer network. The encoder of the generator consists of two parallel branches built from stacked CNN blocks and Transformer blocks, which extract local and global feature information from images, respectively. A dual-branch fusion module then combines the features obtained from the two branches. Additionally, a gated full-scale skip connection module is proposed to further enhance the coherence of the inpainting results and alleviate information loss. Finally, experimental results on three public datasets demonstrate the superior performance of the proposed method.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Artificial intelligence</kwd>
<kwd>image inpainting</kwd>
<kwd>transformer network</kwd>
<kwd>dual-branch fusion</kwd>
<kwd>gated full-scale skip connection</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Hunan Provincial Natural Science Foundation</funding-source>
<award-id>2023JJ60257</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Intelligent Rehabilitation Robotics</funding-source>
<award-id>2025SH501</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Inha University</funding-source>
<award-id>HX2024123</award-id>
</award-group></funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>With the rapid advancement of artificial intelligence (AI) technology, we are entering a new era driven by intelligent systems that are capable not only of understanding complex data patterns but also of adapting to changing environmental demands. The proliferation of AI-driven adaptive systems brings unprecedented capabilities for managing distributed computation, fostering innovation in a variety of fields. In this context, deep learning-based image restoration techniques are particularly prominent. These techniques reconstruct damaged images from the available information using deep learning models, especially Convolutional Neural Networks (CNNs), which not only improves restoration accuracy but also accelerates processing, making them an important tool for handling image data in distributed computing environments.</p>
<p>Deep learning-based methods for image inpainting have demonstrated substantial advancements in practical applications, largely due to their strengths in automatic learning, contextual modeling, advanced feature representation, and improved inpainting results. In our study, we focus on three key categories of deep learning techniques employed for image inpainting.</p>
<p>The first category encompasses CNN-based approaches, which have laid the groundwork for image inpainting with their ability to capture local patterns and features effectively. Notable examples of CNN-based methods include LeNet [<xref ref-type="bibr" rid="ref-1">1</xref>] and the Neocognitron [<xref ref-type="bibr" rid="ref-2">2</xref>], both of which have contributed significantly to the evolution of image processing techniques. The second category includes generative adversarial network (GAN)-based methods. GANs, further refined by Miyato et al., are designed to generate realistic image content through the interaction between the generator and the discriminator. This adversarial process enhances the quality and realism of the inpainted images. The third category features the Transformer model, originally proposed by Vaswani et al., which has been adapted for image inpainting tasks to leverage its strength in handling complex contextual information. By examining these three categories, our study aims to provide an in-depth analysis of advanced techniques in deep learning for image inpainting, highlighting their specific contributions and advancements in enhancing image inpainting quality.</p>
<p>CNN is suitable for image inpainting tasks. By utilizing shared weights and local connections, CNN reduces the model parameters and computational complexity, leading to significant achievements in various domains [<xref ref-type="bibr" rid="ref-3">3</xref>]. However, CNN has a relatively weak grasp of global information, which may result in a lack of contextual consistency in the inpainting results under certain circumstances.</p>
<p>Generative Adversarial Networks (GANs) are adept at producing high-quality restored images by leveraging adversarial training to improve the realism of inpainting results. Despite their advantages, GANs face challenges during training, including instability and the need for careful balance between the generator and discriminator. Issues such as non-convergence, mode collapse [<xref ref-type="bibr" rid="ref-4">4</xref>], and potential artifacts or blurriness in the generated images can also arise, affecting the final output quality.</p>
<p>The Transformer model has demonstrated excellent performance in image inpainting. Its self-attention mechanism models the global correlations in images. Furthermore, the interpretability of Transformers provides valuable information for model optimization, allowing us to understand how much attention the model pays at each position to every other position [<xref ref-type="bibr" rid="ref-5">5</xref>]. However, its computational complexity grows quadratically with the input sequence length.</p>
<p>Overall, traditional CNNs effectively capture local features but struggle with large missing regions, often causing structural and semantic inconsistencies. In contrast, Transformers model global dependencies well but lack fine-grained detail, leading to blurred textures. Most existing methods combine the two superficially, making it difficult to achieve a balanced synergy between local feature extraction and global semantic reasoning.</p>
<p>To address this, we propose the Parallel Dual-Branch Learnable Transformer Network (PDT-Net), which features a dual-encoder architecture combining CNNs and transformers for joint local-global feature extraction. A dual-branch decoder and feature fusion strategy further preserve fine textures and spatial coherence. Skip connections enhance low-level detail propagation throughout the network. PDT-Net represents a significant advancement in image inpainting technology, incorporating both convolutional and transformer-based techniques. We conducted a comprehensive series of experiments using three distinct datasets (Paris Street View, CelebA, and Places2) to thoroughly assess the network&#x2019;s performance [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>]. These datasets provide a broad evaluation of the network&#x2019;s capabilities across various types of image data. The results of these experiments, which highlight the network&#x2019;s impressive performance and advantages, are visually represented in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Example inpainting results produced by the PDT-Net model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-1.tif"/>
</fig>
<p>Our contributions include the following:
<list list-type="bullet">
<list-item>
<p>An innovative image inpainting model that combines CNN and Transformer is proposed: parallel adoption of CNN blocks and Transformer blocks for feature extraction via downsampling.</p></list-item>
<list-item>
<p>A dual-branch fusion module is proposed to integrate the globally and locally extracted features, thereby improving the quality of image inpainting.</p></list-item>
<list-item>
<p>Through extensive validation on three datasets, the proposed model outperformed existing image inpainting methods across the board.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Deep Learning Image Inpainting</title>
<p>Deep learning-based image inpainting methods offer notable advantages due to their sophisticated automatic learning and contextual modeling capabilities. Unlike traditional techniques that depend on predefined rules and basic algorithms, deep learning models use advanced neural networks to automatically learn complex patterns, textures, and features from extensive datasets. This capability allows the models to capture extensive visual information without manual intervention. Moreover, these models are proficient in contextual modeling, taking into account not just the nearby area but the entire image context. This comprehensive approach results in inpainting that integrates seamlessly with the surrounding content, ensuring high visual coherence. Consequently, deep learning methods achieve more accurate and realistic inpainting results for damaged images, even in cases of irregular damage and varied textures. Their enhanced ability to deliver visually satisfying results marks a significant improvement over traditional inpainting techniques.</p>
<p>Initially, Pathak et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] attempted to use CNN for image inpainting. Subsequently, in order to propel the progress of deep learning-based image inpainting, researchers have introduced a range of cutting-edge and inventive methodologies. For instance, Liu et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] proposed a probabilistic diversity Generative Adversarial Network called PD-GAN, which introduces random noise and probabilistic sampling to generate high-quality and diverse image inpainting results. Although the introduction of noise can increase diversity, it may sometimes result in inpainting outputs with noise or blurriness, which can impact the quality and authenticity of the inpainting results.</p>
<p>Furthermore, certain approaches integrate attention mechanisms and context modeling into image inpainting networks, elevating their ability to restore crucial regions and intricate details within the image. In conclusion, deep learning-based image inpainting techniques offer an efficient and precise approach to image restoration tasks through the utilization of automatic learning and contextual modeling features of neural networks, along with their ability to manage multi-scale image structures [<xref ref-type="bibr" rid="ref-11">11</xref>].</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Transformer</title>
<p>The Transformer was first introduced by Vaswani et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] in 2017 to address natural language processing (NLP) tasks. Building on the Transformer, Dosovitskiy et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] introduced the Vision Transformer (ViT) in 2020, successfully applying it to computer vision tasks. ViT segments images into a series of patches, treats each patch as an input sequence, and uses the Transformer model for feature extraction and classification.</p>
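As a minimal sketch of the patch-embedding step that ViT performs before feeding an image to the Transformer, the following NumPy snippet splits an image into non-overlapping patches and flattens each into a token vector. The helper name `image_to_patches` and the 16-pixel patch size are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), i.e. one
    token per patch, as in ViT's patch embedding (before the linear layer).
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch
    # reshape into a grid of patches, then flatten each patch
    grid = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * C)

img = np.zeros((224, 224, 3))            # a standard ViT input size
tokens = image_to_patches(img)
print(tokens.shape)                      # (196, 768): 14x14 patches of 16*16*3
```

Each of the 196 tokens would then be linearly projected and processed by the Transformer encoder, exactly as a word sequence would be in NLP.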
<p>With the successful application of Transformer, researchers have made further improvements and extensions to it. For example, Liu et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] have opened up new research directions and expanded the application scope of visual Transformer mechanisms by addressing the challenges of processing large-scale images and providing efficient computational methods. Yuan et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] have improved the performance of Transformer networks by adopting more efficient training strategies that leverage the structural information reconstruction of images.</p>
<p>These improvements have extended the Transformer mechanism to a broader spectrum of visual tasks and demonstrated its powerful potential for addressing image-related problems. However, despite the Transformer&#x2019;s advantage in capturing long-range dependencies and generating diverse structures, its self-attention significantly increases computational complexity, making it challenging to handle high-resolution images.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Model Combining CNN and Transformer</title>
<p>Recent studies have shown significant performance improvements in computer vision tasks by combining Convolutional Neural Networks (CNNs) and Transformers. CNNs excel at image feature extraction, while Transformers have advantages in handling long-range dependencies and capturing global relationships. Because Transformers excel at sequence modeling and CNNs at local feature extraction, the dual-branch encoder-decoder design applies to a wide variety of tasks.</p>
<p>For example, recent studies [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-20">20</xref>] have also explored methods for image inpainting, where Transformers are employed to reconstruct complex coherent structures and rough textures, while CNNs enhance local texture details guided by the rough restored images. The effective fusion of global contextual information and local features during the inpainting stage has greatly enhanced the overall results. Such a combination of CNNs and Transformers has achieved significant progress in image inpainting tasks.</p>
<p>Models that combine Convolutional Neural Networks (CNNs) and Transformers showcase exceptional abilities in both feature extraction and sequence modeling. CNNs are effective at capturing intricate spatial details and patterns in images. In contrast, Transformers are adept at modeling sequences and understanding long-range dependencies, which is essential for applications like natural language processing and time-series analysis. By integrating CNNs with Transformers, these hybrid models benefit from the strengths of both approaches: CNNs for detailed spatial feature extraction and Transformers for handling sequential data efficiently. This combination enhances versatility, allowing such models to excel in various deep learning tasks; for instance, hybrid attention mechanisms have shown superior capabilities in capturing both local and non-local dependencies of face images in face reconstruction [<xref ref-type="bibr" rid="ref-21">21</xref>]. Consequently, the fusion of CNN and Transformer architectures is becoming increasingly prevalent and valuable in advancing deep learning technologies.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Our Approach</title>
<sec id="s3_1">
<label>3.1</label>
<title>Network Architecture</title>
<p>The parallel dual-branch learnable Transformer network (PDT-Net) that we propose for image inpainting tasks is illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. In our network, the generator adopts an encoder-decoder architecture comprising two concurrent encoders, each employing a different feature extraction method and model architecture. The objective of this design is to maximize the benefits of the distinct encoders, enhancing both the expressive capability and the performance of the model.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Overall architecture of the proposed PDT-Net model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-2.tif"/>
</fig>
<p>The features extracted by the dual-branch encoders are denoted as <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>i</mml:mi></mml:math></inline-formula> represents the number of encoding layers, and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>. 
The outputs of these two encoders are fused through the dual-branch fusion module to integrate the feature representations from both branches, resulting in the fused feature of the encoding layers, denoted as <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. 
Then, by utilizing full-scale skip connections, we obtain rich fusion multi-scale feature information with global and local feature interactions, represented as the decoding layer feature map <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>D</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. Finally, the image details and textures are restored through the decoder.</p>
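The dimensions above follow one rule: at encoding layer <italic>i</italic>, the spatial resolution is divided by 2<sup>(i-1)</sup> while the channel count is multiplied by 2<sup>(i-1)</sup>. A small sketch makes the schedule concrete; the 256&#x00D7;256 input size and base channel count of 64 are illustrative assumptions, not values stated in the paper.

```python
def feature_shape(H, W, C_base, i):
    """Shape of the i-th encoding-layer feature map:
    (H / 2^(i-1), W / 2^(i-1), 2^(i-1) * C_base)."""
    s = 2 ** (i - 1)
    return H // s, W // s, C_base * s

# illustrative 256x256 input with an assumed base channel count C' = 64
for i in range(1, 6):                     # i = 1, ..., 5 encoding layers
    print(f"layer {i}: {feature_shape(256, 256, 64, i)}")
```

So the deepest layer (i = 5) would carry a 16&#x00D7;16 map with 1024 channels under these assumptions, trading resolution for representational width.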
<sec id="s3_1_1">
<label>3.1.1</label>
<title>CNN-Transformer Encoder</title>
<p>The dual-branch encoder architecture integrates two distinct branches to capture both local and global dependencies in data. The first branch utilizes Convolutional Neural Networks (CNNs) for encoding spatial or local features through convolutional operations. This approach excels in identifying patterns and structures within specific areas of the input data. Meanwhile, the second branch uses Transformer models to model extended dependencies and sequential relationships through self-attention mechanisms. Together, these two branches complement each other, allowing the model to effectively process and integrate both fine-grained local details and broader, global contexts. This dual-branch setup enhances the overall capability of the encoder to handle diverse types of information.</p>
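The parallel flow described above can be sketched as two stand-in branches that consume the same feature map and produce outputs of matching shape, ready for fusion. Both block bodies here are deliberately crude placeholders (average pooling for the CNN branch, global-mean mixing for the Transformer branch) chosen only to show the parallel topology and the shared downsampling schedule, not the paper's actual layers.

```python
import numpy as np

def cnn_block(x):
    """Local branch stand-in: 2x2 average-pool downsampling, with channels
    doubled by stacking a shifted copy (illustrative only)."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return np.concatenate([pooled, np.roll(pooled, 1, axis=1)], axis=0)

def transformer_block(x):
    """Global branch stand-in: blend each position with the global mean
    (a caricature of attention), then match the CNN branch's schedule."""
    c, h, w = x.shape
    g = x.mean(axis=(1, 2), keepdims=True)          # global context
    mixed = 0.5 * x + 0.5 * g
    pooled = mixed.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return np.concatenate([pooled, pooled], axis=0)

x = np.random.default_rng(0).standard_normal((4, 64, 64))  # input feature map
c_feat, t_feat = cnn_block(x), transformer_block(x)
print(c_feat.shape, t_feat.shape)        # both (8, 32, 32), ready to fuse
```

The key structural point is that both branches halve the spatial size and double the channels in lockstep, so their outputs can be merged layer by layer.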
<p>The CNN branch encoder extracts local features and texture information from the input data. The convolutional layers capture local features at different positions. This feature extraction approach enables the CNN branch encoder to effectively capture spatial locality in images and exhibit certain robustness to image translation and scale variations.</p>
<p>The main feature of the Transformer encoder is its ability to capture relationships between different positions in a sequence. We introduce a learnable Transformer module to further enhance the feature representation and context modeling capabilities during the inpainting process. In the self-attention mechanism, linear transformations are used to compute attention weights and context representations, replacing the traditional dot product operation with matrix multiplication [<xref ref-type="bibr" rid="ref-12">12</xref>]. The introduction of more parameters and non-linear transformations can enhance the expressive power of the model [<xref ref-type="bibr" rid="ref-22">22</xref>]. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates the execution process of a Transformer module.</p>
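For orientation, the self-attention computation that the learnable module builds on can be sketched as below: learnable projection matrices produce queries, keys, and values, and a softmax over scaled similarity scores yields the attention weights. This is the standard formulation from Vaswani et al.; the weight scales and dimensions are illustrative assumptions, and the paper's learnable linear variant would replace parts of this computation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence X of shape (N, d).

    Wq, Wk, Wv are learnable (d, d) linear transformations; every token
    attends to every other token via the (N, N) weight matrix A."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))      # attention weights
    return A @ V                                     # context representations

rng = np.random.default_rng(0)
N, d = 196, 64                                       # assumed token count / width
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                     # (196, 64)
```

The (N, N) weight matrix is also what makes the cost quadratic in the number of tokens, which motivates the efficiency-oriented variants discussed in Section 2.2.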
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Linear transformer module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-3.tif"/>
</fig>
</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Dual-Branch Fusion Module</title>
<p>The detailed structure of the dual-branch fusion module is illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. This module plays a crucial role in combining the outputs from the CNN branch encoder and the Transformer branch encoder. By merging their extracted features into a unified set of fusion features with consistent dimensions, the dual-branch fusion module facilitates effective interaction between the two branches during the inpainting process. The ability of the module to harmonize these different types of information contributes to improved image fidelity and overall quality in the inpainting process.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Dual-branch fusion module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-4.tif"/>
</fig>
<p>Specifically, we obtain the fused feature <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> of the encoding layers through the following operations, as shown in <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref> and <xref ref-type="disp-formula" rid="eqn-2">(2)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>{i}</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mi>G</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Among them, <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> represents the intermediate feature obtained by simple channel concatenation. <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>S</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents spatial attention, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>C</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents channel attention, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> represents channel concatenation, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents residual block, and <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>G</mml:mi><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents gate mechanism.</p>
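The fusion of Eqs. (1) and (2) can be sketched in PyTorch as follows. This is a minimal sketch: the specific layer configurations of the spatial attention, channel attention, residual block, and gate (`spatial_att`, `channel_att`, `res`, `gate`) are hypothetical stand-ins, since the exact internals are not fixed by the formulas themselves.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Sketch of Eqs. (1)-(2): fuse Transformer (T) and CNN (C) encoder features."""
    def __init__(self, channels):
        super().__init__()
        # Hypothetical SA(.): a 7x7 conv producing a spatial weight map.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())
        # Hypothetical CA(.): squeeze-and-excite style channel weights.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # Res(.): reduce the 3C-channel concatenation back to C channels.
        self.res = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # GA(.): a simple learned sigmoid gate.
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, t_feat, c_feat):
        sa = self.spatial_att(t_feat) * t_feat           # SA(T)
        ca = self.channel_att(c_feat) * c_feat           # CA(C)
        f = torch.cat([t_feat * c_feat, sa, ca], dim=1)  # [(T ⊙ C), SA(T), CA(C)], Eq. (1)
        r = self.res(f)                                  # Res(f)
        return self.gate(r) * r                          # GA(Res(f)), Eq. (2)
```

Given two same-shape feature maps from the Transformer and CNN branches, the module returns a fused feature of the same shape.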
</sec>
<sec id="s3_1_3">
<label>3.1.3</label>
<title>Full Scale Skip Connection</title>
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows the detailed structure diagram of the full-scale skip connection. We incorporate a gated attention mechanism at the end of the traditional full-scale skip connection. The full-scale skip connection with a gated attention mechanism achieves feature transmission and fusion across different levels. It selectively combines the source features and target features through weighted fusion, transferring low-level detail information to higher levels while preserving high-level semantic information.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Full-scale skip connection</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-5.tif"/>
</fig>
<p>The feature map stack for the decoding layers, represented by <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mi>D</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula>, is calculated as shown in <xref ref-type="disp-formula" rid="eqn-3">Eqs. (3)&#x2013;</xref><xref ref-type="disp-formula" rid="eqn-6">(6)</xref>:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>D</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>U</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>D</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>D</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mi>N</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Among them, <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>c</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>c</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> represent the simply processed features at a larger scale, the same scale, and a smaller scale than the current level, respectively, which are used to fuse full-scale features. The function <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the convolution operation, and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the feature fusion mechanism realized through operations such as spatial attention, channel attention, and convolution. Furthermore, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>D</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>U</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represent the downsampling and upsampling operations, respectively.</p>
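A minimal sketch of Eqs. (3)–(6), assuming every level shares one channel width; the shared 1&#x00D7;1 convolution `conv`, the fusion `att`, and the final `fuse` convolution are hypothetical stand-ins for the per-source processing described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def full_scale_skip(enc_feats, dec_feats, i, conv, att, fuse):
    """Sketch of Eqs. (3)-(6) for decoder level i (1-indexed; level 1 is the
    largest scale, level N the smallest). conv, att, and fuse stand in for the
    per-source convolutions, the attention fusion Att(.), and the final Conv
    of Eq. (6)."""
    N = len(enc_feats)
    if i == N:                                   # F_De^N = F_En^N
        return enc_feats[i - 1]
    h, w = enc_feats[i - 1].shape[-2:]
    # c1: shallower encoder features, downsampled to the current size, Eq. (3)
    c1 = [att(conv(F.adaptive_max_pool2d(enc_feats[k], (h, w))))
          for k in range(i - 1)]
    # c2: the same-scale encoder feature, Eq. (4)
    c2 = [att(conv(enc_feats[i - 1]))]
    # c3: deeper decoder features, upsampled to the current size, Eq. (5)
    c3 = [att(conv(F.interpolate(dec_feats[k], size=(h, w))))
          for k in range(i, N)]
    return fuse(torch.cat(c1 + c2 + c3, dim=1))  # Conv(Concat(c1, c2, c3)), Eq. (6)
```

Because every level contributes exactly one source, the concatenation always stacks N feature maps, so `fuse` reduces N&#x00D7;C channels back to C.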
</sec>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Loss Function</title>
<p>We optimize our PDT-Net using a joint loss function <italic>L</italic>, which includes multiple components: the reconstruction loss <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, the adversarial loss <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-23">23</xref>], the perceptual loss <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-24">24</xref>], and the style loss <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-25">25</xref>]. These loss functions are commonly employed in various image inpainting methods [<xref ref-type="bibr" rid="ref-18">18</xref>] and are defined as shown in <xref ref-type="disp-formula" rid="eqn-7">Eqs. (7)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-11">(11)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mi>D</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>D</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:munder><mml:mrow><mml:mi mathvariant="normal">&#x03A3;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" 
display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow></mml:math></disp-formula></p>
<p>where <italic>D</italic> represents the PatchGAN discriminator [<xref ref-type="bibr" rid="ref-26">26</xref>] with spectral normalization. <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> refers to the <italic>i</italic>-th layer activation function of the VGG19 [<xref ref-type="bibr" rid="ref-27">27</xref>] network pre-trained on the ImageNet [<xref ref-type="bibr" rid="ref-28">28</xref>] dataset. <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represents the number of elements in <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the weight ratios of the corresponding 
loss functions. We set <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> &#x003D; 1, <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> &#x003D; 0.1, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> &#x003D; 1, and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mrow><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> &#x003D; 250.</p>
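With the stated weights, the joint objective of Eq. (11) can be sketched as below. The VGG19 activations &#x03D5;<sub>i</sub> and the PatchGAN discriminator outputs are assumed to be computed elsewhere and passed in, and the mean-based norms are a simplification of the normalized L1 norms in Eqs. (7), (9), and (10).

```python
import torch

def gram(feat):
    """Gram matrix of a feature map, used by the style loss in Eq. (10)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def joint_loss(i_out, i_gt, d_fake, d_real, feats_out, feats_gt,
               w_re=1.0, w_adv=0.1, w_p=1.0, w_s=250.0):
    """Sketch of Eqs. (7)-(11). feats_out / feats_gt are lists of VGG19
    activations phi_i; d_fake / d_real are discriminator outputs in (0, 1)."""
    l_re = (i_out - i_gt).abs().mean()                                  # Eq. (7)
    l_adv = (torch.log(d_real) + torch.log(1 - d_fake)).mean()          # Eq. (8)
    l_p = sum((fo - fg).abs().mean()                                    # Eq. (9)
              for fo, fg in zip(feats_out, feats_gt))
    l_s = sum((gram(fo) - gram(fg)).abs().mean()                        # Eq. (10)
              for fo, fg in zip(feats_out, feats_gt))
    return w_re * l_re + w_adv * l_adv + w_p * l_p + w_s * l_s          # Eq. (11)
```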
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>As shown in <xref ref-type="table" rid="table-1">Table 1</xref>, the original image data come from three public datasets. Our experiments fully utilized the training and testing sets of the Paris street view dataset, which comprises a large number of images collected from the streets, and strictly followed its original split configuration. For the CelebA and Places2 datasets, we divided the training, validation, and testing sets in an 8:1:1 ratio. The CelebA dataset primarily focuses on human faces and consists of 202,599 face images; we selected 50,000 of them for training and 6250 images from the testing set for evaluation, with images taken sequentially from the beginning of the dataset. The Places2 dataset covers various scenes and environments and contains millions of images. We selected twenty scene categories from it, including attics, airports, arches, and campuses; each category contributes 5000 training images, for a total of 100,000 training images. For evaluation, we selected 12,500 images from the testing set, with 5000 images selected from each relevant subset.</p>
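The 8:1:1 partition described above can be sketched as follows; the shuffle-based selection and seed here are assumptions for illustration, not the paper's exact selection procedure.

```python
import random

def split_811(n_items, seed=0):
    """Split item indices into train/val/test sets with an 8:1:1 ratio.
    A sketch: the shuffle and seed are illustrative assumptions."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_train = int(n_items * 0.8)
    n_val = int(n_items * 0.1)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```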
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Dataset details for image analysis experiments</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset name</th>
<th>Paris street view</th>
<th>CelebA</th>
<th>Places2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Main content</td>
<td>A vast collection of images gathered from the streets</td>
<td>Primarily focuses on human faces, 202,599 in total</td>
<td>Encompasses various scenes and environments</td>
</tr>
<tr>
<td>Training set size</td>
<td>14,900</td>
<td>50,000 (Randomly Selected)</td>
<td>100,000 (Randomly Selected)</td>
</tr>
<tr>
<td>Test set size</td>
<td>100</td>
<td>6250 (Randomly Selected)</td>
<td>12,500 (Randomly Selected)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Implementation Details</title>
<p>The experiments were conducted on a system running Ubuntu 18.04, equipped with a single NVIDIA A30 Tensor Core GPU with a memory capacity of 24 GB. We implemented the experiments in the PyTorch framework, which is well-suited to deep learning tasks due to its flexibility and efficiency. To optimize the model parameters, we employed the Adam optimizer [<xref ref-type="bibr" rid="ref-29">29</xref>], known for its effectiveness in handling large datasets and complex neural network structures. The Adam optimizer with <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.9</mml:mn></mml:math></inline-formula> was used to train the model; the learning rate was initially set to <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msup><mml:mrow><mml:mn>10</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and later adjusted to <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msup><mml:mrow><mml:mn>10</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> to fine-tune the model. This choice of hardware and software was aimed at ensuring robust performance and accurate results, with the NVIDIA A30 GPU providing the necessary computational power and the PyTorch framework facilitating smooth implementation and experimentation.</p>
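The optimizer configuration above corresponds roughly to the following PyTorch setup; the one-layer `model` is a placeholder for PDT-Net, and the point at which training switches to the fine-tuning rate is not specified here.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for PDT-Net's generator.
model = nn.Conv2d(3, 3, 3, padding=1)

# Adam with the stated betas and the initial training learning rate 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))

def switch_to_finetune(optimizer, lr=1e-5):
    """Lower the learning rate to 1e-5 for the fine-tuning phase."""
    for group in optimizer.param_groups:
        group["lr"] = lr
```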
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparative Experiment</title>
<p>To offer a clearer evaluation of our model&#x2019;s performance in image inpainting, we compared it against five leading image inpainting models across three different datasets. We utilized three well-established metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [<xref ref-type="bibr" rid="ref-30">30</xref>], and the <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> loss function. Higher PSNR and SSIM values indicate better reconstruction quality and visual similarity, while lower <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mrow><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> loss values suggest better pixel-level accuracy. We ensured consistency in our experimental setup by using the same hardware and software environment for all models, and we maintained a uniform mask coverage rate throughout the experiments. This approach allowed for a fair and comprehensive assessment of our model&#x2019;s performance relative to its peers.
<list list-type="bullet">
<list-item>
<p>PC [<xref ref-type="bibr" rid="ref-31">31</xref>]: Introduces a method for repairing irregular holes in images using partial convolution techniques.</p></list-item>
<list-item>
<p>RFR [<xref ref-type="bibr" rid="ref-32">32</xref>]: Introduces a progressive image inpainting network based on recurrent feature reasoning.</p></list-item>
<list-item>
<p>AOT [<xref ref-type="bibr" rid="ref-33">33</xref>]: Enables context reasoning by capturing information-rich remote contexts and diverse patterns of interest.</p></list-item>
<list-item>
<p>CTSDG [<xref ref-type="bibr" rid="ref-34">34</xref>]: A new dual-branch network for image inpainting that seamlessly combines structurally constrained texture synthesis with texture-guided structural reconstruction.</p></list-item>
<list-item>
<p>T-former [<xref ref-type="bibr" rid="ref-22">22</xref>]: Proposes a Transformer network for image inpainting with an attention mechanism whose cost scales linearly with image resolution.</p></list-item>
</list></p>
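For reference, the PSNR and <inline-formula><mml:math><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> metrics used above can be sketched as follows; SSIM, which involves local window statistics, is omitted for brevity, and images are assumed to be normalized to [0, 1].

```python
import torch

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images in [0, max_val]."""
    mse = torch.mean((img_a - img_b) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

def l1_percent(img_a, img_b):
    """Mean absolute error, reported as a percentage as in Table 2."""
    return 100 * torch.mean(torch.abs(img_a - img_b))
```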
<p>The comparative experimental results presented in <xref ref-type="table" rid="table-2">Table 2</xref> clearly indicate that our model surpasses all baseline models across the three evaluation metrics. This performance improvement is further reflected in the enhanced visual coherence of the inpainted images, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. Additionally, <xref ref-type="fig" rid="fig-6">Fig. 6</xref> provides a detailed visual comparison: it features the original corrupted image, the results from PC, RFR, AOT, CTSDG, T-former, our PDT-Net, and the ground truth image. This sequence of images highlights the improved fidelity and visual quality achieved by PDT-Net compared to the other methods.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Quantitative comparison of our PDT-Net model with baseline methods on three publicly available datasets</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th colspan="2">Datasets</th>
<th colspan="3">Paris street view</th>
<th colspan="3">CelebA</th>
<th colspan="3">Places2</th>
</tr>
<tr>
<th colspan="2">Mask ratio</th>
<th>0%&#x2013;20%</th>
<th>20%&#x2013;40%</th>
<th>40%&#x2013;60%</th>
<th>0%&#x2013;20%</th>
<th>20%&#x2013;40%</th>
<th>40%&#x2013;60%</th>
<th>0%&#x2013;20%</th>
<th>20%&#x2013;40%</th>
<th>40%&#x2013;60%</th>
</tr>
</thead>
<tbody>
<tr>
<td>PSNR<inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></td>
<td>PC</td>
<td>32.12</td>
<td>25.56</td>
<td>21.26</td>
<td>31.85</td>
<td>26.55</td>
<td>21.34</td>
<td>29.39</td>
<td>24.26</td>
<td>21.88</td>
</tr>
<tr>
<td></td>
<td>RFR</td>
<td>32.67</td>
<td>26.31</td>
<td>22.41</td>
<td>33.33</td>
<td>27.62</td>
<td>22.63</td>
<td>29.90</td>
<td>24.96</td>
<td>22.14</td>
</tr>
<tr>
<td></td>
<td>AOT</td>
<td>32.80</td>
<td>26.36</td>
<td>22.66</td>
<td>33.58</td>
<td>27.73</td>
<td>22.80</td>
<td>29.96</td>
<td>25.05</td>
<td>22.29</td>
</tr>
<tr>
<td></td>
<td>CTSDG</td>
<td>32.95</td>
<td>27.51</td>
<td>22.89</td>
<td>33.97</td>
<td>27.81</td>
<td>22.94</td>
<td>30.18</td>
<td>25.52</td>
<td>22.58</td>
</tr>
<tr>
<td></td>
<td>T-former</td>
<td>33.12</td>
<td>27.83</td>
<td>22.97</td>
<td>34.10</td>
<td>27.94</td>
<td>23.08</td>
<td>30.36</td>
<td>25.74</td>
<td>22.87</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>33.35</td>
<td>27.95</td>
<td>23.16</td>
<td>34.27</td>
<td>28.06</td>
<td>23.22</td>
<td>30.45</td>
<td>25.99</td>
<td>23.11</td>
</tr>
<tr>
<td>SSIM<inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></td>
<td>PC</td>
<td>0.897</td>
<td>0.690</td>
<td>0.500</td>
<td>0.897</td>
<td>0.749</td>
<td>0.557</td>
<td>0.887</td>
<td>0.731</td>
<td>0.527</td>
</tr>
<tr>
<td></td>
<td>RFR</td>
<td>0.920</td>
<td>0.773</td>
<td>0.569</td>
<td>0.917</td>
<td>0.781</td>
<td>0.602</td>
<td>0.899</td>
<td>0.751</td>
<td>0.554</td>
</tr>
<tr>
<td></td>
<td>AOT</td>
<td>0.923</td>
<td>0.776</td>
<td>0.572</td>
<td>0.919</td>
<td>0.783</td>
<td>0.603</td>
<td>0.901</td>
<td>0.757</td>
<td>0.559</td>
</tr>
<tr>
<td></td>
<td>CTSDG</td>
<td>0.924</td>
<td>0.777</td>
<td>0.574</td>
<td>0.921</td>
<td>0.787</td>
<td>0.610</td>
<td>0.905</td>
<td>0.760</td>
<td>0.566</td>
</tr>
<tr>
<td></td>
<td>T-former</td>
<td>0.926</td>
<td>0.779</td>
<td>0.578</td>
<td>0.923</td>
<td>0.788</td>
<td>0.614</td>
<td>0.907</td>
<td>0.769</td>
<td>0.569</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>0.927</td>
<td>0.781</td>
<td>0.579</td>
<td>0.926</td>
<td>0.792</td>
<td>0.616</td>
<td>0.911</td>
<td>0.774</td>
<td>0.570</td>
</tr>
<tr>
<td><inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>(%)<inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></td>
<td>PC</td>
<td>0.057</td>
<td>0.133</td>
<td>0.272</td>
<td>0.044</td>
<td>0.120</td>
<td>0.220</td>
<td>0.064</td>
<td>0.132</td>
<td>0.281</td>
</tr>
<tr>
<td></td>
<td>RFR</td>
<td>0.041</td>
<td>0.113</td>
<td>0.233</td>
<td>0.031</td>
<td>0.090</td>
<td>0.185</td>
<td>0.049</td>
<td>0.100</td>
<td>0.238</td>
</tr>
<tr>
<td></td>
<td>AOT</td>
<td>0.041</td>
<td>0.112</td>
<td>0.231</td>
<td>0.030</td>
<td>0.089</td>
<td>0.183</td>
<td>0.047</td>
<td>0.098</td>
<td>0.237</td>
</tr>
<tr>
<td></td>
<td>CTSDG</td>
<td>0.038</td>
<td>0.105</td>
<td>0.228</td>
<td>0.028</td>
<td>0.080</td>
<td>0.178</td>
<td>0.041</td>
<td>0.094</td>
<td>0.226</td>
</tr>
<tr>
<td></td>
<td>T-former</td>
<td>0.035</td>
<td>0.103</td>
<td>0.224</td>
<td>0.025</td>
<td>0.079</td>
<td>0.175</td>
<td>0.038</td>
<td>0.091</td>
<td>0.223</td>
</tr>
<tr>
<td></td>
<td>Ours</td>
<td>0.034</td>
<td>0.103</td>
<td>0.223</td>
<td>0.025</td>
<td>0.078</td>
<td>0.173</td>
<td>0.037</td>
<td>0.090</td>
<td>0.223</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Inpainting results of the proposed PDT-Net model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-6.tif"/>
</fig>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Study</title>
<p>We performed a series of ablation experiments on the Paris street view dataset, removing each key component of the proposed method in turn and re-evaluating the inpainting results. <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> removes the linear attention module from the Transformer encoder, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> replaces the dual-branch fusion module with regular convolutions, and <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> replaces the full-scale skip connections with regular skip connections. The results in <xref ref-type="table" rid="table-3">Table 3</xref> confirm that each component is essential: removing any of them noticeably degrades inpainting quality. The full network, as shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, restores fine texture details best and produces the most coherent reconstructed structures.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Quantitative ablation study of our PDT-Net model on the Paris street view dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Models</th>
<th align="center">Full-scale skip connections</th>
<th align="center">Dual-branch fusion module</th>
<th align="center">Linear Transformer</th>
<th>PSNR<inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>SSIM<inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi mathvariant="bold-italic">&#x2113;</mml:mi><mml:mrow><mml:mrow><mml:mn mathvariant="bold">1</mml:mn></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>(%)<inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula></td>
<td></td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>28.73</td>
<td>0.878</td>
<td>0.162</td>
</tr>
<tr>
<td><inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula></td>
<td>&#x2713;</td>
<td></td>
<td>&#x2713;</td>
<td>29.54</td>
<td>0.879</td>
<td>0.162</td>
</tr>
<tr>
<td><inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula></td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td>29.58</td>
<td>0.881</td>
<td>0.158</td>
</tr>
<tr>
<td>Ours</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>29.60</td>
<td>0.882</td>
<td>0.156</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Ablation experiment results of our proposed PDT-Net model on the Paris street view dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66842-fig-7.tif"/>
</fig>
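The linear attention module ablated above can be illustrated in isolation. The following is a minimal NumPy sketch of generic linearized attention, assuming the common feature map phi(x) = elu(x) + 1 from the linearized-attention literature; the exact formulation used in PDT-Net may differ.

```python
# Hedged sketch of linear attention (the module removed in the Net_1 ablation).
# Assumes phi(x) = elu(x) + 1; illustrative only, not PDT-Net's exact design.
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1, which keeps features strictly positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N*d^2) attention: phi(Q) @ (phi(K)^T @ V), row-normalized.

    Avoids forming the N x N attention matrix, which is what makes
    Transformer encoders tractable at image resolution.
    """
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)        # (N, d) each
    kv = Kp.T @ V                                    # (d, d) summary, built once
    z = Qp @ Kp.sum(axis=0, keepdims=True).T         # (N, 1) normalizer
    return (Qp @ kv) / z

rng = np.random.default_rng(0)
N, d = 64, 16                                        # tokens, head dimension
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (64, 16)
```

By associativity, the result matches the explicit phi(Q) phi(K)^T formulation while the cost drops from quadratic to linear in the token count N.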
<p>Through extensive comparative and ablation studies, we have confirmed the advantages and effectiveness of the proposed approach for image inpainting. Across all experiments, our method consistently delivers higher-quality results than existing methods, and the full model outperforms every variant with a key component removed. The proposed PDT-Net has approximately 52.1 million parameters and requires 172.1 GFLOPs, a reasonable computational cost given its dual-branch architecture.</p>
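Totals such as the parameter and GFLOP figures above are typically accumulated layer by layer. A minimal sketch of the standard accounting for a single convolutional layer, using illustrative sizes rather than PDT-Net's actual configuration:

```python
# Rough parameter/FLOP accounting for one conv layer, of the kind summed
# over a network to produce totals like 52.1M params / 172.1 GFLOPs.
# The sizes below are illustrative, not PDT-Net's actual configuration.
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    # weights: c_out filters of shape (c_in, k, k), plus optional biases
    params = c_out * (c_in * k * k + (1 if bias else 0))
    # 2 FLOPs (multiply + add) per kernel element per output position
    flops = 2 * c_out * c_in * k * k * h_out * w_out
    return params, flops

p, f = conv2d_cost(c_in=64, c_out=128, k=3, h_out=256, w_out=256)
print(f"{p / 1e6:.3f} M params, {f / 1e9:.2f} GFLOPs")
```

Conventions vary (some reports count multiply-accumulates rather than FLOPs, halving the figure), so such numbers are comparable only under a stated convention.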
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In the field of image inpainting, we present an innovative parallel dual-branch learnable Transformer network. This advanced network architecture incorporates a CNN-Transformer dual encoder, which excels at extracting features from both local and global perspectives. By integrating these two types of encoders, our approach significantly enhances the effectiveness and precision of the inpainting process.</p>
<p>Furthermore, we introduce a sophisticated dual-branch fusion module along with a comprehensive skip connection mechanism. These components work together to propagate more detailed information throughout the network while preserving the structural integrity of the image. As a result, our method produces inpainting outcomes that are both more natural and realistic, demonstrating improved performance and visual coherence compared to existing approaches.</p>
<p>Our empirical analysis confirms that the method not only advances image inpainting capability but also preserves the visual consistency of the inpainted images. The results exhibit greater naturalness and realism, offering useful insights for ongoing research in image inpainting and related domains. The framework also shows potential for practical applications such as photo restoration and object removal.</p>
<p>In future work, we plan to explore real-time performance, extend the method to high-resolution scenarios, and investigate how to improve the anti-forensic robustness of inpainted images to prevent easy detection.</p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This paper was supported by Scientific Research Fund of Hunan Provincial Natural Science Foundation under Grant 2023JJ60257, Hunan Provincial Engineering Research Center for Intelligent Rehabilitation Robotics and Assistive Equipment under Grant 2025SH501, Inha University and Design of a Conflict Detection and Validation Tool under Grant HX2024123.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization and methodology: Rongrong Gong, Tingxian Zhang; data curation and investigation: Yawen Wei; writing&#x2014;original draft preparation: Rongrong Gong, Tingxian Zhang, Yawen Wei; funding acquisition: Rongrong Gong; writing&#x2014;review and editing: Dengyong Zhang; resources and supervision: Yan Li. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>All data included in this study are available from the corresponding author upon request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>LeCun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bottou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Bengio</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Haffner</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Gradient-based learning applied to document recognition</article-title>. <source>Proc IEEE</source>. <year>1998</year>;<volume>86</volume>(<issue>11</issue>):<fpage>2278</fpage>&#x2013;<lpage>324</lpage>. doi:<pub-id pub-id-type="doi">10.1109/5.726791</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fukushima</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>. <source>Biol Cybern</source>. <year>1980</year>;<volume>36</volume>(<issue>4</issue>):<fpage>193</fpage>&#x2013;<lpage>202</lpage>. doi:<pub-id pub-id-type="doi">10.1007/bf00344251</pub-id>; <pub-id pub-id-type="pmid">7370364</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Quan</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>DM</given-names></string-name></person-group>. <article-title>Image inpainting with local and global refinement</article-title>. <source>IEEE Trans Image Process</source>. <year>2022</year>;<volume>31</volume>:<fpage>2405</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tip.2022.3152624</pub-id>; <pub-id pub-id-type="pmid">35259102</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>J</given-names></string-name></person-group>. <article-title>DCGAN-based data augmentation for tomato leaf disease identification</article-title>. <source>IEEE Access</source>. <year>2020</year>;<volume>8</volume>:<fpage>98716</fpage>&#x2013;<lpage>28</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2020.2997001</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Reduce information loss in transformers for pluralistic image inpainting</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2022 Jun 18&#x2013;24</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>11347</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Doersch</surname> <given-names>C</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gupta</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sivic</surname> <given-names>J</given-names></string-name>, <string-name><surname>Efros</surname> <given-names>A</given-names></string-name></person-group>. <article-title>What makes paris look like paris?</article-title> <source>ACM Trans Graph</source>. <year>2012</year>;<volume>31</volume>(<issue>4</issue>):<fpage>101</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1145/2185520.2185597</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Deep learning face attributes in the wild</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>; <year>2015 Dec 7&#x2013;13</year>; <publisher-loc>Santiago, Chile</publisher-loc>. p. <fpage>3730</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>B</given-names></string-name>, <string-name><surname>Lapedriza</surname> <given-names>A</given-names></string-name>, <string-name><surname>Khosla</surname> <given-names>A</given-names></string-name>, <string-name><surname>Oliva</surname> <given-names>A</given-names></string-name>, <string-name><surname>Torralba</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Places: a 10 million image database for scene recognition</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2017</year>;<volume>40</volume>(<issue>6</issue>):<fpage>1452</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2017.2723009</pub-id>; <pub-id pub-id-type="pmid">28692961</pub-id></mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pathak</surname> <given-names>D</given-names></string-name>, <string-name><surname>Krahenbuhl</surname> <given-names>P</given-names></string-name>, <string-name><surname>Donahue</surname> <given-names>J</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name>, <string-name><surname>Efros</surname> <given-names>AA</given-names></string-name></person-group>. <article-title>Context encoders: feature learning by inpainting</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and PATTERN Recognition</conf-name>; <year>2016 Jun 27&#x2013;30</year>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>2536</fpage>&#x2013;<lpage>44</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>X</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>J</given-names></string-name></person-group>. <article-title>PD-GAN: probabilistic diverse GAN for image inpainting</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2021 Jun 20&#x2013;25</year>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>9371</fpage>&#x2013;<lpage>81</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>TS</given-names></string-name></person-group>. <article-title>Free-form image inpainting with gated convolution</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2019 Oct 27&#x2013;Nov 2</year>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>4471</fpage>&#x2013;<lpage>80</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention is all you need</article-title>. <comment>arXiv: 1706.03762</comment>. <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Beyer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kolesnikov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Weissenborn</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Unterthiner</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>An image is worth 16 &#x00D7; 16 words: transformers for image recognition at scale</article-title>. <comment>arXiv:2010.11929. 2020</comment>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021 Oct 11&#x2013;17</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>10012</fpage>&#x2013;<lpage>22</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yuan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>ZH</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Tokens-to-Token ViT: training vision transformers from scratch on ImageNet</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021 Oct 11&#x2013;17</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>558</fpage>&#x2013;<lpage>67</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>D</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>J</given-names></string-name></person-group>. <article-title>High-fidelity pluralistic image completion with transformers</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021 Oct 11&#x2013;17</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>4692</fpage>&#x2013;<lpage>701</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Cham</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Phung</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Bridging global context interactions for high-fidelity image completion</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2022 Jun 18&#x2013;24</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>11512</fpage>&#x2013;<lpage>22</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Incremental transformer structure enhanced image inpainting with masking positional encoding</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2022 Jun 18&#x2013;24</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>11358</fpage>&#x2013;<lpage>68</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hui</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Sparse self-attention transformer for image inpainting</article-title>. <source>Pattern Recognit</source>. <year>2024</year>;<volume>145</volume>(<issue>3</issue>):<fpage>109897</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109897</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>W</given-names></string-name></person-group>. <article-title>MB-TaylorFormer V2: improved multi-branch linear transformer expanded by taylor formula for image restoration</article-title>. <comment>arXiv:2501.04486. 2025</comment>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>G</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution</article-title>. <source>IEEE Trans Multimed</source>. <year>2023</year>;<volume>26</volume>:<fpage>2608</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tmm.2023.3301225</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hui</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>An efficient transformer for image inpainting</article-title>. In: <conf-name>Proceedings of the 30th ACM International Conference on Multimedia</conf-name>; <year>2022 Oct 10&#x2013;14</year>; <publisher-loc>Lisboa, Portugal</publisher-loc>. p. <fpage>6559</fpage>&#x2013;<lpage>68</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Creswell</surname> <given-names>A</given-names></string-name>, <string-name><surname>White</surname> <given-names>T</given-names></string-name>, <string-name><surname>Dumoulin</surname> <given-names>V</given-names></string-name>, <string-name><surname>Arulkumaran</surname> <given-names>K</given-names></string-name>, <string-name><surname>Sengupta</surname> <given-names>B</given-names></string-name>, <string-name><surname>Bharath</surname> <given-names>AA</given-names></string-name></person-group>. <article-title>Generative adversarial networks: an overview</article-title>. <source>IEEE Signal Process Mag</source>. <year>2018</year>;<volume>35</volume>(<issue>1</issue>):<fpage>53</fpage>&#x2013;<lpage>65</lpage>. doi:<pub-id pub-id-type="doi">10.1109/msp.2017.2765202</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Johnson</surname> <given-names>J</given-names></string-name>, <string-name><surname>Alahi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Fei-Fei</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Perceptual losses for real-time style transfer and super-resolution</article-title>. In: <conf-name>Computer Vision-ECCV 2016: 14th European Conference</conf-name>; <year>2016 Oct 11&#x2013;14</year>; <publisher-loc>Amsterdam, The Netherlands</publisher-loc>. p. <fpage>694</fpage>&#x2013;<lpage>711</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gatys</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Ecker</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Bethge</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Image style transfer using convolutional neural networks</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2016 Jun 27&#x2013;30</year>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>2414</fpage>&#x2013;<lpage>23</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Park</surname> <given-names>T</given-names></string-name>, <string-name><surname>Isola</surname> <given-names>P</given-names></string-name>, <string-name><surname>Efros</surname> <given-names>AA</given-names></string-name></person-group>. <article-title>Unpaired image-to-image translation using cycle-consistent adversarial networks</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>; <year>2017 Oct 22&#x2013;29</year>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>2223</fpage>&#x2013;<lpage>32</lpage>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Simonyan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zisserman</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <comment>arXiv:1409.1556. 2014</comment>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Socher</surname> <given-names>R</given-names></string-name>, <string-name><surname>Li</surname> <given-names>LJ</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name>, <string-name><surname>Li</surname> <given-names>FF</given-names></string-name></person-group>. <article-title>ImageNet: a large-scale hierarchical image database</article-title>. In: <conf-name>2009 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2009 Jun 20&#x2013;25</year>; <publisher-loc>Miami, FL, USA</publisher-loc>. p. <fpage>248</fpage>&#x2013;<lpage>55</lpage>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Kingma</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Ba</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Adam: a method for stochastic optimization</article-title>. <comment>arXiv:1412.6980. 2014</comment>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Bovik</surname> <given-names>AC</given-names></string-name>, <string-name><surname>Sheikh</surname> <given-names>HR</given-names></string-name>, <string-name><surname>Simoncelli</surname> <given-names>EP</given-names></string-name></person-group>. <article-title>Image quality assessment: from error visibility to structural similarity</article-title>. <source>IEEE Trans Image Process</source>. <year>2004</year>;<volume>13</volume>(<issue>4</issue>):<fpage>600</fpage>&#x2013;<lpage>12</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tip.2003.819861</pub-id>; <pub-id pub-id-type="pmid">15376593</pub-id></mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Reda</surname> <given-names>FA</given-names></string-name>, <string-name><surname>Shih</surname> <given-names>KJ</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>TC</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>A</given-names></string-name>, <string-name><surname>Catanzaro</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Image inpainting for irregular holes using partial convolutions</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>; <year>2018 Sep 8&#x2013;14</year>; <publisher-loc>Munich, Germany</publisher-loc>. p. <fpage>85</fpage>&#x2013;<lpage>100</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Du</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Recurrent feature reasoning for image inpainting</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2020 Jun 13&#x2013;19</year>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>7760</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zeng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Aggregated contextual transformations for high-resolution image inpainting</article-title>. <source>IEEE Trans Vis Comput Graph</source>. <year>2023</year>;<volume>29</volume>(<issue>7</issue>):<fpage>3266</fpage>&#x2013;<lpage>80</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tvcg.2022.3156949</pub-id>; <pub-id pub-id-type="pmid">35254985</pub-id></mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Image inpainting via conditional texture and structure dual generation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021 Oct 11&#x2013;17</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>14134</fpage>&#x2013;<lpage>43</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>