<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">53232</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.053232</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Research on Restoration of Murals Based on Diffusion Model and Transformer</article-title>
<alt-title alt-title-type="left-running-head">Research on Restoration of Murals Based on Diffusion Model and Transformer</alt-title>
<alt-title alt-title-type="right-running-head">Research on Restoration of Murals Based on Diffusion Model and Transformer</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Yaoyao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Xiao</surname><given-names>Mansheng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>xiaomansheng@hut.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Hu</surname><given-names>Yuqing</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Yan</surname><given-names>Jin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Zhu</surname><given-names>Zeyu</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Computing, Hunan University of Technology</institution>, <addr-line>Zhuzhou, 412000</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Computing, Hunan Software Vocational and Technical University</institution>, <addr-line>Xiangtan, 411100</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Mansheng Xiao. Email: <email>xiaomansheng@hut.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>12</day><month>9</month><year>2024</year></pub-date>
<volume>80</volume>
<issue>3</issue>
<fpage>4433</fpage>
<lpage>4449</lpage>
<history>
<date date-type="received">
<day>28</day>
<month>4</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>8</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 The Authors.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_53232.pdf"></self-uri>
<abstract>
<p>Owing to the limitations of prior knowledge and of the convolution operation, existing image restoration techniques cannot be applied directly to the restoration of cultural relic murals. To restore the original appearance of mural images more accurately, this paper proposes an image restoration method based on the Denoising Diffusion Probabilistic Model (DDPM) and the Transformer. The process involves two steps. In the first step, the damaged mural image is used as the condition to generate the noisy image; the time step, the condition, and the noisy image patches are fed to the noise prediction network, which captures global dependencies in the input sequence through a multi-head attention mechanism and feed-forward layers. Long skip connections between the shallow and deep Transformer blocks fuse global and local feature information to maintain the overall consistency of the restoration results. In the second step, the noisy image is taken as the condition that guides the diffusion model through reverse sampling to generate the restored image. Experimental results show that, relative to the comparison methods, the proposed method improves PSNR by 2% to 9% and SSIM by 1% to 3.3%. The proposed approach combines the advantages of the diffusion model and deep learning to make mural restoration results more accurate.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Transformer</kwd>
<kwd>deep learning</kwd>
<kwd>noise estimation network</kwd>
<kwd>diffusion model</kwd>
<kwd>mural restoration</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Hunan Provincial Natural Science and Technology</funding-source>
<award-id>2022JJ50077</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Natural Science Foundation of Hunan Province</funding-source>
<award-id>2024JJ8055</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>As a cultural heritage of a country or region, ancient frescoes carry a wealth of social, religious, and historical information. However, due to the long-term influence of environmental factors, mural images are usually faded or even mutilated. With the rapid development of artificial intelligence, intelligent restoration technology has attracted extensive attention from researchers [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>].</p>
<p>In the field of image restoration, traditional deep learning techniques such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) [<xref ref-type="bibr" rid="ref-3">3</xref>] have achieved significant results. CNNs, with their hierarchical feature extraction and reconstruction capabilities, can effectively restore local structures and textures of images. GANs, through adversarial training between the generator and the discriminator, can produce restoration results that closely resemble the original images. However, these traditional methods still have limitations when dealing with complex image structures and textures. For example, CNNs primarily focus on local information when processing images, making them relatively weak at capturing global dependencies. This makes it difficult to keep the restored areas consistent with the surrounding regions in images with complex structures and textures.</p>
<p>To solve these problems, the Transformer [<xref ref-type="bibr" rid="ref-4">4</xref>] architecture has been adopted for its powerful global information-capturing ability and self-attention mechanism. The Transformer overcomes the restricted receptive field of CNNs and uses attention mechanisms to achieve dynamic interaction and computation of features from different regions of the image. This improves the quality and efficiency of image restoration. Notably, the ViT (Vision Transformer) model proposed by Dosovitskiy et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] applies the Transformer to images. Through global attention mechanisms, it captures global information in images, providing the model with a larger receptive field and greater flexibility in the restoration process.</p>
<p>In addition, the diffusion model, as an emerging generative model, also shows great potential in the field of image restoration. Diffusion models transfer pixel information from neighboring regions by designing diffusion functions to fill in missing pixels and restore image integrity. Compared with GANs, diffusion models have better generalization ability and stability, and they can generate high-quality restoration results [<xref ref-type="bibr" rid="ref-6">6</xref>]. In particular, the Denoising Diffusion Probabilistic Model (DDPM) [<xref ref-type="bibr" rid="ref-7">7</xref>] generates samples through an iterative denoising process [<xref ref-type="bibr" rid="ref-8">8</xref>], making the generated images more coherent and realistic in both detail and global structure [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>Currently, research combining diffusion models [<xref ref-type="bibr" rid="ref-10">10</xref>] and Transformers [<xref ref-type="bibr" rid="ref-11">11</xref>] is relatively limited and still faces issues such as insufficient detail, limitations in capturing global structure, and lengthy training processes. In this study, we propose an image restoration model based on the Transformer and DDPM. This model combines the global information-capturing ability of the Transformer with the high-quality generation capability of DDPM. It aims to address key issues in the restoration of ancient mural images, such as the recovery of missing textures and the improvement of incomplete data coverage. By introducing ViT as the core structure and incorporating DDPM&#x2019;s denoising diffusion process, our model can adaptively adjust the scale of the attention mechanism during the restoration process. This enables better handling of image restoration tasks of varying sizes and complexities. Next, we describe in detail the structure, working principle, and experimental validation results of the model.</p>
<p>In summary, the main contributions of this work are four-fold:
<list list-type="order">
<list-item>
<p>We propose a method of cultural relics image restoration based on the diffusion model and ViT for the problems of missing structure and texture and incomplete data coverage of cultural relics murals due to improper preservation.</p></list-item>
<list-item>
<p>We design forward diffusion and backward sampling to guide the restoration. In forward diffusion, ViT is used to better extract and capture image information, so that a restored image with clear texture is generated and the original appearance of the heritage mural is recovered more plausibly. Additionally, we use long skip connections between the shallow and deep layers of ViT, enabling the model to use low-level features more effectively for pixel prediction training.</p></list-item>
<list-item>
<p>Before the output, we add a 3 &#x00D7; 3 convolution block to prevent artifacts that might appear in images generated by the Transformer. By adjusting different forms of model parameters, we address issues of image quality degradation and excessive processing time that deep learning models might encounter in image restoration.</p></list-item>
<list-item>
<p>We optimize the loss function and dynamically adjust the focus on different image regions to better measure the difference between the restoration results and the original murals. Extensive experiments validate that our proposed framework significantly outperforms the existing state-of-the-art methods.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>In diffusion-based image restoration, a U-Net architecture built on CNNs is commonly used to estimate the noise in the reverse generation process. In the forward process, the diffusion model converts the original image into a pure Gaussian noise image. However, inference requires many sampling steps and a long sampling time, which makes inference costly and restricts the scope and effectiveness of the approach [<xref ref-type="bibr" rid="ref-12">12</xref>]. How to converge to a specific prior distribution within the expected time [<xref ref-type="bibr" rid="ref-13">13</xref>], and how to incorporate adaptive mechanisms, are key issues that remain to be addressed. Noise prediction in diffusion models usually requires modeling complex data, including spatial and temporal variations as well as possible sources of noise. Traditional methods may be limited by feature extraction that does not restore mural images well, whereas ViT is better able to capture both global information and local structure in the image. We propose an improved denoising diffusion probabilistic model that combines the forward diffusion and reverse generation processes: samples are drawn from a standard Gaussian distribution and denoised iteratively so that they gradually evolve into samples that conform to the empirical distribution. In the forward diffusion phase of the improved model, the original data are corrupted by gradually adding Gaussian noise, and the added noise level is dynamically estimated by a Transformer-based noise estimation network.
In the reverse diffusion stage, the generative model learns the reverse diffusion process to recover the original input data from the noisy data. By introducing reverse denoising and combining it with the noise estimation network, the model captures the underlying distribution of the original data more efficiently, which improves both the performance of the generative model and the quality of the generated samples. This integration not only improves the quality of restoration but also accelerates the restoration process, making our method more efficient and practical for handling large-scale mural restoration tasks.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Forward Diffusion</title>
<p>The forward diffusion process is defined as a Markov chain in which Gaussian noise is added step by step to obtain increasingly noisy samples, gradually transforming the data distribution into a Gaussian noise distribution; the generative model is trained on these noisy samples. Specifically, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, consider a sequence of data samples <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>&#x223C;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the original image and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the image after the <italic>N</italic>th noise-adding step.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Schematic of adding noise to an image</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-1.tif"/>
</fig>
<p>The forward noise-adding step at moment <italic>t</italic> is defined as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the noisy data at moment <italic>t</italic>, <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, <italic>T</italic> is the number of noise-adding steps, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is determined by the empirically chosen noise schedule <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which increases linearly from 0.0001 to 0.002 during forward diffusion, and <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> is the noise added at moment <italic>t</italic>, which obeys a Gaussian distribution. In the forward diffusion process, the later the moment, the closer the noisy data is to pure noise, and each state depends only on the state at the previous moment. According to the Markov chain, the state <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is denoted as:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mrow><mml:mo>(</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. Because the two noise terms are independent zero-mean Gaussians, their sum is again Gaussian with the variances added, so <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> can also be expressed as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi></mml:math></disp-formula>Applying this substitution recursively, the relationship between <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is obtained as:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
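<p>The closed form in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref> can be illustrated with a minimal Python sketch. It is not the authors' implementation: the step count <italic>T</italic> = 1000, the helper names (linear_betas, cumulative_alphas, q_sample), and the 16-value toy input are assumptions made only for illustration, while the linear schedule from 0.0001 to 0.002 is the range quoted in the text. The last lines check the variance bookkeeping that merges the two noise terms of <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> into a single term.</p>
<code language="python">
```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-3):
    """Linearly spaced beta_t over the range quoted in the text."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of alpha_i = 1 - beta_i for i = 1..t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from x_0 in one step via Eq. (5); t is 1-indexed."""
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


T = 1000  # assumed step count, not taken from the paper
betas = linear_betas(T)
alpha_bar = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # toy flattened "image"
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)

# Variance bookkeeping behind Eq. (4): two independent noise terms merge.
a1, a2 = 1.0 - betas[0], 1.0 - betas[1]
merged_var = a2 * (1.0 - a1) + (1.0 - a2)  # equals 1 - a1 * a2
```
</code>
<p>Drawing the noisy sample directly from the original image in this way avoids stepping through all intermediate states, which is convenient when a random moment <italic>t</italic> is chosen for each training example.</p>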
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Reverse Generation</title>
<p>The forward diffusion process yields data that approximately obeys a Gaussian distribution, and the reverse diffusion process recovers the original data from this Gaussian noise by learning a parameterized distribution <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> that generates the original image from <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. If the reverse transition <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> were available, an image could be gradually recovered from the random noise <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.
DDPM uses a neural network <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> to fit the reverse transition <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. When the original image is also given, the reverse transition has the closed form:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>and the learned reverse transition is parameterized correspondingly as:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Applying Bayes&#x2019; formula gives the posterior mean:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msqrt><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:msqrt><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
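<p>As a rough numerical check of the posterior in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>, the following Python sketch evaluates its mean and variance using the standard DDPM coefficients, in which both terms of the mean share the denominator 1 minus the cumulative product at moment <italic>t</italic>. The second helper rewrites the same mean in terms of the noise by eliminating the original image with <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, with the noise multiplying the fraction rather than sitting in its denominator. The function names, the 10-step toy schedule, and the fixed input values are assumptions for illustration only and are not taken from the paper.</p>
<code language="python">
```python
import math


def posterior_params(x_t, x0, t, betas, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed, t >= 2."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    mean = [coef_xt * a + coef_x0 * b for a, b in zip(x_t, x0)]
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t  # beta-tilde_t
    return mean, var


def mean_from_noise(x_t, eps, t, betas, alpha_bar):
    """Same mean with x_0 eliminated: (x_t - beta_t / sqrt(1 - ab_t) * eps) / sqrt(alpha_t)."""
    beta_t = betas[t - 1]
    scale = beta_t / math.sqrt(1.0 - alpha_bar[t - 1])
    return [(a - scale * e) / math.sqrt(1.0 - beta_t) for a, e in zip(x_t, eps)]


# Toy 10-step linear schedule over the range quoted in the text (assumed length).
betas = [1e-4 + i * (2e-3 - 1e-4) / 9 for i in range(10)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

t = 5
x0 = [0.2, -0.7, 0.5]
eps = [0.3, -1.1, 0.8]  # fixed "noise" values so the example is deterministic
a = alpha_bar[t - 1]
x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

mean_a, var = posterior_params(x_t, x0, t, betas, alpha_bar)
mean_b = mean_from_noise(x_t, eps, t, betas, alpha_bar)
max_gap = max(abs(p - q) for p, q in zip(mean_a, mean_b))  # the two forms should agree
```
</code>
<p>Agreement between the two forms is what allows a network that predicts the noise to stand in for the unknown original image during reverse sampling.</p>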
<p><inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> depend only on <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and it follows from the forward diffusion to express <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>Substituting <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> into <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref> yields a mean that depends only on <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the noise:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mi>&#x03B5;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>Thus, a neural network <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> can be used to approximate <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula>, which gives the following mean:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the noise predicted by the trained model. To predict the noise accurately, the noise prediction network <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is trained to learn <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>E</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> by minimizing the objective function <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, where <italic>t</italic> is uniformly distributed between 1 and <italic>T</italic>.
To learn the conditional diffusion model, we further feed the conditional information <italic>c</italic> into the noise prediction objective:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>The image is then recovered from the noise predicted by the trained noise estimation model, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Diagram of the image-conditional diffusion model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-2.tif"/>
</fig>
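<p>The relations in <xref ref-type="disp-formula" rid="eqn-8">Eqs. (8)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-11">(11)</xref> can be checked numerically with a minimal sketch such as the one below. It is not the implementation used in this paper: the linear noise schedule, its end points, the function names, and the zero-based step index are illustrative assumptions, and the sketch assumes the standard form of the posterior mean in which both fractions share the denominator 1 &#x2212; &#x03B1;&#x0304;<sub>t</sub>. The squared error in <monospace>noise_loss</monospace> is likewise an assumed concrete choice for the norm in the objective.</p>
<code language="python">
```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_t, alpha_t = 1 - beta_t and the running product alpha_bar_t.

    The linear schedule and its end points are assumptions for illustration.
    """
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def forward_sample(x0, alpha_bar_t, eps):
    """Forward relation x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, inverted by Eq. (9)."""
    return [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
            for x, e in zip(x0, eps)]


def recover_x0(xt, alpha_bar_t, eps):
    """Eq. (9): express x_0 through x_t and the noise."""
    return [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for x, e in zip(xt, eps)]


def mean_from_x0(xt, x0, t, betas, alphas, alpha_bars):
    """Posterior mean as in Eq. (8); t is a zero-based index and must be at least 1."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_xt = math.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    coef_x0 = math.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    return [coef_xt * a + coef_x0 * b for a, b in zip(xt, x0)]


def mean_from_eps(xt, eps, t, betas, alphas, alpha_bars):
    """Eq. (11): the same mean written with the (true or predicted) noise."""
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    return [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(xt, eps)]


def noise_loss(eps, eps_pred):
    """Squared error between true and predicted noise (one Monte Carlo term)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
```
</code>
<p>When the true noise is passed to <monospace>mean_from_eps</monospace>, its output should coincide with <monospace>mean_from_x0</monospace> up to rounding, which is the substitution that leads from the posterior mean to its noise-based form.</p>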
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Noise Estimation Based on Transformer</title>
<p>ViT can better capture both the global information and the local structures of images. It also generalizes better: instead of relying on fixed-size filters as traditional convolutional neural networks do, it processes input sequences through a self-attention mechanism, which makes it more flexible with respect to the length and size of the inputs. For this reason, this paper improves a simple, general ViT-based noise estimation network, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The model follows the Transformer design: all inputs, including the time, the condition, and the noisy image patches, are treated as tokens, and long skip branches connect the shallow and deep layers. This architecture allows low-level features to be used for more efficient training of the pixel-level prediction objective. In addition, to prevent possible artifacts in the images generated by the Transformer [<xref ref-type="bibr" rid="ref-14">14</xref>], we add a 3 &#x00D7; 3 convolutional block before the output. In our experiments, this improved the visual quality of the restored images.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Structure of the noise estimation network</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-3.tif"/>
</fig>
<p>The model, shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, is a simple, general backbone network for diffusion models used in image generation. It takes as input the time <italic>t</italic>, the condition <italic>c</italic> (discrete text converted into an embedding sequence by a CLIP encoder, consistent with Stable Diffusion, and fed in as a token sequence), and the noisy image <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and it estimates the noise injected into <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Following the design of ViT, the image is split into equally sized patches that are flattened into a sequence, and the time, condition, and patch tokens together form the input. The input sequence is then processed by a multi-head attention mechanism and a feed-forward neural network, which lets the model capture global dependencies in the sequence, and stacking multiple Transformer blocks incrementally refines the feature representation at higher levels. To improve information flow and reduce information loss, we add long skip branches between Transformer blocks, borrowing the idea from the CNN-based U-Net. These branches pass shallow information directly to deeper layers, preserving the low-level features of the image and giving them an efficient path through the network, which simplifies the training of the noise prediction network. The output of the Transformer blocks is mapped back to a spatial representation of the noisy image, and the features are further processed by a 3 &#x00D7; 3 convolutional layer to improve the model&#x2019;s ability to capture image details.</p>
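<p>As a rough illustration of how the input sequence is assembled, the sketch below splits an image into patches and places the time, condition, and patch tokens in one sequence. It is only a sketch under stated assumptions: the nested-list image layout (height &#x00D7; width &#x00D7; channels), the function names, and the token order are hypothetical, and the learned projections that would map every token to a common width D are omitted.</p>
<code language="python">
```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Patches are visited in row-major order; each token has patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])
            tokens.append(flat)
    return tokens


def build_sequence(time_token, cond_tokens, patch_tokens):
    """Place the time token, condition tokens and patch tokens in one sequence.

    In a real model each token would first be projected to the same width D;
    that projection is left out of this sketch.
    """
    return [time_token] + list(cond_tokens) + list(patch_tokens)
```
</code>
<p>For example, a 4 &#x00D7; 4 single-channel image with patch size 2 yields four patch tokens of length 4, so one time token and two condition tokens give a sequence of seven tokens.</p>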
<p>In <xref ref-type="sec" rid="s3_1">Section 3.1</xref>, we present a specific instantiation of our model, detailing its core components and architecture. Subsequently, in <xref ref-type="sec" rid="s3_2">Section 3.2</xref>, we delve into evaluating the model&#x2019;s scalability potential, focusing on the influence of architectural dimensions such as depth, width, and patch size on its performance.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Concrete Realization</title>
<p>To make the model more effective, we conduct a systematic empirical study of its key elements through ablation experiments on our dataset. We evaluate the FID score of 1 K generated samples every 5 K training iterations and select the best-performing configuration.</p>
<p>We investigate various strategies for integrating the long skip branch within our Transformer architecture. Specifically, we consider the main branch embedding <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the long skip branch embedding <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, both of which reside in <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Prior to passing these embeddings to the subsequent Transformer block, we explore five fusion approaches: (1) concatenating <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> followed by a linear projection (Linear(Concat(<inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>))), (2) direct element-wise summation (<inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), (3) applying a linear projection to <inline-formula id="ieqn-42"><mml:math
id="mml-ieqn-42"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> prior to summation (<inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> &#x002B; Linear(<inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>)), (4) summing the embeddings and then applying a linear projection (Linear(<inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>)), and (5) a baseline scenario without the long skip branch for comparative analysis. Notably, the direct summation of <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (i.e., <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) solely modulates the contribution of <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in a linear fashion, leaving the fundamental network architecture unaltered. In contrast, all alternative fusion strategies involving the long skip branch demonstrate enhanced performance compared to the absence of such a connection. 
As shown in <xref ref-type="fig" rid="fig-4">Fig. 4a</xref>, the approach that concatenates <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> followed by a linear projection emerges as the most effective, suggesting that this method is particularly adept at leveraging the complementary information from both branches.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Ablation of design choices: (a) combining the long skip branch; (b) adding an extra convolutional block; (c) feeding time into the network; (d) different forms of patch embedding; (e) different forms of position embedding</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-4.tif"/>
</fig>
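<p>The difference between fusion strategies (1) and (2) can be made concrete with a small sketch. The code below is illustrative only: it assumes that each branch embedding is stored as a list of token vectors of equal width, and the function names, weights, and biases are hypothetical placeholders for learned parameters.</p>
<code language="python">
```python
def linear(x, weight, bias):
    """Apply y = W x + b to one token vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def fuse_concat_linear(h_m, h_s, weight, bias):
    """Strategy (1): Linear(Concat(h_m, h_s)), applied token by token."""
    return [linear(m + s, weight, bias) for m, s in zip(h_m, h_s)]


def fuse_add(h_m, h_s):
    """Strategy (2): element-wise summation h_m + h_s."""
    return [[a + b for a, b in zip(m, s)] for m, s in zip(h_m, h_s)]
```
</code>
<p>With the weight set to two identity matrices placed side by side and a zero bias, <monospace>fuse_concat_linear</monospace> reproduces <monospace>fuse_add</monospace>, so strategy (1) contains strategy (2) as a special case while also being able to learn other combinations of the two branches.</p>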
<p>To further enhance the model, we evaluated two ways of adding a convolutional block after the Transformer blocks, together with a baseline: (1) appending a 3 &#x00D7; 3 convolutional block after the linear projection that converts token embeddings into image patches, as illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4b</xref>; (2) adding a 3 &#x00D7; 3 convolutional block before the linear projection, which requires rearranging the one-dimensional sequence of token embeddings into a two-dimensional feature map of dimensions H/P &#x00D7; W/P &#x00D7; D, where P is the patch size; and (3) adding no extra convolutional block. According to the results in <xref ref-type="fig" rid="fig-4">Fig. 4b</xref>, adding the 3 &#x00D7; 3 convolutional block after the linear projection performs marginally better than the other two alternatives.</p>
<p>To incorporate temporal information into the network, we evaluate two methods for feeding in the time variable <italic>t</italic>: (1) treating it as a token, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4c</xref>; and (2) injecting it after the layer normalization in the Transformer block [<xref ref-type="bibr" rid="ref-15">15</xref>], analogous to the adaptive group normalization used in U-Net [<xref ref-type="bibr" rid="ref-16">16</xref>]. The second method is referred to as adaptive layer normalization (AdaLN) and is defined as <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mi>L</mml:mi><mml:mi>a</mml:mi><mml:mi>y</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>N</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>h</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <italic>h</italic> is an embedding inside the Transformer block and <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are obtained from a linear projection of the time embedding. Although AdaLN is relatively simple, the first approach, which treats time as a token, outperforms it, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4c</xref>.</p>
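<p>For readers unfamiliar with AdaLN, the operation can be sketched for a single token as follows. This is a minimal illustration rather than the code used in our experiments: the helper names, the use of two separate projection matrices for the scale and the shift (equivalent to one projection split in two), and the small constant added to the variance are assumptions.</p>
<code language="python">
```python
import math


def layer_norm(h, eps=1e-5):
    """Normalize one token vector to zero mean and (nearly) unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def linear(x, weight, bias):
    """Apply y = W x + b; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def ada_ln(h, time_emb, w_scale, b_scale, w_shift, b_shift):
    """AdaLN(h, y) = y_s * LayerNorm(h) + y_b.

    y_s and y_b are linear projections of the time embedding (hypothetical weights).
    """
    y_s = linear(time_emb, w_scale, b_scale)
    y_b = linear(time_emb, w_shift, b_shift)
    normed = layer_norm(h)
    return [s * n + b for s, n, b in zip(y_s, normed, y_b)]
```
</code>
<p>Because the scale and shift depend only on the time embedding, the same token is modulated differently at different time steps, whereas the token-based alternative leaves the normalization untouched and lets attention mix the time information into the sequence.</p>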
<p>We examine two variants of patch embedding: (1) the original approach, which uses a linear projection to map each patch to a token embedding, as illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4d</xref>; and (2) a stack of 3 &#x00D7; 3 convolutional blocks followed by a 1 &#x00D7; 1 convolutional block that maps the image to token embeddings. The results indicate that the original linear patch embedding outperforms the convolutional variant.</p>
<p>We explore two variants of positional embedding for our image restoration framework: (1) the one-dimensional learnable positional embedding proposed in the original ViT, which is the default setting in this paper; and (2) a two-dimensional sinusoidal positional embedding, constructed by concatenating the sinusoidal embeddings of the coordinates i and j for each patch located at (i, j). According to the results in <xref ref-type="fig" rid="fig-4">Fig. 4e</xref>, the former performs better than the latter. We also tried removing the positional embedding entirely and found that the model could not generate meaningful images, which indicates that positional information is crucial for image generation.</p>
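<p>The second variant can be sketched as below. The frequency base of 10000 and the layout of sine terms followed by cosine terms follow the common Transformer convention and are assumptions here, as are the function names; the paper only specifies that the embeddings of i and j are concatenated.</p>
<code language="python">
```python
import math


def sincos_1d(position, dim):
    """Sinusoidal embedding of one coordinate; dim must be even."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (k / half)) for k in range(half)]
    return ([math.sin(position * f) for f in freqs]
            + [math.cos(position * f) for f in freqs])


def sincos_2d(grid_h, grid_w, dim):
    """Embedding for every patch (i, j): concat(sincos_1d(i), sincos_1d(j)).

    Patches are listed in row-major order; dim must be divisible by 4.
    """
    if dim % 4:
        raise ValueError("dim must be divisible by 4")
    return [sincos_1d(i, dim // 2) + sincos_1d(j, dim // 2)
            for i in range(grid_h) for j in range(grid_w)]
```
</code>
<p>Unlike the learnable one-dimensional embedding, this table has no trainable parameters and encodes the row and column of each patch separately.</p>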
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Influence of Depth, Width, and Patch Size</title>
<p>We study the scalability of the proposed model by varying its depth (the number of layers), its width (the hidden size D), and its patch size. As shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, the best performance in the 5 K-iteration experiment was achieved at a depth of 14, and deeper models brought no further benefit; performance therefore does not increase monotonically with depth. Similarly, increasing the width from 256 to 512 yields a noticeable improvement, but a further increase to 768 brings no discernible gain, indicating that the model&#x2019;s ability to exploit additional width saturates. A smaller patch size, on the other hand, consistently improves performance. By contrast, high-level tasks (e.g., classification) may require larger patches. In practice, because image data are high-dimensional, training with small patch sizes can be costly, so we recommend downscaling the image data before applying the model.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Influence of different factors: (a) width; (b) depth; (c) patch size</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-5.tif"/>
</fig>
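<p>The cost of a smaller patch size follows from simple counting, sketched below. The function names are hypothetical, and the estimate only counts patch tokens and pairwise attention interactions; it ignores the time and condition tokens and all other costs.</p>
<code language="python">
```python
def num_patch_tokens(height, width, patch):
    """Number of patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Self-attention compares every token with every token."""
    return n_tokens * n_tokens
```
</code>
<p>For a 256 &#x00D7; 256 input (an assumed size), patch sizes of 16, 8, and 4 give 256, 1024, and 4096 patch tokens, respectively. Halving the patch size therefore quadruples the sequence length and multiplies the number of attention pairs by 16, which is why downscaling the image first keeps training affordable.</p>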
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Optimizing Denoising Losses and Estimating Distribution Functions</title>
<p>To better accomplish the restoration task with the diffusion model, we refine its learning process by combining two objective functions. The first objective is a simple denoising loss. Given a reference output image <italic>x</italic> and a randomly selected time step <italic>t</italic>, a noisy version of the reference image is generated as follows:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msqrt><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mover><mml:mi>&#x03B1;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:msqrt><mml:mi>&#x03B5;</mml:mi></mml:math></disp-formula>where the total number of time steps <italic>T</italic> is set to 500. We train our conditional diffusion model to faithfully reconstruct the reference image <italic>x</italic> given the conditional feature <italic>c</italic> and the time step <italic>t</italic>, using the following loss:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>Based on the improved denoising diffusion model, we further train the network to predict the variance <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which not only improves the fidelity of the reconstructed image but also helps to improve its log-likelihood. The conditional diffusion model additionally outputs an interpolation coefficient <italic>s</italic> for each dimension, which is converted to a variance as follows:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03B2;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denote the upper and lower bounds of the variance, respectively. The second objective function directly optimizes the KL divergence between the estimated distribution <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and the diffusion posterior <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, defined as:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>l</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>K</mml:mi><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The total loss function is the weighted sum of the two objective functions, formulated as follows:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03B7;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>l</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> are weights that balance the two loss terms. The combined loss improves the performance of the network model and accelerates convergence, which improves training efficiency. The values <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>&#x03B7;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.6</mml:mn></mml:math></inline-formula> were determined experimentally.</p>
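<p>The pieces of the loss can be illustrated for a single scalar dimension as follows. This is a sketch under assumptions, not the training code: the function names are hypothetical, the interpolation uses the coefficient <italic>s</italic> for the weight of log &#x03B2;<sub>t</sub> and 1 &#x2212; <italic>s</italic> for the weight of the lower bound, and the KL term uses the closed form for two univariate Gaussians, with the order of its arguments chosen by the caller.</p>
<code language="python">
```python
import math


def interpolated_variance(s, beta_t, beta_tilde_t):
    """Log-space interpolation between the two variance bounds, s in [0, 1]."""
    return math.exp(s * math.log(beta_t) + (1.0 - s) * math.log(beta_tilde_t))


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(N(mean_p, var_p) || N(mean_q, var_q)) for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q
                  - 1.0)


def total_loss(l_simple, l_vlb, eta=0.4, lam=0.6):
    """Weighted sum of the denoising loss and the variational term."""
    return eta * l_simple + lam * l_vlb
```
</code>
<p>Setting <italic>s</italic> to 1 or 0 recovers the upper or lower bound exactly, and intermediate values give the geometric interpolation between them, so the predicted variance always stays within the two bounds.</p>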
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments and Analysis</title>
<p>To verify the effectiveness of the proposed restoration method on ancient mural images and to compare it with existing restoration methods, we conducted experiments on our dataset. The experimental procedure is described below.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Data Sets and Experimental Environments</title>
<p>Because datasets of cultural relic murals are scarce, we collected 4000 mural images of varying resolution from the official websites of the Dunhuang Research Institute and the Shanxi Museum. We first manually screened these images, eliminating those dominated by a single color or containing too much irrelevant content, and then applied data augmentation to obtain a final training dataset of 10,000 cultural relic images of varying resolution.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluation Indicators</title>
<p>We validated the method with both subjective and objective evaluation. Subjective evaluation consists of visually inspecting the texture and color of the generated images. Objective evaluation uses the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Fr&#x00E9;chet Inception Distance (FID) to assess the strengths and weaknesses of each algorithm. PSNR mainly measures the fidelity of the reconstructed image with respect to noise; the higher the value, the better the quality of the reconstructed image. SSIM combines three factors: brightness, contrast, and structure. The mean is used as an estimate of brightness, the standard deviation as an estimate of contrast, and the covariance as an estimate of structural similarity. SSIM takes values in [0, 1]; the closer the result is to 1, the better the reconstructed image quality. FID measures the similarity between high-level features of the images; the smaller the value, the higher the degree of similarity.</p>
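<p>The PSNR and SSIM definitions above can be sketched in a few lines of Python. This is an illustrative simplification rather than the evaluation code used in this paper: images are assumed to be flat lists of 8-bit intensities, SSIM is computed over a single global window (common practice averages it over local windows), and the stabilizing constants 0.01 and 0.03 are the commonly used defaults, which the paper does not state. FID is omitted because it requires a pretrained Inception network.</p>
<preformat>
```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM built from means, variances, and covariance."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2  # assumed default constant
    c2 = (0.03 * max_value) ** 2  # assumed default constant
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    # Toy 4-pixel "images"; values are illustrative only.
    original = [10.0, 50.0, 90.0, 130.0]
    restored = [12.0, 48.0, 91.0, 128.0]
    print(psnr(original, restored), global_ssim(original, restored))
```
</preformat>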
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Experimental Results Analysis</title>
<p>For an objective comparison, all methods use the same input data. We compare the proposed method with five existing methods on the irregularly damaged images of this dataset: the hierarchical Transformer-based image restoration method [<xref ref-type="bibr" rid="ref-17">17</xref>]; Contextual Attention [<xref ref-type="bibr" rid="ref-18">18</xref>], a method based on generative adversarial networks that produces high-quality restored images by matching and correlating background patches; Shift-Net [<xref ref-type="bibr" rid="ref-12">12</xref>], a deep learning method that incorporates prior information; Global Uniform and Local Continuity (GU&#x0026;LC) [<xref ref-type="bibr" rid="ref-19">19</xref>], which combines global uniformity and local continuity based on the relationship between linear systems and image restoration; and PD-GAN [<xref ref-type="bibr" rid="ref-20">20</xref>], a probabilistic diverse GAN method.</p>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Repair of Scratches and Damages of Different Sizes</title>
<p>This section compares the restoration results of each method on murals with different scratches that reflect realistic damage. The damaged regions of cultural relics are typically irregular in shape and discontinuous; <xref ref-type="fig" rid="fig-6">Fig. 6</xref> shows the restoration results of each method on such damage.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparison of repair results of different algorithms for real damaged scenes</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-6.tif"/>
</fig>
<p>We subjectively compare the reconstruction quality of the methods at the same magnification, focusing on the texture and the color brightness of the generated images. As seen in <xref ref-type="fig" rid="fig-6">Fig. 6b</xref>, the hierarchical Transformer-based method repairs larger scratches well but still fails to repair smaller scratches completely. As shown in <xref ref-type="fig" rid="fig-6">Fig. 6c</xref>, the Contextual Attention method repairs scratches more completely than the previous method, but some repaired details remain implausible, probably because the model predicts the structure of larger damaged areas inaccurately. This is particularly evident in the third image, where the known region does not provide enough prior information. As seen in <xref ref-type="fig" rid="fig-6">Fig. 6d</xref>, the Shift-Net method performs well overall: it repairs the basic scratches and broken parts, and the color remains largely unchanged. However, the lack of contextual semantic information in the repaired region blurs the texture, so that details are no longer visible. In the first image in particular, the Shift-Net result departs little from the original color and still looks dark and aged, while in the third image the basic scratches are all restored but the restored details do not look natural. As seen in <xref ref-type="fig" rid="fig-6">Fig. 6e</xref>,<xref ref-type="fig" rid="fig-6">f</xref>, the restoration quality of the GU&#x0026;LC and PD-GAN methods is relatively high. Although GU&#x0026;LC eliminates artifacts, its detail restoration is still not as natural as that of the proposed method. PD-GAN produces complete and contextually plausible results, but its detail and color still leave room for improvement. As shown in <xref ref-type="fig" rid="fig-6">Fig. 6g</xref>, the proposed method makes the texture clear while also bringing out the vivid colors of the image, so that the originally dim images of cultural relics regain their former glory. On this dataset, the proposed method outperforms the other methods: its restoration results are semantically coherent, contain fewer artifacts and duplicated textures, and achieve better metrics and visual quality.</p>
<p>The quantitative evaluation of each method is shown in <xref ref-type="table" rid="table-1">Table 1</xref>. The proposed method achieves the best PSNR and SSIM. Compared with the other methods, in the order listed in the table, PSNR improves by 9.01%, 5.42%, 2.11%, 3.32%, and 1.51%, and SSIM improves by 1.87%, 3.31%, 2.67%, 3.48%, and 1.89%, respectively. These results show that the proposed method has strong structural recovery ability for scratched and damaged cultural relics and recovers texture and color more plausibly.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Scratch damage repair results of the proposed method and other methods</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>PSNR/dB</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Based on hierarchical Transformer [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>28.6570</td>
<td>0.8756</td>
</tr>
<tr>
<td>Contextual-Attention [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>29.6303</td>
<td>0.8634</td>
</tr>
<tr>
<td>Shift-Net [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>30.5930</td>
<td>0.8688</td>
</tr>
<tr>
<td>GU&#x0026;LC [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>30.2321</td>
<td>0.8620</td>
</tr>
<tr>
<td>PD-GAN [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>30.7720</td>
<td>0.8754</td>
</tr>
<tr>
<td>Proposed</td>
<td><bold>31.2376</bold></td>
<td><bold>0.8920</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
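<p>The relative improvements quoted for Table 1 can be recomputed with the short Python sketch below. The dictionary keys and function names are illustrative choices; the numeric values are transcribed from Table 1.</p>
<preformat>
```python
# (PSNR in dB, SSIM) for each compared method, transcribed from Table 1.
TABLE_1 = {
    "Hierarchical Transformer [17]": (28.6570, 0.8756),
    "Contextual-Attention [18]": (29.6303, 0.8634),
    "Shift-Net [12]": (30.5930, 0.8688),
    "GU and LC [19]": (30.2321, 0.8620),
    "PD-GAN [20]": (30.7720, 0.8754),
}
PROPOSED = (31.2376, 0.8920)


def relative_gain(ours, baseline):
    """Percentage improvement of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline


def gains_over_baselines(table, proposed):
    """Map each method name to its (PSNR gain %, SSIM gain %)."""
    return {
        name: (relative_gain(proposed[0], psnr), relative_gain(proposed[1], ssim))
        for name, (psnr, ssim) in table.items()
    }


if __name__ == "__main__":
    for name, (psnr_gain, ssim_gain) in gains_over_baselines(TABLE_1, PROPOSED).items():
        print(f"{name}: PSNR +{psnr_gain:.2f}%, SSIM +{ssim_gain:.2f}%")
```
</preformat>
<p>Rounded to two decimals, the PSNR gain over GU&#x0026;LC and the SSIM gain over PD-GAN come out near 3.33% and 1.90%, which is consistent with the reported 3.32% and 1.89% if those figures were truncated rather than rounded.</p>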
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Restoration of Natural Weathering Dislodgement</title>
<p>The experiments in this section focus on two types of restoration: repair of large-area and of small-area weathering and shedding. We examine whether the restored image outlines the painted subject clearly, whether the color contrast is sharp, and whether the texture is plausible. The experiments therefore use images of figure murals from the dataset, with large weathering and shedding areas of 30% to 40% and small weathering and shedding areas of 5% to 10%. This was done to ensure consistency and reproducibility of the experiments and to focus on specific types of cultural relic images. The experimental results are shown in <xref ref-type="fig" rid="fig-7">Figs. 7</xref> and <xref ref-type="fig" rid="fig-8">8</xref>. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the proposed method captures the global information of the large weathered and detached regions well and produces clear texture, although a small part of the result is slightly distorted. As shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, for small detached areas the proposed method also restores the missing parts well and completes the mural. Compared with the large-area case, the color contrast is improved and there is no distortion, and the restored structure is coherent and consistent with the contextual semantic information. According to <xref ref-type="table" rid="table-2">Table 2</xref>, for large-area weathering and shedding the PSNR of the proposed method exceeds that of the other methods by 2.5523, 2.3094, 0.3762, 0.9981, and 0.5502 dB, and the SSIM by 0.0290, 0.0490, 0.0199, 0.0287, and 0.0164, respectively. For small-area weathering and shedding, the PSNR increases by 1.9977, 2.6687, 0.6686, 1.3537, and 0.9058 dB, and the SSIM by 0.0307, 0.0537, 0.0246, 0.0334, and 0.0211. Small areas are thus restored better than large areas.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Restoration results for large-area shedding</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-7.tif"/>
</fig><fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Restoration results for small-area shedding</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-8.tif"/>
</fig><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Repair results of the proposed method and other methods for shedding damage of different sizes</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Method</th>
<th align="center" colspan="2">Large-area shedding</th>
<th align="center" colspan="2">Small-area shedding</th>
</tr>
<tr>
<th>PSNR/dB</th>
<th>SSIM</th>
<th>PSNR/dB</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Based on hierarchical Transformer [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>28.3241</td>
<td>0.8630</td>
<td>29.2343</td>
<td>0.8660</td>
</tr>
<tr>
<td>Contextual-Attention [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>28.5670</td>
<td>0.8430</td>
<td>28.5633</td>
<td>0.8430</td>
</tr>
<tr>
<td>Shift-Net [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>30.5002</td>
<td>0.8721</td>
<td>30.5634</td>
<td>0.8721</td>
</tr>
<tr>
<td>GU&#x0026;LC [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>29.8783</td>
<td>0.8633</td>
<td>29.8783</td>
<td>0.8633</td>
</tr>
<tr>
<td>PD-GAN [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>30.3262</td>
<td>0.8756</td>
<td>30.3262</td>
<td>0.8756</td>
</tr>
<tr>
<td>Proposed</td>
<td><bold>30.8764</bold></td>
<td><bold>0.8920</bold></td>
<td><bold>31.2320</bold></td>
<td><bold>0.8967</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Ablation Experiment</title>
<p>To further validate the effectiveness of the proposed restoration method for cultural relics and the contribution of each component, we conducted a series of ablation experiments that analyze the effect of different modules on the restoration results. First, we evaluate the denoising diffusion model alone on the restoration of cultural relic murals. In this experiment, the long skip connections and the additional 3 &#x00D7; 3 convolutional layers were removed to determine the repair ability of the denoising diffusion model itself. We then analyze the effect of the long skip connections and of the additional convolutional layers on the repair result, in order to verify their ability to improve the accuracy of noise estimation. <xref ref-type="table" rid="table-3">Table 3</xref> shows the performance of the different modules on this dataset.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>The performance of different modules</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model variant</th>
<th>PSNR/dB</th>
<th>SSIM</th>
<th>FID</th>
</tr>
</thead>
<tbody>
<tr>
<td>Base model</td>
<td>28.302</td>
<td>0.834</td>
<td>7.32</td>
</tr>
<tr>
<td>Removal of LSC</td>
<td>28.653</td>
<td>0.854</td>
<td>6.78</td>
</tr>
<tr>
<td>Removal of extra convolutional layer</td>
<td>29.042</td>
<td>0.876</td>
<td>5.95</td>
</tr>
<tr>
<td>Full-scale model</td>
<td><bold>30.030</bold></td>
<td><bold>0.908</bold></td>
<td><bold>5.48</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="table-3">Table 3</xref>, adding long skip connections (LSC) to the noise estimation network improves PSNR by 0.74 dB and SSIM by 0.042 over the base model without them. Adding the extra convolutional layer improves over the base model by 0.351 dB in PSNR and 0.02 in SSIM. Adding both improves PSNR by 1.728 dB and SSIM by 0.074, and reduces FID by 1.84. As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, after the LSC module is removed, the repaired image exhibits clear color contrast and higher saturation, but structural deficiencies remain. Compared with LSC, the impact of the additional convolutional layer is relatively minor: when this module is removed, the restored mural is coherent with well-restored details, but the color contrast is less distinct. Using both modules together gives the best result in all three metrics.</p>
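<p>The gains discussed above can be read off Table 3 as differences from the base model, as in the following Python sketch. The variant keys are hypothetical labels, and the mapping of table rows to modules (the row with LSC removed still contains the extra convolutional layer, and vice versa) is an interpretation that matches the figures quoted in the text.</p>
<preformat>
```python
# (PSNR in dB, SSIM, FID) for each variant, transcribed from Table 3.
ABLATION = {
    "base": (28.302, 0.834, 7.32),
    "conv_only": (28.653, 0.854, 6.78),  # row with LSC removed
    "lsc_only": (29.042, 0.876, 5.95),   # row with extra conv layer removed
    "full": (30.030, 0.908, 5.48),
}


def delta_vs_base(variant, table=ABLATION):
    """Return (PSNR, SSIM, FID) differences of `variant` from the base model."""
    base = table["base"]
    row = table[variant]
    return tuple(round(value - ref, 3) for value, ref in zip(row, base))


if __name__ == "__main__":
    for name in ("conv_only", "lsc_only", "full"):
        print(name, delta_vs_base(name))
```
</preformat>
<p>A negative FID difference indicates an improvement, since lower FID is better.</p>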
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Visual effects of different modules in the restoration of cultural relics murals</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_53232-fig-9.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusions</title>
<p>This paper proposes and implements a new method for restoring images of cultural relics, which aims to restore the original appearance of mural images more plausibly through a sequence of fine-grained steps. The noise is estimated mainly by the noise prediction network of the diffusion model. The proposed improved Transformer module processes image information efficiently: its long skip connections reduce the information loss caused by repeated up-sampling and down-sampling, which allows the module to improve the restoration of damaged images while maintaining global attention. The experimental results show that our method achieves excellent results in both the damage repair experiments and the large-area damage repair experiments, as confirmed by both subjective and objective assessment.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to express their heartfelt gratitude to the editors and reviewers for their detailed review and insightful advice.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work was financially supported by the Hunan Provincial Natural Science and Technology Fund Project (Grant No. 2022JJ50077) and the Natural Science Foundation of Hunan Province (Grant No. 2024JJ8055).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors contributed to this paper in the following capacities: conception and design of the study: Mansheng Xiao, Yuqing Hu; data collection: Yaoyao Wang; analysis and interpretation of results: Yaoyao Wang, Jin Yan, Zeyu Zhu; preparation of draft manuscript: Yaoyao Wang. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>Data and materials are available from the authors upon request.</p>
</sec>
<sec><title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>G. H.</given-names> <surname>Geng</surname></string-name>, and <string-name><given-names>M. Q.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Application of digital processing in relic image restoration design</article-title>,&#x201D; <source>Sens. Imaging</source>, vol. <volume>21</volume>, no. <issue>1</issue>, <year>2020</year>, <comment>Art. no. 6</comment>. doi: <pub-id pub-id-type="doi">10.1007/s11220-019-0265-8</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. N.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>Y. L.</given-names> <surname>Fu</surname></string-name></person-group>, &#x201C;<article-title>Deep learning algorithm in ancient relics image colour restoration technology</article-title>,&#x201D; <source>Multimed. Tools. Appl.</source>, vol. <volume>82</volume>, no. <issue>15</issue>, pp. <fpage>23119</fpage>&#x2013;<lpage>23150</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s11042-022-14108-z</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Goodfellow</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Generative adversarial nets</article-title>,&#x201D; <source>Commun. ACM</source>, vol. <volume>63</volume>, no. <issue>11</issue>, pp. <fpage>139</fpage>&#x2013;<lpage>144</lpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1145/3422622</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <conf-name>Adv. Neural Inf. Process. Syst.</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, <year>Dec. 4&#x2013;9, 2017</year>, pp. <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1706.03762</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Dosovitskiy</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>An image is worth 16 &#x00D7; 16 words: Transformers for image recognition at scale</article-title>,&#x201D; in <conf-name>Int. Conf. Learn. Represent.</conf-name>, <publisher-loc>Millennium Hall, Addis Ababa, Ethiopia</publisher-loc>, <year>Apr. 26&#x2013;30, 2020</year>, pp. <fpage>1</fpage>&#x2013;<lpage>21</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. Y.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>S. Y.</given-names> <surname>Qu</surname></string-name>, <string-name><given-names>Q. Y.</given-names> <surname>Hu</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>A research review on deep learning based image restoration methods</article-title>,&#x201D; (in Chinese), <source>J. Northwest. Univ. (Nat. Sci. Ed.)</source>, vol. <volume>53</volume>, no. <issue>6</issue>, pp. <fpage>943</fpage>&#x2013;<lpage>963</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.16152/j.cnki.xdxbzr.2023-06-006</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ho</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jain</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Abbeel</surname></string-name></person-group>, &#x201C;<article-title>Denoising diffusion probabilistic models</article-title>,&#x201D; in <conf-name>Adv. Neural Inf. Process. Syst.</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, <year>Dec. 6&#x2013;12, 2020</year>, pp. <fpage>6840</fpage>&#x2013;<lpage>6851</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2006.11239</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Lugmayr</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Danelljan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Romero</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Timofte</surname></string-name>, and <string-name><given-names>L. V.</given-names> <surname>Gool</surname></string-name></person-group>, &#x201C;<article-title>RePaint: Inpainting using denoising diffusion probabilistic models</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>New Orleans, LA, USA</publisher-loc>, <year>Jun. 19&#x2013;24, 2022</year>, pp. <fpage>11461</fpage>&#x2013;<lpage>11471</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01117</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. H.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>C. B.</given-names> <surname>Zhou</surname></string-name>, and <string-name><given-names>X. C.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Review of generating diffusion models</article-title>,&#x201D; (in Chinese), <source>Comput. Sci.</source>, vol. <volume>51</volume>, no. <issue>1</issue>, pp. <fpage>273</fpage>&#x2013;<lpage>283</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.11896/jsjkx.230300057</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Sasaki</surname></string-name>, <string-name><given-names>C. G.</given-names> <surname>Willcocks</surname></string-name>, and <string-name><given-names>T. P.</given-names> <surname>Breckon</surname></string-name></person-group>, &#x201C;<article-title>UNIT-DDPM: Unpaired image translation with denoising diffusion probabilistic models</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>Nashville, TN, USA</publisher-loc>, <year>Jun. 20&#x2013;25, 2021</year>, pp. <fpage>14125</fpage>&#x2013;<lpage>14134</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2104.05358</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Bao</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>All are worth words: A ViT backbone for diffusion models</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>Vancouver, BC, Canada</publisher-loc>, <year>Jun. 18&#x2013;22, 2023</year>, pp. <fpage>22669</fpage>&#x2013;<lpage>22679</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52729.2023.02171</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z. Y.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>X. M.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>W. M.</given-names> <surname>Zuo</surname></string-name>, and <string-name><given-names>S. G.</given-names> <surname>Shan</surname></string-name></person-group>, &#x201C;<article-title>Shift-Net: Image inpainting via deep feature rearrangement</article-title>,&#x201D; in <conf-name>Proc. Eur. Conf. Comput. Vis.</conf-name>, <publisher-loc>Munich, Germany</publisher-loc>, <year>Sep. 8&#x2013;14, 2018</year>, pp. <fpage>3</fpage>&#x2013;<lpage>19</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-030-01264-9_1</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V. D.</given-names> <surname>Bortoli</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Thornton</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Heng</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Doucet</surname></string-name></person-group>, &#x201C;<article-title>Diffusion Schr&#x00F6;dinger bridge with applications to score-based generative modeling</article-title>,&#x201D; in <conf-name>Adv. Neural Inf. Process. Syst.</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, <year>Dec. 6&#x2013;14, 2021</year>, pp. <fpage>17695</fpage>&#x2013;<lpage>17709</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2106.01357</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B. W.</given-names> <surname>Zhang</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>StyleSwin: Transformer-based GAN for high-resolution image generation</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>New Orleans, LA, USA</publisher-loc>, <year>Jun. 19&#x2013;24, 2022</year>, pp. <fpage>11304</fpage>&#x2013;<lpage>11314</lpage>. doi: <pub-id pub-id-type="doi">10.1109/cvpr52688.2022.01102</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. Y.</given-names> <surname>Gu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Vector quantized diffusion model for text-to-image synthesis</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>New Orleans, LA, USA</publisher-loc>, <year>Jun. 19&#x2013;24, 2022</year>, pp. <fpage>10696</fpage>&#x2013;<lpage>10706</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01043</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Nichol</surname></string-name></person-group>, &#x201C;<article-title>Diffusion models beat GANs on image synthesis</article-title>,&#x201D; in <conf-name>Adv. Neural Inf. Process. Syst.</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, <year>Dec. 6&#x2013;12, 2021</year>, pp. <fpage>8780</fpage>&#x2013;<lpage>8794</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2105.05233</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. T.</given-names> <surname>Kang</surname></string-name></person-group>, &#x201C;<article-title>Image inpainting model based on gated deformable convolution and hierarchical transformer and its application</article-title>,&#x201D; <publisher-name>Donghua Univ., Shanghai</publisher-name>, <publisher-loc>China</publisher-loc>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.27012/d.cnki.gdhuu.2022.001329</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J. H.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>J. M.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>X. H.</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Lu</surname></string-name> and <string-name><given-names>T. S.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Generative image inpainting with contextual attention</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, <year>Jun. 18&#x2013;22, 2018</year>, pp. <fpage>5505</fpage>&#x2013;<lpage>5514</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1801.07892</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>J. Y.</given-names> <surname>Deng</surname></string-name>, and <string-name><given-names>H. M.</given-names> <surname>Shang</surname></string-name></person-group>, &#x201C;<article-title>A mural restoration method combining global consistency and local continuity</article-title>,&#x201D; (in Chinese), <source>J. Hunan Univ. (Natural Sci. Ed.)</source>, vol. <volume>49</volume>, no. <issue>6</issue>, pp. <fpage>135</fpage>&#x2013;<lpage>145</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.16339/j.cnki.hdxbzkb.2022292</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H. Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Z. U.</given-names> <surname>Wan</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Y. B.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>X. T.</given-names> <surname>Han</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Liao</surname></string-name></person-group>, &#x201C;<article-title>PD-GAN: Probabilistic diverse GAN for image inpainting</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recog.</conf-name>, <publisher-loc>TN, USA</publisher-loc>, <year>Jun. 19&#x2013;24, 2021</year>, pp. <fpage>9367</fpage>&#x2013;<lpage>9376</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.00925</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>