<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">56079</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2024.056079</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Explicitly Color-Inspired Neural Style Transfer Using Patchified AdaIN</article-title>
<alt-title alt-title-type="left-running-head">Explicitly Color-Inspired Neural Style Transfer Using Patchified AdaIN</alt-title>
<alt-title alt-title-type="right-running-head">Explicitly Color-Inspired Neural Style Transfer Using Patchified AdaIN</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Kim</surname><given-names>Bumsoo</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Shin</surname><given-names>Wonseop</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Jung</surname><given-names>Yonghoon</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Park</surname><given-names>Youngsup</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Seo</surname><given-names>Sanghyun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-4">4</xref><email>sanghyun@cau.ac.kr</email></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Applied Art and Technology, Chung-Ang University</institution>, <addr-line>Anseong, 17546</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Advanced Imaging Science Multimedia &#x0026; Film, Chung-Ang University</institution>, <addr-line>Seoul, 06974</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff-3"><label>3</label><institution>Innosimulation Co., Ltd., Gangseo-gu</institution>, <addr-line>07794</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff-4"><label>4</label><institution>School of Art and Technology, Chung-Ang University</institution>, <addr-line>Anseong, 17546</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Sanghyun Seo. Email: <email>sanghyun@cau.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>31</day>
<month>10</month>
<year>2024</year>
</pub-date>
<volume>141</volume>
<issue>3</issue>
<fpage>2143</fpage>
<lpage>2164</lpage>
<history>
<date date-type="received">
<day>13</day>
<month>7</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>9</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 The Authors.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_56079.pdf"></self-uri>
<abstract>
<p>Arbitrary style transfer aims to perceptually reflect the style of a reference image in artistic creations with visual aesthetics. Traditional style transfer models, particularly those using adaptive instance normalization (AdaIN) layers, rely on global statistics, which often fail to capture the spatially local color distribution, leading to outputs that lack variation even when the inputs are geometrically transformed. To address this, we introduce Patchified AdaIN, a color-inspired style transfer method that applies AdaIN to localized patches, utilizing local statistics to capture the spatial color distribution of the reference image. This approach enables enhanced color awareness in style transfer, adapting dynamically to geometric transformations by leveraging local image statistics. Since Patchified AdaIN builds on AdaIN, it integrates seamlessly into existing frameworks without the need for additional training, allowing users to control the output quality through adjustable blending parameters. Our comprehensive experiments demonstrate that Patchified AdaIN can reflect geometric transformations (e.g., translation, rotation, flipping) of images for style transfer, thereby achieving superior results compared to state-of-the-art methods. Additional experiments show that Patchified AdaIN can be integrated into existing networks to enable spatial color-aware arbitrary style transfer by replacing the conventional AdaIN layer with the Patchified AdaIN layer.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Neural style transfer</kwd>
<kwd>image synthesis</kwd>
<kwd>image stylization</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Research Foundation of Korea</funding-source>
<award-id>2022R1A2C1004657</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Ministry of Culture Sports and Tourism</funding-source>
<award-id>RS-2024-00352578</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Arbitrary style transfer (AST) [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>] changes the appearance of a content image by referencing an external style image, considering characteristics such as the painting/rendering style, brush strokes, patterns, and colorization. AST has enabled remarkable and attractive algorithms [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>] for various computer vision [<xref ref-type="bibr" rid="ref-9">9</xref>&#x2013;<xref ref-type="bibr" rid="ref-11">11</xref>] and graphics applications [<xref ref-type="bibr" rid="ref-12">12</xref>&#x2013;<xref ref-type="bibr" rid="ref-16">16</xref>] owing to its unique aesthetic effects [<xref ref-type="bibr" rid="ref-17">17</xref>&#x2013;<xref ref-type="bibr" rid="ref-21">21</xref>]. AST was pioneered in [<xref ref-type="bibr" rid="ref-22">22</xref>] with convolutional neural networks [<xref ref-type="bibr" rid="ref-23">23</xref>] pretrained on a large image dataset [<xref ref-type="bibr" rid="ref-24">24</xref>] for an upstream task (e.g., image reconstruction or recognition) and a mathematical or statistical model that leverages deep features in every convolutional layer. This method performs style transfer by iteratively minimizing a designated objective. Although fast patch-based optimization has been proposed [<xref ref-type="bibr" rid="ref-25">25</xref>], it still requires optimization for every stylized image and therefore remains time-consuming. In contrast, adaptive instance normalization (AdaIN) [<xref ref-type="bibr" rid="ref-26">26</xref>] is a simple feedforward method. It performs AST without separate optimization by realigning the statistics (e.g., mean and standard deviation) of deep features from the content image to those of deep features extracted from the style (reference) image in a latent space. AdaIN achieves fast AST inference with low memory resources. 
Thus, it has been enhanced or specialized for various AST tasks [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>] such as domain enhanced style transfer [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>], fast inference [<xref ref-type="bibr" rid="ref-31">31</xref>] with low computational cost [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>&#x2013;<xref ref-type="bibr" rid="ref-35">35</xref>], three-dimensional scene style transfer [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-36">36</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>], video-level style transfer [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-38">38</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>], shape- or pattern-aware style transfer [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-40">40</xref>,<xref ref-type="bibr" rid="ref-41">41</xref>], and high visual fidelity [<xref ref-type="bibr" rid="ref-42">42</xref>].</p>
<p>As a result of pioneering studies [<xref ref-type="bibr" rid="ref-22">22</xref>,<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-43">43</xref>,<xref ref-type="bibr" rid="ref-44">44</xref>], AST has been rapidly specialized for several computer vision and graphics applications [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-38">38</xref>]. Despite these advancements, AST remains challenging, especially when a user tries to steer the result for a given input image toward a desired appearance. Most existing AST methods fail to reflect geometric transformations of the input images in the regional color distribution of the output. This is because AdaIN [<xref ref-type="bibr" rid="ref-26">26</xref>] manipulates global statistics, namely the mean (or bias factor) and standard deviation (or scaling factor) [<xref ref-type="bibr" rid="ref-41">41</xref>]. Thus, available methods often ignore geometric changes in the input image. At the service level, this limitation can undermine the user experience because perceptually similar results are obtained regardless of appearance changes in the input image. For example, if a user wants to control the color distribution of the output by rotating the input style image, the AST result is the same as the initial result, which hinders the user from finding the desired output. Without color awareness in AST, users always face the same results even as they geometrically change the input images in search of the best output. This issue significantly limits the deployment of well-designed AI technology in real-world services and applications. From this point of view, although satisfactory visual quality has been achieved in previous studies, this limitation is important and needs to be addressed. 
To address it, additional methods, e.g., attention modules [<xref ref-type="bibr" rid="ref-45">45</xref>,<xref ref-type="bibr" rid="ref-46">46</xref>], are required. Therefore, we tackle this color-awareness issue in AST in this paper.</p>
<p>To address the abovementioned limitation, we propose a patchified method<xref ref-type="fn" rid="fn1"><sup>1</sup></xref><fn id="fn1"><label>1</label><p>Patch-based AST (StyleSwap) [<xref ref-type="bibr" rid="ref-25">25</xref>] should be distinguished from our method. StyleSwap leverages patches to replace each content patch with the closest-matching style patch, establishing a style swap for fast optimization, as explained in <xref ref-type="sec" rid="s2_1">Section 2.1</xref>. In contrast, the proposed method relies on patches to preserve the regional information of the color distribution by applying AdaIN to each content/style patch, showing spatial correspondence during inference (<xref ref-type="sec" rid="s4_2">Section 4.2</xref>).</p></fn> for color-aware style transfer based on AdaIN, or Patchified AdaIN for short. It explicitly divides the features into small patches and performs AdaIN on each patch in a latent space to capture the regional color distribution. Hence, spatial information can be preserved after AdaIN (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>). As Patchified AdaIN has no learnable parameters, it can be easily integrated into state-of-the-art (SOTA) models without additional training or finetuning while providing color-aware AST. Additionally, the patchified operation can be modified according to patch levels and types to adjust and generate the desired output image. Moreover, a code blending scheme provides a control factor to determine the naturalness of global-local color weights during inference. The key contributions of the proposed method can be summarized as follows:</p>
<p><list list-type="order">
<list-item>
<p>We introduce a patch-based AdaIN operation, dubbed Patchified AdaIN, that explicitly divides the content and style features into patches in a latent space and then applies AdaIN to every patch. After Patchified AdaIN, the computed output is decoded using spatial recombination. Hence, the proposed method can preserve spatial color information, establishing the first color-aware AST method that can be embedded in SOTA models.</p></list-item>
<list-item>
<p>Patchified AdaIN has no learnable parameters and can be easily integrated into existing AST models to add color-awareness without requiring finetuning.</p></list-item>
<list-item>
<p>Patchified AdaIN enhances the user experience of AST applications: users can obtain the desired results by geometrically transforming the input image and by adjusting parameters such as the patch level and type.</p></list-item>
</list></p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overview of our results with another network, given a geometrically transformed style image</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-1.tif"/>
</fig>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Arbitrary Style Transfer</title>
<p>Keeping pace with the development of deep neural networks, neural style transfer is an active research topic. It redefines the style of a content image by referencing a style image to achieve an artistic appearance, thus resembling artwork with an aesthetic appeal. AST emerged with an optimization method [<xref ref-type="bibr" rid="ref-22">22</xref>] based on pretrained networks from an upstream task. However, this method requires time-consuming iterative optimization for every style transfer image. To address this problem, various methods have been devised. In [<xref ref-type="bibr" rid="ref-44">44</xref>], perceptual losses are used for real-time style transfer. In [<xref ref-type="bibr" rid="ref-47">47</xref>], a hierarchical deep convolutional neural network is proposed for fast style transfer. In [<xref ref-type="bibr" rid="ref-25">25</xref>], patch-based style transfer for fast inference is introduced. In [<xref ref-type="bibr" rid="ref-48">48</xref>], generalized style transfer is achieved without compromising the visual quality with unseen styles. An unprecedented method is proposed in [<xref ref-type="bibr" rid="ref-26">26</xref>] to transfer the style using AdaIN in a single feedforward step for real-time style transfer. AdaIN enables fast and simple style transfer and has become a common and essential operation in style transfer. 
Hence, various studies have focused on solving other important problems such as view or video style transfer [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-38">38</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>,<xref ref-type="bibr" rid="ref-49">49</xref>,<xref ref-type="bibr" rid="ref-50">50</xref>], three-dimensional style transfer [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-36">36</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>], text-driven style transfer [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-51">51</xref>,<xref ref-type="bibr" rid="ref-52">52</xref>], domain generalization [<xref ref-type="bibr" rid="ref-28">28</xref>,<xref ref-type="bibr" rid="ref-53">53</xref>], and domain enhancement [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>]. More recently, various architectures, including GANs [<xref ref-type="bibr" rid="ref-54">54</xref>], attention modules [<xref ref-type="bibr" rid="ref-55">55</xref>], and CLIP [<xref ref-type="bibr" rid="ref-56">56</xref>], have been leveraged for controllability [<xref ref-type="bibr" rid="ref-57">57</xref>], fast style transfer [<xref ref-type="bibr" rid="ref-58">58</xref>], and applications [<xref ref-type="bibr" rid="ref-59">59</xref>].</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Something-Aware Style Transfer</title>
<p>Despite its high quality and applications, AST results may not be perceptually satisfactory in terms of user experience. To further improve AST, various methods implemented something-aware schemes to generate perceptually intuitive results. In [<xref ref-type="bibr" rid="ref-41">41</xref>], structure-aware style transfer with kernel prediction is proposed to enhance the structure style. In [<xref ref-type="bibr" rid="ref-42">42</xref>], visual fidelity is improved by using vector quantization with codebooks proposed in discrete image modeling [<xref ref-type="bibr" rid="ref-60">60</xref>,<xref ref-type="bibr" rid="ref-61">61</xref>]. For a relatively sparse shape or structure in an image, object shapes in an RGB (red-green-blue) image are considered with distance transform and patch matching modules for shape-aware style transfer. Structure guidance is used in [<xref ref-type="bibr" rid="ref-40">40</xref>] to focus on local regions. It enhances the content structure to avoid blurry or distorted alignment by AdaIN. Recently, a pattern-aware scheme to discover the sweet spot of local and global style representations with pattern repeatability has been developed [<xref ref-type="bibr" rid="ref-62">62</xref>]. Previous studies have demonstrated intuitive outcomes using something-aware strategies for AST. However, user experience faces a common limitation. Existing AST methods generally fail to deliver the intended outcome if the user transforms the input style image, while the output image is expected to reflect any changes in the input images. To explore this limitation, we perform a pre-analysis of image transformation in <xref ref-type="sec" rid="s3">Section 3</xref>.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Pre-Analysis</title>
<p>Consider a user operating a style transfer method in an exhibition hall as a representative AST example. The user might want to obtain various results from the same input image. However, this is difficult to achieve by manipulating the input images with an existing AST method. For instance, the user naturally expects the appearance of the resulting image to change when the style or content image is flipped or rotated. In practice, user experience should be integrated into service-level applications. Nonetheless, most existing methods cannot address the abovementioned limitation because they cannot structurally reflect geometric changes in the input image.</p>
<p>To demonstrate this limitation through a short analysis, we applied geometric transformations (e.g., flipping, rotation, translation) to an input image. The images generated by AST methods always looked similar regardless of the applied transformation, as illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. We first revisit the core component in style transfer. Simple sequential normalization and denormalization (i.e., conditional instance normalization) [<xref ref-type="bibr" rid="ref-26">26</xref>] is usually adopted as the main component for AST because representing style using statistics in the feature space is useful and effective. Given content image <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and style image <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the resulting image has the following formulation in a latent space:</p>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>I</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denote the mean and standard deviation, respectively. 
When we transform the input images with transformation matrix <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:math></inline-formula>, the transformed images can be expressed as <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. For simplicity, we denote <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and keep the content image untransformed. From <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, the image statistics of the transformed style image can be expressed as <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. To quantitatively compare the appearance of the output images according to the transformation of the style image, we calculated statistics of the output images with and without transformation (i.e., with the original style image) for 10,000 images.</p>
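<p>As an illustration of <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, the following minimal NumPy sketch realigns the per-channel statistics of a content feature map to those of a style feature map. The (C, H, W) array layout, the function name <monospace>adain</monospace>, the random inputs, and the small constant <monospace>eps</monospace> added for numerical stability are assumptions made for illustration and are not part of Eq. (1).</p>
<code language="python"><![CDATA[
```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Sketch of Eq. (1) for feature maps of shape (C, H, W).

    Statistics are computed per channel over the spatial axes.
    `eps` is an assumed stabilizer that does not appear in Eq. (1).
    """
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    sigma_c = content.std(axis=(1, 2), keepdims=True)
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    sigma_s = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content features, then rescale and shift with style statistics.
    return sigma_s * (content - mu_c) / (sigma_c + eps) + mu_s


# Hypothetical random features standing in for encoder outputs.
rng = np.random.default_rng(0)
content = rng.normal(size=(3, 8, 8))
style = rng.normal(loc=2.0, scale=3.0, size=(3, 8, 8))
out = adain(content, style)
```
]]></code>
<p>In this sketch, the per-channel mean and standard deviation of the output match those of the style input up to the effect of <monospace>eps</monospace>, while the spatial layout of the output follows the content input only.</p>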
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Limitation regarding image transformation. Regardless of the transformation applied to the style image, the results always look the same</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-2.tif"/>
</fig>
<p>As a short experiment on the effect of geometric transformations on AST, we applied rotation, translation, and flipping to style images. As listed in <xref ref-type="table" rid="table-1">Table 1</xref>, negligible differences in the statistics were obtained across the spatial transformations. More specifically, rotation provided at most a small difference compared with the original statistics, while translation and flipping showed no changes in the image statistics: the absolute differences of the mean and standard deviation were zero. Hence, feature statistics after AdaIN provided highly similar representations regardless of spatial changes. Although the style image was drastically transformed in appearance, the resulting image was almost the same after AST. This can be explained by the image statistics after transformation, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, being almost equal to those before transformation, <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. In other words, AdaIN globally aligns the image statistics without considering their spatial distribution and thus fails to distinguish spatial changes [<xref ref-type="bibr" rid="ref-41">41</xref>]. Considering the pre-analysis, we aimed to make the statistics <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> reflect spatial changes. In <xref ref-type="sec" rid="s4_2">Section 4.2</xref>, we explain how to expand global AdaIN to obtain variable local statistics according to spatial information.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison of image statistics between transformed and original style images. While the style images are transformed with each corresponding <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:math></inline-formula> (e.g., rotation, translation, flip), the content images are kept as the original images without transformation. Since the average <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi mathvariant="bold-italic">&#x03BC;</mml:mi></mml:math></inline-formula> and average <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi mathvariant="bold-italic">&#x03C3;</mml:mi></mml:math></inline-formula> show similar results for each RGB channel, we average the results over the three channels for simplicity</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Transformations</th>
<th>Avg. <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></th>
<th>Avg. <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula></th>
<th>Abs. mean dist.</th>
<th>Abs. std. dist.</th>
</tr>
</thead>
<tbody>
<tr>
<td>W/o transformation</td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td>Rotation <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:mi mathvariant="normal">&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:math></inline-formula></td>
<td><underline>103.87, 120.30, 134.22</underline></td>
<td><underline>11.50, 12.12, 12.70</underline></td>
<td><underline>0.44, 0.58, 0.67</underline></td>
<td><underline>0.42, 0.42, 0.43</underline></td>
</tr>
<tr>
<td>Rotation <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:mi mathvariant="normal">&#x03C0;</mml:mi></mml:mrow></mml:math></inline-formula></td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>0.00, 0.00, 0.00</td>
<td>0.00, 0.00, 0.00</td>
</tr>
<tr>
<td>Rotation random</td>
<td><underline>103.50, 119.90, 133.84</underline></td>
<td><underline>10.23, 10.57, 10.96</underline></td>
<td><underline>0.07, 0.18, 0.29</underline></td>
<td><underline>0.85, 1.14, 1.31</underline></td>
</tr>
<tr>
<td>Translation x, 0.5</td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>0.00, 0.00, 0.00</td>
<td>0.00, 0.00, 0.00</td>
</tr>
<tr>
<td>Translation x, random</td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>0.00, 0.00, 0.00</td>
<td>0.00, 0.00, 0.00</td>
</tr>
<tr>
<td>Flip x</td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>0.00, 0.00, 0.00</td>
<td>0.00, 0.00, 0.00</td>
</tr>
<tr>
<td>Flip y</td>
<td>103.43, 119.72, 133.55</td>
<td>11.08, 11.71, 12.27</td>
<td>0.00, 0.00, 0.00</td>
<td>0.00, 0.00, 0.00</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn><p>Note: <underline>Underlined</underline> values differ from the corresponding original (untransformed) values.</p></fn>
</table-wrap-foot>
</table-wrap>
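<p>The zero distances in <xref ref-type="table" rid="table-1">Table 1</xref> are what one would expect when a transformation only permutes pixel positions, because the per-channel mean and standard deviation do not depend on where a pixel is located. The following minimal NumPy sketch illustrates this on a random array used as a hypothetical stand-in for a style image. The helper name channel_stats and the wrap-around translation (np.roll) are assumptions made for illustration and are not the evaluation code behind the table.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def channel_stats(img):
    """Return the per-channel mean and standard deviation of an H x W x 3 image."""
    flat = img.reshape(-1, img.shape[-1]).astype(np.float64)
    return flat.mean(axis=0), flat.std(axis=0)


rng = np.random.default_rng(0)
# Hypothetical stand-in for a style image (random 8-bit RGB values).
style = rng.integers(0, 256, size=(64, 96, 3))

# Transformations that only permute pixel positions.
# Assumption: the translation wraps around the image border.
transforms = {
    "rotation pi": lambda x: np.rot90(x, k=2, axes=(0, 1)),
    "translation x (wrap-around)": lambda x: np.roll(x, x.shape[1] // 2, axis=1),
    "flip x": lambda x: x[:, ::-1],
    "flip y": lambda x: x[::-1],
}

mu0, sigma0 = channel_stats(style)
max_dist = {}
for name, fn in transforms.items():
    mu, sigma = channel_stats(fn(style))
    # Largest absolute deviation of either statistic over the three channels.
    max_dist[name] = max(np.abs(mu - mu0).max(), np.abs(sigma - sigma0).max())
    print(f"{name}: max abs. distance = {max_dist[name]:.2e}")
```
]]></preformat>
<p>Under these assumptions, every reported distance is zero up to floating-point rounding, so plain AdaIN receives the same style statistics before and after such transformations.</p>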
</sec>
<sec id="s4">
<label>4</label>
<title>Proposed Method</title>
<p>Simple qualitative (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>) and quantitative (<xref ref-type="table" rid="table-1">Table 1</xref>) analyses reveal a limitation of existing AST methods: the statistics of a transformed image should be explicitly allowed to change with the transformation. To this end, various methods have been proposed so that the transformed image, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, provides statistics different from those of the original image, <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Motivated by the findings in [<xref ref-type="bibr" rid="ref-63">63</xref>,<xref ref-type="bibr" rid="ref-64">64</xref>] and [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-62">62</xref>], we adopt an explicit method in which each patch spatially divided from the image takes part in the AdaIN operation individually. This is akin to using only a selected region in the normalization process, without considering the other regions of the image. We argue that this realizes color preservation simply yet effectively, with easy controllability, explainability, and compatibility with other AdaIN-based style transfer networks, and that it is more effective than incorporating computationally expensive approaches such as self-attention [<xref ref-type="bibr" rid="ref-64">64</xref>], contrastive learning [<xref ref-type="bibr" rid="ref-29">29</xref>], or deformable convolution [<xref ref-type="bibr" rid="ref-41">41</xref>]. This explicit strategy has been shown to be simple and effective for handling local regions [<xref ref-type="bibr" rid="ref-25">25</xref>,<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-62">62</xref>]. Our patch-based manipulation captures geometric transformations in the style image. We first divide the content and style images into patches. Then, AdaIN is applied to every patch in the latent space. The resulting latent patches are combined in the latent space into a single latent code of the original size. Finally, the combined latent code is decoded into the output image. Additionally, in the latent space, we use code blending to mitigate implausible results when spatially combining the patches.</p>

<sec id="s4_1">
<label>4.1</label>
<title>Pretraining Autoencoders</title>
<p>Before AST, the autoencoders must be trained to perform reconstruction. As in [<xref ref-type="bibr" rid="ref-26">26</xref>], we adopt VGG-19 [<xref ref-type="bibr" rid="ref-23">23</xref>] as encoder <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>E</mml:mi></mml:math></inline-formula>, which was pretrained on the MSCOCO dataset [<xref ref-type="bibr" rid="ref-24">24</xref>] and is used to process the style image, together with a learnable decoder <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>D</mml:mi></mml:math></inline-formula> trained with the existing content and style loss functions. Unlike the method in [<xref ref-type="bibr" rid="ref-26">26</xref>], we do not use a shared encoder for the content and style images, which avoids the quality degradation caused by mapping both images to the latent space with the same encoder [<xref ref-type="bibr" rid="ref-42">42</xref>].</p>
<p>As in [<xref ref-type="bibr" rid="ref-42">42</xref>], we first trained the content and style autoencoders for reconstruction on the MSCOCO [<xref ref-type="bibr" rid="ref-24">24</xref>] and WikiArt<xref ref-type="fn" rid="fn2"><sup>2</sup></xref><fn id="fn2"><label>2</label><p><ext-link ext-link-type="uri" xlink:href="https://www.wikiart.org/">https://www.wikiart.org/</ext-link> (accessed on 24 September 2024).</p></fn> datasets, respectively. For pretraining, each autoencoder was trained by reconstructing the input image as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mo>&#x2217;</mml:mo></mml:math></inline-formula> is <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>c</mml:mi></mml:math></inline-formula> for the content image and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>s</mml:mi></mml:math></inline-formula> for the style image. Each autoencoder uses the following reconstruction error:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>After pretraining, each encoder maps its corresponding image to the latent space while accounting for the characteristics of that image type. Of the encoder and decoder for each image, we freeze only the encoder and finetune the decoder, as detailed in <xref ref-type="sec" rid="s4_3">Section 4.3</xref>. During inference, all the parameters of the encoders and decoder are fixed; only the user-defined parameters can be varied.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Style Transfer via Patchified-AdaIN</title>
<p>We use the encoders for inference or finetuning by freezing the encoder weights. In detail, given input image pair <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></inline-formula> we generate style-transferred output image <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>. Considering a transformation, each input image <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> is transformed into <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mo>&#x2217;</mml:mo></mml:math></inline-formula> is <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>c</mml:mi></mml:math></inline-formula> or <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>s</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:math></inline-formula> represents the transformation (i.e., rotation <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>r</mml:mi></mml:math></inline-formula>, translations <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and 
<inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, flipping along the horizontal <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and vertical <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, or zooming <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>z</mml:mi></mml:math></inline-formula>) described in <xref ref-type="sec" rid="s5_1">Section 5.1</xref>. The image is mapped onto the latent space by the encoder as follows:</p>
<p><disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is a user-defined image transformation level, <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is the latent code (<inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of channels, and <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the height and width of the encoded features), and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the pretrained content and style encoders, respectively. Then, the latent code is spatially divided into patches as follows:</p>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>with <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> being the patching layer formulated as:<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mi 
mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mtext>CH</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>CH</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>axis</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>axis</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where CH is a chunk function. 
Hence, <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow></mml:math></inline-formula> divides latent code <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> into <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> patches, obtaining sets <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow></mml:math></inline-formula> of patches extracted from the content and style images, respectively. These sets can be represented as <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-60"><mml:math 
id="mml-ieqn-60"><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> where <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> are the <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>i</mml:mi></mml:math></inline-formula>-th content and style patches, respectively, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the dimensions of the divided patch calculated as <inline-formula id="ieqn-66"><mml:math 
id="mml-ieqn-66"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. We assume <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> throughout the paper unless stated otherwise. AdaIN is applied to each patch with the corresponding positional patch in the other image calculated as:</p>
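<p>The patching layer of Eq. (6) can be sketched in a few lines. The following is a minimal illustration that assumes NumPy arrays as stand-ins for the latent codes and uses np.split as the chunk function along the h-axis and then the w-axis; the function name patchify and the toy tensor sizes are hypothetical, and the latent height and width are assumed to be divisible by the numbers of patches.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def patchify(z, m_h, m_w):
    """Split a C x H x W latent code into m_h * m_w patches in row-major order."""
    _, h, w = z.shape
    if h % m_h or w % m_w:
        raise ValueError("H_f and W_f must be divisible by M_h and M_w")
    rows = np.split(z, m_h, axis=1)  # chunk along the h-axis
    # Chunk every row strip along the w-axis and flatten into one list.
    return [patch for row in rows for patch in np.split(row, m_w, axis=2)]


# Toy latent code with C_f = 2, H_f = 4, W_f = 6.
z = np.arange(2 * 4 * 6, dtype=np.float32).reshape(2, 4, 6)
patches = patchify(z, m_h=2, m_w=2)

# Each patch has size C_f x (H_f / M_h) x (W_f / M_w).
print(len(patches), patches[0].shape)  # prints: 4 (2, 2, 3)
```
]]></preformat>
<p>With this ordering, the patch with index i in the content set and the patch with the same index in the style set cover the same spatial position, which is what the position-wise pairing in the next step relies on.</p>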
<p><disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>I</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mrow><mml:mtext>patch</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>I</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>t</mml:mi></mml:math></inline-formula> is the AdaIN output for code blending and <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:math></inline-formula> is AdaIN output set <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> calculated by <inline-formula id="ieqn-72"><mml:math 
id="mml-ieqn-72"><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>I</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mrow><mml:mtext>patch</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. As illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, this operation can be formulated as:</p>
<p><disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>a</mml:mi><mml:mi>I</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mrow><mml:mtext>patch</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-73"><mml:math 
id="mml-ieqn-73"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D49E;</mml:mi></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of patches. After applying AdaIN in the latent space, all the patches are aggregated, and global feature <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>t</mml:mi></mml:math></inline-formula> and local features <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:math></inline-formula> are blended using user-defined parameter <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> as a global-local weight as follows:</p>
<p><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>z</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mrow><mml:mover><mml:mi>z</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the final feature including user-defined factor &#x03B1; that sets the tradeoff between the global and local features. 
<inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the de-patching layer that aggregates the input patch set while ensuring that <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula>. During inference, the user can set the content image as the original one, that is, <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, while transforming the style image. 
Finally, output image <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is obtained as <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>z</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
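<p>The steps of this section (patching, patch-wise AdaIN, de-patching, and code blending) can be sketched end to end as follows. This is a minimal NumPy illustration under stated assumptions rather than the implementation used in our experiments: random arrays stand in for the encoder outputs, the content and style codes are assumed to have the same shape, the constant EPS is an assumed stabilizer for the division, and all function names are hypothetical. The decoder is omitted.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np

EPS = 1e-5  # assumed numerical stabilizer; not specified in the text


def adain(c, s):
    """AdaIN on C x H x W features: align channel-wise mean/std of c to those of s."""
    mu_c = c.mean(axis=(1, 2), keepdims=True)
    mu_s = s.mean(axis=(1, 2), keepdims=True)
    sd_c = c.std(axis=(1, 2), keepdims=True)
    sd_s = s.std(axis=(1, 2), keepdims=True)
    return sd_s * (c - mu_c) / (sd_c + EPS) + mu_s


def patchify(z, m_h, m_w):
    """Patching layer: chunk along the h-axis, then along the w-axis."""
    return [p for row in np.split(z, m_h, axis=1) for p in np.split(row, m_w, axis=2)]


def depatchify(patches, m_h, m_w):
    """De-patching layer: inverse of patchify for row-major patch lists."""
    rows = [np.concatenate(patches[r * m_w:(r + 1) * m_w], axis=2) for r in range(m_h)]
    return np.concatenate(rows, axis=1)


def patchified_adain(z_c, z_s, m_h, m_w, alpha):
    """Blend patch-wise (local) and whole-code (global) AdaIN outputs as in Eq. (9)."""
    pairs = zip(patchify(z_c, m_h, m_w), patchify(z_s, m_h, m_w))
    local = [adain(c_i, s_i) for c_i, s_i in pairs]  # Eq. (8), position-wise pairs
    t = adain(z_c, z_s)                              # global AdaIN output
    return (1.0 - alpha) * depatchify(local, m_h, m_w) + alpha * t


rng = np.random.default_rng(0)
z_c = rng.normal(size=(8, 16, 16))                      # stand-in for the content code
z_s = rng.normal(loc=2.0, scale=3.0, size=(8, 16, 16))  # stand-in for the style code
z_hat = patchified_adain(z_c, z_s, m_h=2, m_w=2, alpha=0.0)
print(z_hat.shape)
```
]]></preformat>
<p>In this sketch, setting alpha to 1 recovers plain AdaIN on the whole latent code, whereas alpha equal to 0 uses only the patch statistics, so each output patch takes the channel-wise mean and standard deviation of the style patch at the same position.</p>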

<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Patchified AdaIN operation compared with the AdaIN operation</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-3.tif"/>
</fig>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Fine Tuning</title>
<p>The proposed Patchified AdaIN component can be easily used by replacing each AdaIN layer in an existing network without any other architecture modification or additional training or adaptation process. However, to enhance the regional quality of style transfer output, the decoder can be further tuned with an additional local loss using an existing loss function. To distinguish the loss functions, we refer to the existing AdaIN loss as the global loss throughout the paper. Similar to the global loss, our local loss is additionally computed from every patch of the content and style images. The local content loss is calculated as:</p>
<p><disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula>where we omit the summation index for simplicity. Note that <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:math></inline-formula> and the output of <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mrow><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow></mml:math></inline-formula> are sets including the patchified latent feature. 
In the local loss term for finetuning, we exclude code blending (i.e., <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> in Eq. (9)). Thus, <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mrow><mml:mover><mml:mi>z</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reduces to <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which does not consider the global features. Therefore, the local style loss can be calculated as:</p>
<p><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mo 
symmetric="true">&#x2016;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mspace linebreak="newline" /><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>j</mml:mi></mml:math></inline-formula>-th layer in <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>L</mml:mi></mml:math></inline-formula> is the depth of the layer from <inline-formula id="ieqn-93"><mml:math 
id="mml-ieqn-93"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in the encoder to be used as a loss. We calculate the content loss using relu4_1 and style loss as in [<xref ref-type="bibr" rid="ref-26">26</xref>]. We add the patch loss to the existing global loss [<xref ref-type="bibr" rid="ref-26">26</xref>]. Finally, the patch and global loss functions are formulated as follows:</p>
<p><disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi 
mathvariant="normal">&#x005F;</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is a style loss weight. We set every <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> to 10 as in [<xref ref-type="bibr" rid="ref-26">26</xref>]. 
<inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the same as the original loss functions of [<xref ref-type="bibr" rid="ref-26">26</xref>]. The complete loss, <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, is obtained by linearly combining the patch and global loss functions as follows:</p>
<p><disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is the patch loss scale. By changing <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, we can control the weight of the local spatial fidelity for finetuning. We set <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> to 0.9. For plausible visual fidelity without explicit control, <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> can be considered as a learnable parameter with an initial value of 1.0.</p>
<p>Additionally, we set <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to 2 and obtain each patch by dividing the features, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. Note that the user can control these factors (e.g., patch division strategy, number of patches), as analyzed in <xref ref-type="sec" rid="s5_5">Section 5.5</xref>. Overall, the patch loss applies the AdaIN loss to the local features <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to capture local statistics. Then, together with the existing AdaIN loss [<xref ref-type="bibr" rid="ref-26">26</xref>], we weight the patch loss to set the contribution of the local style features to every patch. This affects not only the backpropagation of the cost function during training but also the inference of the resulting image with AST, as analyzed in <xref ref-type="sec" rid="s5_4">Section 5.4</xref>.</p>
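<p>The loss terms in Eqs. (11) to (13) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a single encoder layer and a single-channel feature map stored as nested lists, with the output and style maps of equal size and with height and width divisible by the patch counts, so that each norm reduces to an absolute difference. The helper names (<monospace>mean_std</monospace>, <monospace>split_patches</monospace>, <monospace>patch_style_loss</monospace>, <monospace>total_loss</monospace>) and the stabilizing constant <monospace>eps</monospace> are hypothetical.</p>
<code language="python">
```python
import math


def mean_std(values, eps=1e-5):
    """Mean and eps-stabilised standard deviation of a flat list."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var + eps)


def split_patches(feature, m_h, m_w):
    """Split a 2D map (list of rows) into m_h * m_w flat patches, row-major."""
    h, w = len(feature), len(feature[0])
    ph, pw = h // m_h, w // m_w  # assumption: h and w are divisible
    patches = []
    for i in range(m_h):
        for j in range(m_w):
            rows = feature[i * ph:(i + 1) * ph]
            patches.append([v for row in rows for v in row[j * pw:(j + 1) * pw]])
    return patches


def patch_style_loss(output_feat, style_feat, m_h=2, m_w=2):
    """Eq. (11) for one layer and one channel: per-patch mean and std gaps."""
    loss = 0.0
    for out_p, sty_p in zip(split_patches(output_feat, m_h, m_w),
                            split_patches(style_feat, m_h, m_w)):
        mu_o, sd_o = mean_std(out_p)
        mu_s, sd_s = mean_std(sty_p)
        loss += abs(mu_o - mu_s) + abs(sd_o - sd_s)
    return loss


def total_loss(patch_content, patch_style, global_content, global_style,
               lam_patch=10.0, lam_global=10.0, gamma=0.9):
    """Eqs. (12) and (13): combine the patch and global losses with gamma."""
    l_patch = patch_content + lam_patch * patch_style
    l_global = global_content + lam_global * global_style
    return gamma * l_patch + (1.0 - gamma) * l_global
```
</code>
<p>In this sketch, a map and its horizontally mirrored copy have identical global statistics but a nonzero <monospace>patch_style_loss</monospace>, which is the kind of regional mismatch the patch term is meant to penalize.</p>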
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Proposed Patchified AdaIN mechanism in the inference step</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-4.tif"/>
</fig>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Network Structure</title>
<p>We adopt the AdaIN network in [<xref ref-type="bibr" rid="ref-26">26</xref>] as the baseline AST network, but any other network relying on AdaIN can be used in our scheme by simply replacing AdaIN with the proposed Patchified AdaIN. For the encoder-decoder architecture, we use an encoder f consisting of nine convolution layers with 3 &#x00D7; 3 filters and up to 512 channels, based on [<xref ref-type="bibr" rid="ref-26">26</xref>], and a decoder g with a symmetric architecture. The output image, <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, is calculated as <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> for inference and <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mrow><mml:mover><mml:mi>I</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="double-struck">A</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> for finetuning.</p>
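<p>The latent code that is passed to the decoder can be sketched in a few lines of Python. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper's code: features are single-channel 2D maps stored as nested lists, the content and style maps have the same size and are divisible by the patch counts, and the global term t is taken to be the ordinary (global) AdaIN code. The function names <monospace>adain</monospace> and <monospace>patchified_adain</monospace> and the constant <monospace>eps</monospace> are hypothetical, and the decoder itself is omitted.</p>
<code language="python">
```python
import math


def adain(content, style, eps=1e-5):
    """AdaIN on flat lists: re-normalise content to the style mean and std."""
    mu_c = sum(content) / len(content)
    mu_s = sum(style) / len(style)
    sd_c = math.sqrt(sum((v - mu_c) ** 2 for v in content) / len(content) + eps)
    sd_s = math.sqrt(sum((v - mu_s) ** 2 for v in style) / len(style) + eps)
    return [sd_s * (v - mu_c) / sd_c + mu_s for v in content]


def patchified_adain(content, style, m_h=2, m_w=2, alpha=0.0):
    """Return (1 - alpha) * A(T) + alpha * t for single-channel 2D maps.

    A(T) applies AdaIN independently inside each grid cell; t is ASSUMED
    to be the global AdaIN code computed from the full maps.
    """
    h, w = len(content), len(content[0])
    ph, pw = h // m_h, w // m_w  # assumption: h and w are divisible
    flat_c = [v for row in content for v in row]
    flat_s = [v for row in style for v in row]
    global_code = adain(flat_c, flat_s)
    out = [[0.0] * w for _ in range(h)]
    for i in range(m_h):
        for j in range(m_w):
            cells = [(r, c) for r in range(i * ph, (i + 1) * ph)
                     for c in range(j * pw, (j + 1) * pw)]
            local = adain([content[r][c] for r, c in cells],
                          [style[r][c] for r, c in cells])
            for (r, c), v in zip(cells, local):
                out[r][c] = (1.0 - alpha) * v + alpha * global_code[r * w + c]
    return out
```
</code>
<p>With <monospace>alpha=0.0</monospace>, each output patch takes on the mean of the corresponding style patch, matching the finetuning setting; with <monospace>alpha=1.0</monospace>, the sketch falls back to global AdaIN and every region shares the global style mean.</p>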
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiments</title>
<p>We demonstrate the superiority of our method through comparisons with other models in terms of color-aware (i.e., geometric-transformation-aware) style transfer, and evaluate the influence of the number and types of patches on AST. First, we present the visual superiority of our proposal when transforming the style image in <xref ref-type="sec" rid="s5_1">Section 5.1</xref>. Then, we measure the color distributions before and after geometric transformations to show that our method visually reflects image transformations for AST in <xref ref-type="sec" rid="s5_2">Section 5.2</xref>. For regional fidelity, we demonstrate finetuning results in <xref ref-type="sec" rid="s5_4">Section 5.4</xref>. To evaluate user experience, we design a runtime control strategy for patching in <xref ref-type="sec" rid="s5_5">Section 5.5</xref>. For practical application, we compare the inference time of our method with that of existing baselines in <xref ref-type="sec" rid="s5_8">Section 5.8</xref>. Finally, we modify SOTA AST models using our Patchified AdaIN to integrate transformation-awareness, thereby demonstrating the extensibility and applicability of our proposal in <xref ref-type="sec" rid="s5_7">Section 5.7</xref>. For training, we first pretrain the content and style autoencoders separately, without the patchization layer. Then, the decoder of the style autoencoder is further tuned based on <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>.</p>
<sec id="s5_1">
<label>5.1</label>
<title>Qualitative Evaluation under Geometric Transformations</title>
<p>To evaluate the capabilities of spatial-transformation-aware AST, we chose various geometric transformations that notably change the regional color distribution. We applied four geometric transformations: rotation by an angle <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>r</mml:mi></mml:math></inline-formula>, translation along the <italic>x</italic> axis, <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the <italic>y</italic> axis, <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, flipping along the vertical, <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and horizontal, <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, axes, and zooming, <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>z</mml:mi></mml:math></inline-formula>. To prevent the global statistics of the image from changing excessively after a geometric transformation, we filled the resulting empty areas with synthetic pixels, for example, by reflecting symmetric tiles after a rotation. 
Four cases were evaluated, with the transformation being given by <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mrow><mml:mi mathvariant="bold-script">J</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>r</mml:mi></mml:math></inline-formula> is a clockwise rotation angle from <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C0;</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>&#x03C0;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are translation factors normalized from &#x2212;1 to 1, <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></inline-formula> are flipping indicators being true or false, and <inline-formula id="ieqn-124"><mml:math 
id="mml-ieqn-124"><mml:mi>z</mml:mi></mml:math></inline-formula> is a scaling and cropping factor from 0 to 0.5 with respect to the original image resolution. Examples of transformations are shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
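<p>The transformation tuple and its effect on regional statistics can be sketched as follows. This is a hypothetical illustration, not the evaluation code: the <monospace>Transform</monospace> container, the helpers <monospace>apply_flips</monospace> and <monospace>quadrant_means</monospace>, and the convention that the vertical flip reverses the row order are all assumptions, and only the flip components are applied to a single-channel image stored as nested lists.</p>
<code language="python">
```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Transform:
    """Hypothetical container for J = (r, t_x, t_y, f_v, f_h, z)."""
    r: float = 0.0      # clockwise rotation angle in [-pi, pi]
    t_x: float = 0.0    # normalised x translation in [-1, 1]
    t_y: float = 0.0    # normalised y translation in [-1, 1]
    f_v: bool = False   # vertical flip indicator
    f_h: bool = False   # horizontal flip indicator
    z: float = 0.0      # scaling-and-cropping factor in [0, 0.5]

    def __post_init__(self):
        # Reject values outside the ranges stated in the text.
        if abs(self.r) > math.pi:
            raise ValueError("r must lie in [-pi, pi]")
        if abs(self.t_x) > 1.0 or abs(self.t_y) > 1.0:
            raise ValueError("t_x and t_y must lie in [-1, 1]")
        if self.z > 0.5 or 0.0 > self.z:
            raise ValueError("z must lie in [0, 0.5]")


def apply_flips(image, t):
    """Apply only the flip part of t to a 2D image (list of rows)."""
    rows = image[::-1] if t.f_v else image  # assumed: f_v reverses rows
    return [row[::-1] if t.f_h else list(row) for row in rows]


def quadrant_means(image):
    """Mean intensity of the four cells of a 2 x 2 grid, row-major order."""
    h, w = len(image), len(image[0])
    means = []
    for rs in (range(0, h // 2), range(h // 2, h)):
        for cs in (range(0, w // 2), range(w // 2, w)):
            vals = [image[r][c] for r in rs for c in cs]
            means.append(sum(vals) / len(vals))
    return means
```
</code>
<p>For a toy image whose quadrant means are 0, 9, 3, and 6, a horizontal flip reorders them to 9, 0, 6, and 3 while the global mean stays at 4.5; a method that relies only on global statistics cannot distinguish the two style images.</p>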
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Geometric transformation cases</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-5.tif"/>
</fig>
<p>By applying the predefined transformations, we performed AST using an original content image and a transformed style image. For qualitative comparison, we selected AdaIN [<xref ref-type="bibr" rid="ref-26">26</xref>] as the baseline as well as LinearStyleTransfer (LST) [<xref ref-type="bibr" rid="ref-31">31</xref>], style-attentional network (SANet) [<xref ref-type="bibr" rid="ref-65">65</xref>], IEContraAST (IEST) [<xref ref-type="bibr" rid="ref-66">66</xref>], adaptive attention normalization (AdaAttN) [<xref ref-type="bibr" rid="ref-67">67</xref>], adaptive convolutions (AdaConv) [<xref ref-type="bibr" rid="ref-41">41</xref>], and contrastive coherence preserving loss (CCPL) [<xref ref-type="bibr" rid="ref-30">30</xref>] as comparison networks. Because we focused on AST, we did not evaluate domain-enhanced (i.e., domain-specific) style transfer [<xref ref-type="bibr" rid="ref-29">29</xref>] or visual fidelity (whose results resemble image-to-image translation) [<xref ref-type="bibr" rid="ref-42">42</xref>]. As shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, our method reflected the image transformations with varying color distributions in the output images. As expected from the pre-analysis (<xref ref-type="sec" rid="s3">Section 3</xref>), no existing method could reflect the appearance change in the transformed style image, and each provided very similar results regardless of the applied transformation. In contrast, the proposed method provided different results that reflected the transformations. For instance, in the first row of <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, our method generated an output image that locally reflected the spatial color distribution of the sky region (blue) in the style image. Similarly, in the fourth row, the teddy bear showed varying red shades depending on the style image transformations, reflecting the variability provided by the proposed Patchified AdaIN.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Qualitative comparison using our pre-defined geometric transformation cases</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-6.tif"/>
</fig>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Quantitative Evaluation of Color-Awareness</title>
<p>The proposed Patchified AdaIN applies AdaIN to every spatially divided patch in the latent space. Nevertheless, it may generate images with an unsatisfactory appearance because the normalized latent code may be misaligned during decoding. To quantitatively evaluate this aspect, we measured the spatial color distribution of the output image under the same patch division and used the same models as in <xref ref-type="sec" rid="s5_1">Section 5.1</xref> for comparison. The proposed method was finetuned using <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>, and we report results for <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> in Eq. (9). We adopt two metrics: (a) spatial statistics (mean and variance) and (b) the Jensen-Shannon divergence (JSD) [<xref ref-type="bibr" rid="ref-68">68</xref>]. For each model, we used 1000 samples, comparing the input style image with each output image. The statistical evaluation allows us to intuitively compare the distributions of pixel values. JSD measures the difference between two distributions; here, each image serves as a distribution. Because a recent study [<xref ref-type="bibr" rid="ref-68">68</xref>] used JSD to measure the difference in color distribution between two RGB images, we also adopt the JSD metric in our quantitative evaluation. We used the same images in the statistical and JSD evaluations.</p>
<p>The quantitative results in <xref ref-type="fig" rid="fig-7">Fig. 7</xref> show that our method provided the smallest distance for the mean, indicating that, among the evaluated models, our style transfer results had the RGB mean closest to that of the style image. On the other hand, the distance for the standard deviation was similar across models. Consistent with the statistical evaluation results (<xref ref-type="fig" rid="fig-7">Fig. 7</xref>), the JSD evaluation showed that our method with &#x03B1; = 0.0 achieved the lowest JSD in every color channel (average of 0.084), as shown in <xref ref-type="table" rid="table-2">Table 2</xref>. To determine whether these style statistics affected AST, we performed an evaluation considering human perception, as reported in <xref ref-type="sec" rid="s5_3">Section 5.3</xref>.</p>
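<p>For reference, a per-channel JSD of the kind reported in Table 2 can be computed as in the following minimal sketch. The histogram binning, the base-2 logarithm (which bounds the JSD in [0, 1]), and the function names <monospace>histogram</monospace>, <monospace>kl</monospace>, <monospace>jsd</monospace>, and <monospace>average_rgb_jsd</monospace> are assumptions made for illustration; the exact protocol of [<xref ref-type="bibr" rid="ref-68">68</xref>] may differ.</p>
<code language="python">
```python
import math


def histogram(values, bins=8, lo=0, hi=256):
    """Normalised histogram of intensities in the half-open range [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def average_rgb_jsd(pixels_a, pixels_b, bins=8):
    """Mean of the per-channel JSDs for two lists of (r, g, b) tuples."""
    per_channel = [
        jsd(histogram([p[ch] for p in pixels_a], bins),
            histogram([p[ch] for p in pixels_b], bins))
        for ch in range(3)
    ]
    return sum(per_channel) / 3
```
</code>
<p>Identical images yield a JSD of 0 and images with disjoint histograms yield 1, so lower values indicate that the output follows the color distribution of the style image more closely.</p>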
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Quantitative evaluation results with statistical metric</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-7.tif"/>
</fig><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Quantitative evaluation results with JS-Divergence metric</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Models</th>
<th><inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mrow><mml:mtext>Our</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>s</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mrow><mml:mtext>Our</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mtext>s</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>LST31</th>
<th>LST41</th>
<th>SANet</th>
<th>IEST</th>
<th>AdaAttN</th>
<th>AdaConv</th>
<th>CCPL</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>J</mml:mi><mml:mi>S</mml:mi><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></td>
<td>0.077</td>
<td>0.105</td>
<td>0.103</td>
<td>0.102</td>
<td>0.096</td>
<td>0.398</td>
<td>0.094</td>
<td>0.293</td>
<td>0.449</td>
</tr>
<tr>
<td><inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>J</mml:mi><mml:mi>S</mml:mi><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></td>
<td>0.095</td>
<td>0.120</td>
<td>0.112</td>
<td>0.112</td>
<td>0.102</td>
<td>0.422</td>
<td>0.098</td>
<td>0.308</td>
<td>0.493</td>
</tr>
<tr>
<td><inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>J</mml:mi><mml:mi>S</mml:mi><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></td>
<td>0.080</td>
<td>0.111</td>
<td>0.108</td>
<td>0.109</td>
<td>0.102</td>
<td>0.429</td>
<td>0.097</td>
<td>0.302</td>
<td>0.487</td>
</tr>
<tr>
<td><inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mo>.</mml:mo><mml:mi>J</mml:mi><mml:mi>S</mml:mi><mml:mi>D</mml:mi><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></td>
<td><underline>0.084</underline></td>
<td>0.111</td>
<td>0.107</td>
<td>0.107</td>
<td>0.099</td>
<td>0.416</td>
<td>0.096</td>
<td>0.301</td>
<td>0.476</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Note: <underline>Underlined</underline> values indicate the best model, and <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula> indicates that lower values are better.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>User Study</title>
<p>To evaluate human perception of color-aware AST, we conducted a user study considering style transfer outputs before and after transformation. We evaluated seven content images and two style images before and after applying transformations. We then conducted a seven-question survey considering random pairs of content and style images. Each question involved 14 images covering the outputs of the proposed and six comparison AST methods, and we obtained 98 responses from 30 participants. We asked which output image better reflected the regional color distribution of the style image after transformation. The 30 participants included a junior artist, students majoring in art or computer science, graphics/vision researchers, and an immersive content creator. Every participant selected the best image among the outputs of all the methods considering the original and transformed style images for the same content image. In the user study, the proposed method achieved the best performance in terms of regional color distribution, as shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. Almost half of the participants (45.71%) selected the proposed method as the best one among the evaluated methods. Remarkably, our proposal received a dominant preference (&#x003E;70%) for rotation and flipping transformations. On the other hand, for zooming with cropping, all the methods received similar proportions of preference. This was because the color distribution of the zoomed style images was already distorted before AST, as illustrated in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. Consequently, the output images of all the methods changed regardless of their ability to preserve the color distribution, causing highly variable preferences among the participants and similar preferences across methods after zooming the style image.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Result of user study in terms of color-distribution preservation</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-8.tif"/>
</fig>
<p>Together with the evaluation results presented in <xref ref-type="sec" rid="s5_2">Section 5.2</xref>, these findings indicate that, for human perception of AST, the distance in standard deviation is less important than the distance in mean. In addition, the results in <xref ref-type="fig" rid="fig-7">Figs. 7</xref> and <xref ref-type="fig" rid="fig-8">8</xref> suggest that the proposed method achieves SOTA performance and that the distance in mean is a suitable measure of style similarity with respect to human perception.</p>
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Local Image Style Transfer by Fine-Tuning</title>
<p>In addition to the existing AdaIN [<xref ref-type="bibr" rid="ref-26">26</xref>] training loss, we introduce a local patch loss in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref> and combine the two in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>. We evaluated the advantage of finetuning for highlighting regional details in AST. We performed style transfer with two models: a pretrained encoder-decoder in which AdaIN was replaced with Patchified AdaIN without finetuning, whose weights were retrieved from an AdaIN PyTorch implementation [<xref ref-type="bibr" rid="ref-26">26</xref>], and a model finetuned with our global-local loss in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>. As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, structurally distorted results were obtained without finetuning (<xref ref-type="fig" rid="fig-9">Fig. 9b</xref>). Specifically, the boundaries between the cat and bathtub were blurry when using the pretrained model. In contrast, the finetuned model (<xref ref-type="fig" rid="fig-9">Fig. 9a</xref>) preserved more structural details of the content image. Hence, finetuning enhanced the output image details without notably compromising the content image. Although blurry results may be artistically preferred in some contexts, preserving structural details is often preferred for human perception.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Structural details with and without fine-tuning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-9.tif"/>
</fig>
</sec>
<sec id="s5_5">
<label>5.5</label>
<title>Runtime Control for Patchization</title>
<p>To further evaluate the user experience, we explored runtime control strategies based on two Patchified AdaIN parameters: the patch level and the patch type. These controls may enhance the user experience during inference without requiring additional training.</p>
<sec id="s5_5_1">
<label>5.5.1</label>
<title>Patch Level</title>
<p>By default, the features are spatially divided into 2 by 2 patches. As more patches may be useful in some cases, we evaluated the effects of increasing the number of patches from <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> (default) to <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>. <xref ref-type="fig" rid="fig-10">Fig. 10</xref> shows the corresponding results. When more patches were used, the structural details decreased for <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0</mml:mn></mml:math></inline-formula>. Hence, the apparent abstraction level of the output increased because the model tried to preserve the regional color distribution. Thus, the structure of the content image remained as edge information. On the other hand, the outputs reflecting global statistics (<inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mn>0.0</mml:mn></mml:math></inline-formula>) demonstrated a relatively clear and intuitive structure compared with patch-only AST, but the abstraction level of these results was highly dependent on the style image, i.e., they were unstable and inconsistent. 
Nonetheless, we concluded that the local-global model (i.e., <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>) suitably preserved the structure of the content image regardless of the number of patches. However, a deep patch level led to unclear information near the content image edges. In addition, perspective or stereoscopic information easily collapsed for <italic>&#x03B1;</italic> &#x003D; 0.0, and the resulting ambiguity likely gives a perceptually unpleasant impression. We found that a model with <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> divided by 2 or 3 with <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> provided appealing results in most cases. Dividing <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> by 4 or 5 is also possible, but these deep patch levels caused a global-local trade-off for <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>, impairing the user experience in terms of color-aware AST. Hence, we do not recommend using patch levels of 4 or 5. The patch level parameter does not require a fine-tuning step; the user can simply set it at inference time.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Stylized images for different patch levels. For the patch levels in (b), <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are both set to 2, 3, 4, and 5, respectively</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-10.tif"/>
</fig>
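<p>The patch-level control described in this subsection can be illustrated with a minimal single-channel sketch in plain Python. It is not the authors' PyTorch implementation. The function names (<monospace>adain</monospace>, <monospace>patch_bounds</monospace>, <monospace>flatten</monospace>, <monospace>patchified_adain</monospace>), the constant <monospace>EPS</monospace>, the integer-division patch boundaries, the pairing of content and style patches at the same grid position, and the linear blend between global and local results are all assumptions made for illustration. The blend is chosen only so that <italic>&#x03B1;</italic> &#x003D; 0.0 yields the patch-only output, as described above, and merely stands in for the code blending equation of the paper.</p>
<code language="python"><![CDATA[
```python
from statistics import fmean, pstdev

EPS = 1e-5  # small constant to avoid division by zero (assumed value)


def adain(content, style):
    """Align the mean and standard deviation of `content` values to `style`."""
    c_mean, c_std = fmean(content), pstdev(content) + EPS
    s_mean, s_std = fmean(style), pstdev(style) + EPS
    return [(v - c_mean) / c_std * s_std + s_mean for v in content]


def patch_bounds(size, parts):
    """Split range(size) into `parts` contiguous, nearly equal intervals."""
    return [(i * size // parts, (i + 1) * size // parts) for i in range(parts)]


def flatten(feat, rows, cols):
    """Collect the values of a rectangular region in row-major order."""
    return [feat[r][c] for r in range(*rows) for c in range(*cols)]


def patchified_adain(content, style, m_h=2, m_w=2, alpha=0.5):
    """Single-channel sketch: per-patch AdaIN blended with global AdaIN.

    alpha = 0.0 keeps only the patch-wise (local) result. The linear blend
    is an assumption, not the paper's exact formulation. Both maps must be
    at least m_h rows by m_w columns so that no patch is empty.
    """
    h, w = len(content), len(content[0])
    sh, sw = len(style), len(style[0])
    # Global AdaIN over the whole feature map (row-major order).
    g_flat = adain(
        flatten(content, (0, h), (0, w)),
        flatten(style, (0, sh), (0, sw)),
    )
    out = [[0.0] * w for _ in range(h)]
    # Local AdaIN: each content patch is matched to the style patch at the
    # same grid position, then written back in place (aggregation).
    for rows, s_rows in zip(patch_bounds(h, m_h), patch_bounds(sh, m_h)):
        for cols, s_cols in zip(patch_bounds(w, m_w), patch_bounds(sw, m_w)):
            local = iter(adain(flatten(content, rows, cols),
                               flatten(style, s_rows, s_cols)))
            for r in range(*rows):
                for c in range(*cols):
                    out[r][c] = alpha * g_flat[r * w + c] + (1 - alpha) * next(local)
    return out
```
]]></code>
<p>In this sketch, <monospace>alpha=0.0</monospace> gives each output patch the mean of the style patch at the same grid position, whereas <monospace>alpha=1.0</monospace> reduces to ordinary AdaIN over the whole map. Raising <monospace>m_h</monospace> and <monospace>m_w</monospace> from 2 to 5 corresponds to the patch levels compared in this subsection and needs no retraining.</p>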
</sec>
<sec id="s5_5_2">
<label>5.5.2</label>
<title>Patch Type</title>
<p>For patching in Patchified AdaIN, a 2 by 2 division was the default setting (<xref ref-type="fig" rid="fig-3">Fig. 3</xref>). However, different patch types can lead to different visual results. For evaluation, we applied three patch types, including the default one, while preserving <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, obtaining the results shown in <xref ref-type="fig" rid="fig-11">Fig. 11</xref>. In the feature space, we applied simple division methods to show how the output changes according to the patch type, as shown in <xref ref-type="fig" rid="fig-11">Fig. 11b</xref>. Type A was the most general method, while Types B and C corresponded to horizontal and vertical separation into patches, respectively. <xref ref-type="fig" rid="fig-11">Fig. 11</xref> shows that the patch types generated notably different output images. For instance, for Style 1, Type B provided a gloomy result, while Type C provided a bright, sunny impression. Hence, different impressions can be generated depending on the patch type for the same input image. Therefore, AST using the proposed method can provide a rich user experience: the user can flexibly select the runtime control parameters to achieve the desired style transfer results without additional training.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Stylized images for different patch types. Type A has <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">w</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Types B and C have <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">w</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">4</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">h</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">M</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">w</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="bold">4</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, respectively</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-11.tif"/>
</fig>
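<p>To make the three patch types concrete, the following sketch lists the patch rectangles produced by each grid. The grid values are taken from the caption of Fig. 11. The names <monospace>PATCH_TYPES</monospace> and <monospace>patch_grid</monospace>, the 8 by 8 feature-map size, and the integer-division boundaries are hypothetical choices for illustration only.</p>
<code language="python"><![CDATA[
```python
PATCH_TYPES = {  # (M_h, M_w) values taken from the caption of Fig. 11
    "A": (2, 2),
    "B": (4, 1),
    "C": (1, 4),
}


def patch_grid(h_f, w_f, m_h, m_w):
    """Return (row_start, row_end, col_start, col_end) for every patch."""
    rows = [(i * h_f // m_h, (i + 1) * h_f // m_h) for i in range(m_h)]
    cols = [(j * w_f // m_w, (j + 1) * w_f // m_w) for j in range(m_w)]
    return [(r0, r1, c0, c1) for r0, r1 in rows for c0, c1 in cols]


if __name__ == "__main__":
    for name, (m_h, m_w) in PATCH_TYPES.items():
        patches = patch_grid(8, 8, m_h, m_w)  # 8 x 8 feature map (assumed size)
        r0, r1, c0, c1 = patches[0]
        print(f"Type {name}: {len(patches)} patches of {r1 - r0} x {c1 - c0}")
```
]]></code>
<p>On the assumed 8 by 8 map, every type yields four patches: Type A gives 4 by 4 blocks, Type B gives full-width strips of height 2, and Type C gives full-height strips of width 2. The number of regional statistics is therefore the same across types, and only their spatial layout changes.</p>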
</sec>
</sec>
<sec id="s5_6">
<label>5.6</label>
<title>Inference Speed</title>
<p>To confirm the practicality of the proposed Patchified AdaIN, we measured the inference time at different patch levels. The results were obtained using a computer equipped with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We calculated the average inference time over 1000 iterations.</p>
<p>We performed AST for patch levels from 2 to 5, as in <xref ref-type="sec" rid="s5_5">Section 5.5</xref>. The baseline was the pretrained AdaIN model [<xref ref-type="bibr" rid="ref-26">26</xref>], and our method used the fine-tuned model in which AdaIN was replaced with Patchified AdaIN. For comparison, we normalized the AdaIN inference time to 1. As listed in <xref ref-type="table" rid="table-3">Table 3</xref>, our models had a negligible inference gap compared with the simpler baseline. Even the highest patch level increased the inference time only by a factor of 1.057, i.e., about 5.7% over the baseline. Hence, the proposed method incurs a small inference burden, suggesting that Patchified AdaIN can be integrated into SOTA models without notably compromising the inference speed.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Inference time relative to the AdaIN baseline according to the number of patches</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>Relative time</th>
</tr>
</thead>
<tbody>
<tr>
<td>AdaIN (baseline) [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>1.000</td>
</tr>
<tr>
<td>Ours <inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>1.005</td>
</tr>
<tr>
<td>Ours <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>3</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>1.008</td>
</tr>
<tr>
<td>Ours <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>4</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>1.031</td>
</tr>
<tr>
<td>Ours <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>1.057</td>
</tr>
</tbody>
</table>
</table-wrap>
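<p>The measurement protocol above (an average over many iterations, reported relative to the baseline) can be sketched as follows. This is a CPU-only illustration under stated assumptions and not the authors' benchmarking script: the function names <monospace>average_runtime</monospace> and <monospace>normalize_to_baseline</monospace>, the warm-up calls, and the stand-in workloads are hypothetical. On a GPU, kernels are generally launched asynchronously, so the device would additionally have to be synchronized before each clock reading.</p>
<code language="python"><![CDATA[
```python
import time


def average_runtime(fn, iterations=1000, warmup=10):
    """Average wall-clock seconds per call of `fn` over `iterations` runs."""
    for _ in range(warmup):  # warm-up calls are excluded from the average
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def normalize_to_baseline(times, baseline):
    """Divide every average time by the baseline's, so the baseline maps to 1."""
    ref = times[baseline]
    return {name: t / ref for name, t in times.items()}


if __name__ == "__main__":
    # Stand-in workloads (hypothetical); real use would call the two models.
    workloads = {
        "AdaIN": lambda: sum(range(1000)),
        "Patchified AdaIN": lambda: sum(range(1050)),
    }
    times = {name: average_runtime(fn) for name, fn in workloads.items()}
    for name, rel in normalize_to_baseline(times, "AdaIN").items():
        print(f"{name}: {rel:.3f}")
```
]]></code>
<p>Reporting ratios rather than absolute times makes the comparison less dependent on the specific hardware, which matches the normalized values listed in Table 3.</p>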
</sec>
<sec id="s5_7">
<label>5.7</label>
<title>Application to SOTA Models</title>
<p>As discussed in <xref ref-type="sec" rid="s5_6">Section 5.6</xref>, Patchified AdaIN can be used in other models through simple replacement without notably increasing the inference time. We selected SOTA AST models based on AdaIN and replaced AdaIN with the proposed Patchified AdaIN. We aimed to demonstrate the broad applicability of our method and to show that the AST performance is preserved while the regional color distribution is reflected. For evaluation, we selected CCPL [<xref ref-type="bibr" rid="ref-30">30</xref>] with <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0</mml:mn></mml:math></inline-formula> and LinearStyleTransfer [<xref ref-type="bibr" rid="ref-31">31</xref>] with <inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula> for code blending using <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>. Patchified AdaIN was integrated into the transformation or normalization block. As the evaluated models adopted different schemes for AST, we adapted Patchified AdaIN to their structures, establishing <italic>modified</italic> Patchified AdaIN implementations that preserved the core scheme of Patchified AdaIN, namely, explicitly dividing the features into patches and aggregating them to obtain the output image. As shown in <xref ref-type="fig" rid="fig-12">Fig. 12</xref>, the <italic>modified</italic> Patchified AdaIN endowed the SOTA models with the ability to perform spatially color-aware AST. The original SOTA models generated highly similar output images regardless of the transformed style image. After applying the <italic>modified</italic> Patchified AdaIN, the models provided different results reflecting the transformations, thus providing diverse appearances to the output images. 
These results demonstrate that our method is effective and easily applicable to other AST models without compromising performance. Nonetheless, the integration should be designed carefully when our method is applied to existing networks.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Extension to SOTA models using modified Patchified AdaIN. Our Patchified AdaIN can be easily applied to other style transfer networks to provide the color-aware capability</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-12.tif"/>
</fig>
</sec>
<sec id="s5_8">
<label>5.8</label>
<title>Generalizability and Failure Case</title>
<p>For real-world use, it is important that a method generalizes to a wide range of cases. To examine this generalizability, we conducted an additional experiment with images that have non-uniform and complex color distributions. We chose samples with high variance and performed style transfer with our model on those images. As shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>, the resulting stylized images showed few visual changes even though different transformations were applied. This is a limitation of our method, but it might be mitigated if the user provides style images with high contrast and vibrant colors. Meanwhile, our patchification strategy requires patch aggregation because of the spatial division in two-dimensional space. Therefore, the quality may degrade at the patch edges in the aggregation step. In particular, if the content image contains semantically important information at these edges, artifacts in that region could produce a visually inharmonious appearance. Thus, this potential quality degradation should be carefully considered at inference.</p>
<fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>Failure cases for non-uniform and intricate color distributions. The resulting images may not be easily distinguishable despite various transformations when the style images have a complex color distribution</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_56079-fig-13.tif"/>
</fig>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>We propose a simple but effective scheme to reflect or preserve the color distribution of images that undergo transformations in AST. We divide the latent features into patches and aggregate them after applying AdaIN to every patch. This method can enrich the user experience because Patchified AdaIN generates different output images according to the geometric transformations applied to the input images. Hence, the user can generate various desired results from the same input image pair. Additionally, we offer control over parameters to further increase the image variability and achieve the best results via iterative inference with different patch levels and types. A user study reveals that the output images of our method are preferred in terms of human perception. Moreover, Patchified AdaIN can be easily integrated into existing models, demonstrating its applicability for extending existing AST models to preserve the spatial color distribution of transformed images. Overall, the proposed method achieves the best performance in terms of color-aware AST, establishing a SOTA approach without notable computational overhead during training and inference compared with baseline methods. However, a limitation is that ambiguous outputs occur when the input image has a complex or non-uniform color distribution. End users can alleviate this issue by taking this limitation into account when using our algorithm in their applications.</p>
</sec>
</body>
<back>
<ack><p>This research was supported by the Chung-Ang University Graduate Research Scholarship in 2024.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This research work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A2C1004657, Contribution Rate: 50%) and Culture, Sports and Tourism R&#x0026;D Program through the Korea Creative Content Agency grant funded by Ministry of Culture Sports and Tourism in 2024 (Project Name: Developing Professionals for R&#x0026;D in Contents Production Based on Generative Ai and Cloud, Project Number: RS-2024-00352578, Contribution Rate: 50%).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Bumsoo Kim, Sanghyun Seo; data collection and comparative experiments: Bumsoo Kim, Wonseop Shin, Yonghoon Jung; analysis and interpretation of results: Bumsoo Kim, Youngsup Park, Sanghyun Seo; draft manuscript preparation: Bumsoo Kim, Sanghyun Seo. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>All the data used and analyzed is available in the manuscript.</p>
</sec>
<sec><title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>HP</given-names></string-name>, <string-name><surname>Tseng</surname> <given-names>HY</given-names></string-name>, <string-name><surname>Saini</surname> <given-names>S</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Learning to stylize novel views</article-title>. In: <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <year>2021</year>; p. <fpage>13869</fpage>&#x2013;<lpage>78</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01361</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>UPST-NeRF: universal photorealistic style transfer of neural radiance fields for 3D scene</article-title>; <year>2022</year>. <comment>arXiv preprint arXiv:2208.07059</comment>.</mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>A</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>AesUST: towards aesthetic-enhanced universal style transfer</article-title>. In: <source>Proceedings of the 30th ACM International Conference on Multimedia</source>, <year>2022</year>; <publisher-name>Lisbon</publisher-name>, <publisher-loc>Portugal</publisher-loc>, p. <fpage>1095</fpage>&#x2013;<lpage>106</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3503161.3547939</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>ZS</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>LW</given-names></string-name>, <string-name><surname>Siu</surname> <given-names>WC</given-names></string-name>, <string-name><surname>Kalogeiton</surname> <given-names>V</given-names></string-name></person-group>. <article-title>Name your style: an arbitrary artist-aware image style transfer</article-title>; <year>2022</year>. <comment>arXiv preprint arXiv:2202.13562</comment>.</mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sheng</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>D</given-names></string-name></person-group>. <article-title>StyleFormer: real-time arbitrary style transfer via parametric style composition</article-title>. In: <source>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</source>, <year>2021</year>; p. <fpage>14618</fpage>&#x2013;<lpage>27</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01435</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Fast video multi-style transfer</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>, <year>2020</year>; <publisher-name>Snowmass Village, CO</publisher-name>, <publisher-loc>USA</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/WACV45572.2020.9093420</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kwon</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ye</surname> <given-names>JC</given-names></string-name></person-group>. <article-title>CLIPstyler: image style transfer with a single text condition</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>18062</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01753</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Su</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>CC</given-names></string-name></person-group>. <article-title>CSST-Net: an arbitrary image style transfer network of coverless steganography</article-title>. <source>Vis Comput</source>. <year>2022</year>;38:<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00371-021-02272-6</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Inversion-based style transfer with diffusion models</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2023</year>; <publisher-name>Vancouver, BC</publisher-name>, <publisher-loc>Canada</publisher-loc>; p. <fpage>10146</fpage>&#x2013;<lpage>56</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00978</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Luan</surname> <given-names>F</given-names></string-name>, <string-name><surname>Paris</surname> <given-names>S</given-names></string-name>, <string-name><surname>Shechtman</surname> <given-names>E</given-names></string-name>, <string-name><surname>Bala</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Deep photo style transfer</article-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <year>2017</year>; <publisher-name>Honolulu</publisher-name>, <publisher-loc>Hawaii</publisher-loc>; p. <fpage>4990</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.740</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Luo</surname> <given-names>X</given-names></string-name>, <string-name><surname>Han</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Consistent style transfer</article-title>; <year>2022</year>. <comment>arXiv preprint arXiv:2201.02233</comment>.</mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Mu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>3D photo stylization: learning to generate stylized novel views from a single image</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>16273</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01579</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>A</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Exemplar-based portrait style transfer</article-title>. <source>IEEE Access</source>. <year>2018</year>;<volume>6</volume>:<fpage>58532</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2018.2874203</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xie</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Artistic style discovery with independent components</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>19870</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01925</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Parameter-free style projection for arbitrary image style transfer</article-title>. In: <source>ICASSP 2022&#x2013;2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>, <year>2022</year>; <publisher-loc>Singapore</publisher-loc>, <publisher-name>IEEE</publisher-name>; p. <fpage>2070</fpage>&#x2013;<lpage>4</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICASSP43922.2022.9746290</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>N</given-names></string-name>, <string-name><surname>Hua</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Stereoscopic neural style transfer</article-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, <year>2018</year>; p. <fpage>6654</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00696</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ye</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Exemplar-based 3D portrait stylization</article-title>. <source>IEEE Transact Visual Comput Graph</source>. <year>2021</year>;<volume>29</volume>(<issue>2</issue>):<fpage>1371</fpage>&#x2013;<lpage>83</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TVCG.2021.3114308</pub-id>; <pub-id pub-id-type="pmid">34559656</pub-id></mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>H&#x00F6;llein</surname> <given-names>L</given-names></string-name>, <string-name><surname>Johnson</surname> <given-names>J</given-names></string-name>, <string-name><surname>Nie&#x00DF;ner</surname> <given-names>M</given-names></string-name></person-group>. <article-title>StyleMesh: style transfer for indoor 3D scene reconstructions</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>6198</fpage>&#x2013;<lpage>208</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00610</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chiang</surname> <given-names>PZ</given-names></string-name>, <string-name><surname>Tsai</surname> <given-names>MS</given-names></string-name>, <string-name><surname>Tseng</surname> <given-names>HY</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>WS</given-names></string-name>, <string-name><surname>Chiu</surname> <given-names>WC</given-names></string-name></person-group>. <article-title>Stylizing 3D scene via implicit representation and hypernetwork</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>, <year>2022</year>; <publisher-name>Waikoloa, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>1475</fpage>&#x2013;<lpage>84</lpage>. doi:<pub-id pub-id-type="doi">10.1109/WACV51458.2022.00029</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Kolkin</surname> <given-names>N</given-names></string-name>, <string-name><surname>Bi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Luan</surname> <given-names>F</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shechtman</surname> <given-names>E</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>ARF: artistic radiance fields</article-title>. In: <source>European Conference on Computer Vision</source>, <year>2022</year>; <publisher-loc>Tel Aviv, Israel</publisher-loc>, <publisher-name>Springer</publisher-name>; p. <fpage>717</fpage>&#x2013;<lpage>33</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-19821-2_41</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Basu</surname> <given-names>A</given-names></string-name></person-group>. <article-title>RAST: restorable arbitrary style transfer via multi-restoration</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>, <year>2023</year>; <publisher-name>Waikoloa, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>331</fpage>&#x2013;<lpage>40</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3638770</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gatys</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Ecker</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Bethge</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Image style transfer using convolutional neural networks</article-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <year>2016</year>; <publisher-name>Las Vegas, NV</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>2414</fpage>&#x2013;<lpage>23</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.265</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Simonyan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zisserman</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Very deep convolutional networks for large-scale image recognition</article-title>; <year>2014</year>. <comment>arXiv preprint arXiv:1409.1556</comment>.</mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>TY</given-names></string-name>, <string-name><surname>Maire</surname> <given-names>M</given-names></string-name>, <string-name><surname>Belongie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hays</surname> <given-names>J</given-names></string-name>, <string-name><surname>Perona</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ramanan</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Microsoft COCO: common objects in context</article-title>. In: <source>European Conference on Computer Vision</source>, <year>2014</year>; <publisher-loc>Zurich, Switzerland</publisher-loc>, <publisher-name>Springer</publisher-name>; p. <fpage>740</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>TQ</given-names></string-name>, <string-name><surname>Schmidt</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Fast patch-based style transfer of arbitrary style</article-title>; <year>2016</year>. <comment>arXiv preprint arXiv:1612.04337</comment>.</mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Belongie</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Arbitrary style transfer in real-time with adaptive instance normalization</article-title>. In: <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <year>2017</year>; <publisher-loc>Venice, Italy</publisher-loc>; p. <fpage>1501</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2017.167</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>27.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hong</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jeon</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ahn</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>P</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>AesPA-Net: aesthetic pattern-aware style transfer networks</article-title>. In: <source>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</source>, <year>2023</year>; <publisher-loc>Paris, France</publisher-loc>; p. <fpage>22758</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV51070.2023.02080</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>28.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Dou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Federated domain generalization for image recognition via cross-client style transfer</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>, <year>2023</year>; <publisher-name>Waikoloa, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>361</fpage>&#x2013;<lpage>70</lpage>. doi:<pub-id pub-id-type="doi">10.1109/WACV56688.2023.00044</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>29.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>TY</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Domain enhanced arbitrary image style transfer via contrastive learning</article-title>. In: <source>ACM SIGGRAPH 2022 Conference Proceedings</source>, <year>2022</year>; <publisher-name>Vancouver, BC</publisher-name>, <publisher-loc>Canada</publisher-loc>; p. <fpage>1</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3528233.3530736</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>30.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Du</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>X</given-names></string-name></person-group>. <article-title>CCPL: contrastive coherence preserving loss for versatile style transfer</article-title>. In: <source>European Conference on Computer Vision</source>, <year>2022</year>; <publisher-loc>Tel Aviv, Israel</publisher-loc>, <publisher-name>Springer</publisher-name>; p. <fpage>189</fpage>&#x2013;<lpage>206</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2207.04808</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>31.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kautz</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Learning linear transformations for fast arbitrary style transfer</article-title>; <year>2018</year>. <comment>arXiv preprint arXiv:1808.04537</comment>.</mixed-citation></ref>
<ref id="ref-32"><label>32.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jing</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Stroke controllable fast style transfer with adaptive receptive fields</article-title>. In: <source>Proceedings of the European Conference on Computer Vision (ECCV)</source>, <year>2018</year>; <publisher-loc>Munich, Germany</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-01261-8_15</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>33.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Puy</surname> <given-names>G</given-names></string-name>, <string-name><surname>Perez</surname> <given-names>P</given-names></string-name></person-group>. <article-title>A flexible convolutional solver for fast style transfers</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2019</year>; <publisher-name>Long Beach, CA</publisher-name>, <publisher-loc>USA</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00917</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>34.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jing</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>LCCStyle: arbitrary style transfer with low computational complexity</article-title>. <source>IEEE Trans Multimedia</source>. <year>2021</year>;<volume>25</volume>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2021.3128058</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>35.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>A</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>MicroAST: towards super-fast ultra-resolution arbitrary style transfer</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2023</year>;<volume>37</volume>:<fpage>2742</fpage>&#x2013;<lpage>50</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v37i3.25374</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>36.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>C</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Transforming radiance field with Lipschitz network for photorealistic 3D scene stylization</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2023</year>; <publisher-name>Vancouver, BC</publisher-name>, <publisher-loc>Canada</publisher-loc>; p. <fpage>20712</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.01984</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>37.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>YH</given-names></string-name>, <string-name><surname>He</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>YJ</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>YK</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>L</given-names></string-name></person-group>. <article-title>StylizedNeRF: consistent 3D scene stylization as stylized NeRF via 2D-3D mutual learning</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>18342</fpage>&#x2013;<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01780</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>38.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Consistent video style transfer via relaxation and regularization</article-title>. <source>IEEE Transact Image Process</source>. <year>2020</year>;<volume>29</volume>:<fpage>9125</fpage>&#x2013;<lpage>39</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2020.3024018</pub-id>; <pub-id pub-id-type="pmid">32966219</pub-id></mixed-citation></ref>
<ref id="ref-39"><label>39.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Arbitrary video style transfer via multi-channel correlation</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2021</year>;<volume>35</volume>(<issue>2</issue>):<fpage>1210</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v35i2.16208</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>40.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Atarsaikhan</surname> <given-names>G</given-names></string-name>, <string-name><surname>Iwana</surname> <given-names>BK</given-names></string-name>, <string-name><surname>Uchida</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Guided neural style transfer for shape stylization</article-title>. <source>PLoS One</source>. <year>2020</year>;<volume>15</volume>(<issue>6</issue>):<fpage>e0233489</fpage>. doi:<pub-id pub-id-type="doi">10.1371/journal.pone.0233489</pub-id>; <pub-id pub-id-type="pmid">32497055</pub-id></mixed-citation></ref>
<ref id="ref-41"><label>41.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chandran</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zoss</surname> <given-names>G</given-names></string-name>, <string-name><surname>Gotardo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gross</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bradley</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Adaptive convolutions for structure-aware style transfer</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, Virtual, <year>2021</year>; p. <fpage>7972</fpage>&#x2013;<lpage>81</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.00788</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>42.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>An</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>D</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pfister</surname> <given-names>H</given-names></string-name></person-group>. <article-title>QuantArt: quantizing image style transfer towards high visual fidelity</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2023</year>; <publisher-name>Vancouver, BC</publisher-name>, <publisher-loc>Canada</publisher-loc>; p. <fpage>5947</fpage>&#x2013;<lpage>56</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2212.10431</pub-id>.</mixed-citation></ref>
<ref id="ref-43"><label>43.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jing</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>E</given-names></string-name>, <string-name><surname>Song</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Dynamic instance normalization for arbitrary style transfer</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2020</year>;<volume>34</volume>(<issue>4</issue>):<fpage>4369</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v34i04.5862</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>44.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Johnson</surname> <given-names>J</given-names></string-name>, <string-name><surname>Alahi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Fei-Fei</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Perceptual losses for real-time style transfer and super-resolution</article-title>. In: <source>Computer Vision&#x2013;ECCV 2016: 14th European Conference, October 11&#x2013;14, 2016</source>, <publisher-loc>Amsterdam, The Netherlands</publisher-loc>, <publisher-name>Springer</publisher-name>; p. <fpage>694</fpage>&#x2013;<lpage>711</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-46475-6_43</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>45.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention is all you need</article-title>. In: <source>Advances in Neural Information Processing Systems 30 (NIPS 2017)</source>. <publisher-name>Long Beach, CA</publisher-name>, <publisher-loc>USA</publisher-loc>; <year>2017</year>.</mixed-citation></ref>
<ref id="ref-46"><label>46.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Goodfellow</surname> <given-names>I</given-names></string-name>, <string-name><surname>Metaxas</surname> <given-names>D</given-names></string-name>, <string-name><surname>Odena</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Self-attention generative adversarial networks</article-title>. In: <source>International Conference on Machine Learning</source>, <year>2019</year>; <publisher-name>Long Beach, CA</publisher-name>, <publisher-loc>USA</publisher-loc>: <publisher-name>PMLR</publisher-name>; p. <fpage>7354</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1805.08318</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>47.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Oxholm</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>YF</given-names></string-name></person-group>. <article-title>Multimodal transfer: a hierarchical deep convolutional neural network for fast artistic style transfer</article-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <year>2017</year>; <publisher-name>Honolulu, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>5239</fpage>&#x2013;<lpage>47</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.759</pub-id>.</mixed-citation></ref>
<ref id="ref-48"><label>48.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Universal style transfer via feature transforms</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2017</year>;<volume>30</volume>:<fpage>385</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.5555/3294771.3294808</pub-id>.</mixed-citation></ref>
<ref id="ref-49"><label>49.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tseng</surname> <given-names>KW</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>YC</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>CS</given-names></string-name></person-group>. <article-title>Artistic style novel view synthesis based on a single image</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>2258</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPRW56347.2022.00248</pub-id>.</mixed-citation></ref>
<ref id="ref-50"><label>50.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kong</surname> <given-names>X</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Exploring the temporal consistency of arbitrary style transfer: a channelwise perspective</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2023</year>;<volume>35</volume>:<fpage>8482</fpage>&#x2013;<lpage>96</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2022.3230084</pub-id>; <pub-id pub-id-type="pmid">37018565</pub-id></mixed-citation></ref>
<ref id="ref-51"><label>51.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jandial</surname> <given-names>S</given-names></string-name>, <string-name><surname>Deshmukh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Java</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shahid</surname> <given-names>S</given-names></string-name>, <string-name><surname>Krishnamurthy</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Gatha: relational loss for enhancing text-based style transfer</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>, <year>2023</year>; <publisher-name>Vancouver, BC</publisher-name>, <publisher-loc>Canada</publisher-loc>; p. <fpage>3546</fpage>&#x2013;<lpage>51</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPRW59228.2023.00362</pub-id>.</mixed-citation></ref>
<ref id="ref-52"><label>52.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fu</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>XE</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>WY</given-names></string-name></person-group>. <article-title>Language-driven artistic style transfer</article-title>. In: <source>European Conference on Computer Vision</source>, <year>2022</year>; <publisher-loc>Tel Aviv, Israel</publisher-loc>, <publisher-name>Springer</publisher-name>; p. <fpage>717</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-20059-5_41</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>53.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Li</surname> <given-names>R</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Exact feature distribution matching for arbitrary style transfer and domain generalization</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>8035</fpage>&#x2013;<lpage>45</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00787</pub-id>.</mixed-citation></ref>
<ref id="ref-54"><label>54.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>He</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>CFA-GAN: cross fusion attention and frequency loss for image style transfer</article-title>. <source>Displays</source>. <year>2024</year>;<volume>81</volume>:<fpage>102588</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.displa.2023.102588</pub-id>.</mixed-citation></ref>
<ref id="ref-55"><label>55.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ge</surname> <given-names>B</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>C</given-names></string-name>, <string-name><surname>Guan</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Arbitrary style transfer method with attentional feature distribution matching</article-title>. <source>Multimed Syst</source>. <year>2024</year>;<volume>30</volume>(<issue>2</issue>):<fpage>96</fpage>. doi:<pub-id pub-id-type="doi">10.21203/rs.3.rs-3365364/v1</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>56.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Suresh</surname> <given-names>AP</given-names></string-name>, <string-name><surname>Jain</surname> <given-names>S</given-names></string-name>, <string-name><surname>Noinongyao</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ganguly</surname> <given-names>A</given-names></string-name>, <string-name><surname>Watchareeruetai</surname> <given-names>U</given-names></string-name>, <string-name><surname>Samacoits</surname> <given-names>A</given-names></string-name></person-group>. <article-title>FastCLIPstyler: optimisation-free text-based image style transfer using style representations</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>, <year>2024</year>; <publisher-name>Waikoloa, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>7316</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1109/WACV57701.2024.00715</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>57.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Min</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jung</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Controllable style transfer via test-time training of implicit neural representation</article-title>. <source>Pattern Recognit</source>. <year>2024</year>;<volume>146</volume>:<fpage>109988</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.109988</pub-id>.</mixed-citation></ref>
<ref id="ref-58"><label>58.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Du</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>N</given-names></string-name>, <string-name><surname>Du</surname> <given-names>H</given-names></string-name></person-group>. <article-title>FST-OAM: a fast style transfer model using optimized self-attention mechanism</article-title>. <source>Signal Image Video Process</source>. <year>2024</year>;<volume>18</volume>:<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11760-024-03064-w</pub-id>.</mixed-citation></ref>
<ref id="ref-59"><label>59.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>HA</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Application of multi-level adaptive neural network based on optimization algorithm in image style transfer</article-title>. <source>Multimed Tools Appl</source>. <year>2024</year>;<volume>83</volume>:<fpage>1</fpage>&#x2013;<lpage>23</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-024-18451-1</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>60.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Van Den Oord</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vinyals</surname> <given-names>O</given-names></string-name>, <string-name><surname>Kavukcuoglu</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Neural discrete representation learning</article-title>. <source>Adv Neural Inform Process Syst</source>. <year>2017</year>;<volume>30</volume>:<fpage>6309</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1711.00937</pub-id>.</mixed-citation></ref>
<ref id="ref-61"><label>61.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Esser</surname> <given-names>P</given-names></string-name>, <string-name><surname>Rombach</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ommer</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Taming transformers for high-resolution image synthesis</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2021</year>; p. <fpage>12873</fpage>&#x2013;<lpage>83</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01268</pub-id>.</mixed-citation></ref>
<ref id="ref-62"><label>62.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Structure-guided arbitrary style transfer for artistic image and video</article-title>. <source>IEEE Trans Multimed</source>. <year>2021</year>;<volume>24</volume>:<fpage>1299</fpage>&#x2013;<lpage>312</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2021.3063605</pub-id>.</mixed-citation></ref>
<ref id="ref-63"><label>63.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Isola</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>T</given-names></string-name>, <string-name><surname>Efros</surname> <given-names>AA</given-names></string-name></person-group>. <article-title>Image-to-image translation with conditional adversarial networks</article-title>. In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <year>2017</year>; <publisher-name>Honolulu, HI</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>1125</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.632</pub-id>.</mixed-citation></ref>
<ref id="ref-64"><label>64.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Doll&#x00E1;r</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Masked autoencoders are scalable vision learners</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>; <publisher-name>New Orleans, LA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>16000</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01553</pub-id>.</mixed-citation></ref>
<ref id="ref-65"><label>65.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>DY</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>KH</given-names></string-name></person-group>. <article-title>Arbitrary style transfer with style-attentional networks</article-title>. In: <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2019</year>; <publisher-name>Long Beach, CA</publisher-name>, <publisher-loc>USA</publisher-loc>; p. <fpage>5880</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00603</pub-id>.</mixed-citation></ref>
<ref id="ref-66"><label>66.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>A</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Artistic style transfer with internal-external learning and contrastive learning</article-title>. <source>Adv Neural Inform Process Syst</source>. <year>2021</year>;<volume>34</volume>:<fpage>26561</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.5555/3540261.3542295</pub-id>.</mixed-citation></ref>
<ref id="ref-67"><label>67.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>T</given-names></string-name>, <string-name><surname>He</surname> <given-names>D</given-names></string-name>, <string-name><surname>Li</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>AdaAttN: revisit attention mechanism in arbitrary neural style transfer</article-title>. In: <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <year>2021</year>; p. <fpage>6649</fpage>&#x2013;<lpage>58</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00658</pub-id>.</mixed-citation></ref>
<ref id="ref-68"><label>68.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chan</surname> <given-names>KC</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Loy</surname> <given-names>CC</given-names></string-name>, <string-name><surname>Qiao</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Temporally consistent video colorization with deep feature propagation and self-regularization learning</article-title>. <source>Comput Visual Media</source>. <year>2024</year>;<volume>10</volume>(<issue>2</issue>):<fpage>375</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s41095-023-0342-8</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>