<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">69784</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.069784</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>3D Enhanced Residual CNN for Video Super-Resolution Network</article-title>
<alt-title alt-title-type="left-running-head">3D Enhanced Residual CNN for Video Super-Resolution Network</alt-title>
<alt-title alt-title-type="right-running-head">3D Enhanced Residual CNN for Video Super-Resolution Network</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Xin</surname><given-names>Weiqiang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><xref ref-type="aff" rid="aff-3">3</xref><xref ref-type="author-notes" rid="afn1">#</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Zheng</given-names></name><xref ref-type="aff" rid="aff-4">4</xref><xref ref-type="author-notes" rid="afn1">#</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Xi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-5">5</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Tang</surname><given-names>Yufeng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Bing</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Tian</surname><given-names>Chunwei</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><xref ref-type="aff" rid="aff-5">5</xref><email>chunweitian@nwpu.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>School of Software, Northwestern Polytechnical University</institution>, <addr-line>Xi&#x2019;an, 710072</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Shenzhen Research Institute of Northwestern Polytechnical University, Northwestern Polytechnical University</institution>, <addr-line>Shenzhen, 518057</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>State Key Laboratory for Novel Software Technology, Nanjing University</institution>, <addr-line>Nanjing, 210023</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>School of Interdisciplinary Studies, Lingnan University</institution>, <addr-line>Hong Kong, 999077</addr-line>, <country>China</country></aff>
<aff id="aff-5"><label>5</label><institution>Yangtze River Delta Research Institute, Northwestern Polytechnical University</institution>, <addr-line>Taicang, 215400</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Chunwei Tian. Email: <email>chunweitian@nwpu.edu.cn</email></corresp>
<fn id="afn1">
<p><sup>#</sup>These authors contributed equally to this work</p>
</fn>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year></pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>23</day><month>09</month><year>2025</year></pub-date>
<volume>85</volume>
<issue>2</issue>
<fpage>2837</fpage>
<lpage>2849</lpage>
<history>
<date date-type="received">
<day>30</day>
<month>6</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>8</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_69784.pdf"></self-uri>
<abstract>
<p>Deep convolutional neural networks (CNNs) have demonstrated remarkable performance in video super-resolution (VSR). However, the ability of most existing methods to recover fine details in complex scenes is often hindered by the loss of shallow texture information during feature extraction. To address this limitation, we propose a 3D Convolutional Enhanced Residual Video Super-Resolution Network (3D-ERVSNet). This network employs a forward and backward bidirectional propagation module (FBBPM) that aligns features across frames using explicit optical flow estimated by the lightweight SPyNet. By incorporating an enhanced residual structure (ERS) with skip connections, shallow and deep features are effectively integrated, enhancing texture restoration capabilities. Furthermore, a 3D convolution module (3DCM) is applied after the backward propagation module to implicitly capture spatio-temporal dependencies. These components work together: the FBBPM extracts aligned features, the ERS fuses hierarchical representations, and the 3DCM refines temporal coherence. Finally, a deep feature aggregation module (DFAM) fuses the processed features, and a pixel-upsampling module (PUM) reconstructs the high-resolution (HR) video frames. Comprehensive evaluations on the REDS, Vid4, UDM10, and Vim4 benchmarks demonstrate strong performance, including 30.95 dB PSNR/0.8822 SSIM on REDS and 32.78 dB/0.8987 on Vim4. 3D-ERVSNet achieves significant gains over baselines while maintaining high efficiency with only 6.3M parameters and 77 ms/frame runtime (i.e., 20&#x00D7; faster than RBPN). The network&#x2019;s effectiveness stems from its task-specific asymmetric design that balances explicit alignment and implicit fusion.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Video super-resolution</kwd>
<kwd>3D convolution</kwd>
<kwd>enhanced residual CNN</kwd>
<kwd>spatio-temporal feature extraction</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Basic and Applied Basic Research Foundation of Guangdong Province</funding-source>
<award-id>2025A1515011566</award-id>
</award-group>
<award-group id="awg2">
<funding-source>State Key Laboratory for Novel Software Technology, Nanjing University</funding-source>
<award-id>KFKT2024B08</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Leading Talents in Gusu Innovation and Entrepreneurship</funding-source>
<award-id>ZXL2023170</award-id>
</award-group>
<award-group id="awg4">
<funding-source>Basic Research Programs of Taicang 2024</funding-source>
<award-id>TC2024JC32</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Super-resolution (SR) [<xref ref-type="bibr" rid="ref-1">1</xref>] constitutes a fundamental task in image processing and computer vision [<xref ref-type="bibr" rid="ref-2">2</xref>]. It is an inherently challenging problem due to the loss of both spatial and temporal information during image capture and transmission, rendering it ill-posed, particularly in complex scenes [<xref ref-type="bibr" rid="ref-3">3</xref>]. Unlike single image super-resolution (SISR) [<xref ref-type="bibr" rid="ref-4">4</xref>], video super-resolution (VSR) [<xref ref-type="bibr" rid="ref-5">5</xref>] leverages temporal information across adjacent frames, enabling the recovery of finer details and richer textures. Consequently, VSR techniques have found widespread application in domains such as video enhancement [<xref ref-type="bibr" rid="ref-6">6</xref>], autonomous driving [<xref ref-type="bibr" rid="ref-7">7</xref>], and security surveillance [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<p>The advent of deep learning [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>] has led to the widespread adoption of neural networks across various image tasks, such as denoising [<xref ref-type="bibr" rid="ref-11">11</xref>], super-resolution [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>], watermark removal [<xref ref-type="bibr" rid="ref-14">14</xref>] and blind image quality assessment [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>]. These advancements have significantly empowered convolutional neural networks (CNNs) for effective spatio-temporal feature extraction and modeling. Pioneering VSR works, such as SRCNN-based approaches [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>], demonstrated the feasibility of deep learning for this task. Subsequent methods like ToFlow [<xref ref-type="bibr" rid="ref-19">19</xref>] improved frame alignment through trainable motion estimation. Furthermore, richer feature representations have been shown to yield significant improvements in super-resolution results [<xref ref-type="bibr" rid="ref-20">20</xref>]. Notable contributions include the Recurrent Back-Projection Network (RBPN) proposed by Haris et al. [<xref ref-type="bibr" rid="ref-21">21</xref>], which integrates multi-frame information with a single-frame SR path, and EDVR by Wang et al. [<xref ref-type="bibr" rid="ref-22">22</xref>], which incorporates deformable convolution and a spatio-temporal attention fusion module. More recent innovations explore dynamic adaptive filters for feature-level alignment [<xref ref-type="bibr" rid="ref-23">23</xref>] and the incorporation of fuzzy mechanisms to enhance reconstruction robustness [<xref ref-type="bibr" rid="ref-24">24</xref>]. Despite the notable progress made by deep learning based VSR models, significant challenges remain. 
Many CNN-based VSR models enlarge their receptive fields by stacking deeper layers, which can cause fine details to be lost because shallow features are propagated less effectively. Moreover, existing models often rely solely on explicit alignment or model temporal dependencies insufficiently, resulting in degraded performance in complex scenes.</p>
<p>Consequently, there is a need to design more efficient and streamlined network architectures that not only enhance shallow feature extraction but also enable more robust spatio-temporal feature acquisition and inter-frame relationship modeling through improved contextual understanding. To this end, we introduce a 3D Convolutional Enhanced Residual Video Super-Resolution Network (3D-ERVSNet). The core of our approach is a Forward and Backward Bidirectional Propagation Module (FBBPM), designed to extract and align features across video frame sequences. By incorporating an Enhanced Residual Structure (ERS) with skip connections, shallow and deep features are effectively combined, significantly enhancing the network&#x2019;s ability to recover intricate image textures. The main contributions of this paper are summarized as follows:
<list list-type="order">
<list-item><p>We propose a novel video super-resolution network, termed 3D-ERVSNet, which employs an Enhanced Residual Structure (ERS) with multi-layer skip connections. This design facilitates improved texture restoration by effectively leveraging shallow feature information.</p></list-item>
<list-item><p>We introduce a Forward and Backward Bidirectional Propagation Module (FBBPM) for inter-frame alignment. The FBBPM incorporates an Optical Flow-guided Alignment Block (OFAB) to explicitly warp features using pre-computed optical flow. Furthermore, 3D convolutions are integrated to implicitly learn spatio-temporal relationships and enable adaptive alignment. This hybrid strategy effectively leverages both explicit and implicit mechanisms to exploit spatio-temporal information.</p></list-item>
<list-item><p>Extensive experiments demonstrate that the proposed 3D-ERVSNet achieves superior visual quality, particularly in restoring image edge and texture details, outperforming existing methods on benchmark datasets.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Proposed Method</title>
<sec id="s2_1">
<label>2.1</label>
<title>Network Architecture</title>
<p>This section details the architecture of the proposed 3D-ERVSNet. As illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, 3D-ERVSNet comprises four key modules: the Forward and Backward Bidirectional Propagation Module (FBBPM), the 3D Convolution Module (3DCM), the Deep Feature Aggregation Module (DFAM), and the Pixel-Upsampling Module (PUM). The FBBPM extracts features in parallel along the forward and backward temporal directions. It consists of three sub-modules: the Optical Flow Alignment Block (OFAB), the Feature Alignment Block (FAB), and the Enhanced Residual Structure (ERS). A 3D convolution layer is applied after the Backward Propagation Module (BPM). The DFAM then fuses the output features from the Forward Propagation Module (FPM) with those from the 3DCM. The fused result is subsequently fed into the PUM to generate the final high-resolution (HR) video frame.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overview of the proposed 3D-ERVSNet framework</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_69784-fig-1.tif"/>
</fig>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Bidirectional Propagation Module</title>
<p>Because the input to the VSR task is an image sequence rather than an individual frame as in SISR, it is essential to model both the spatial features within each frame and the temporal relationships between frames. Consequently, 3D-ERVSNet incorporates two parallel feature propagation pathways. As depicted in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the Forward Propagation Module (FPM, <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and the Backward Propagation Module (BPM, <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) operate concurrently, processing video sequences in opposite temporal directions. The two modules share an identical internal structure and differ solely in their direction of feature propagation. Their operation is expressed as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi></mml:math></inline-formula> denote the FPM and BPM modules, respectively. <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the input <italic>LR</italic> video sequence, and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denote the outputs of the corresponding modules. 
Each propagation module consists of three components, namely the OFAB, the FAB, and the ERS, which are described in detail below.</p>
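<p>As a minimal illustration of Eqs. (1) and (2), the following dependency-free Python sketch runs the same recurrent update over a sequence in both temporal directions. It is not the authors&#x2019; implementation: the names <monospace>propagate</monospace> and <monospace>toy_step</monospace>, the scalar &#x201C;frames&#x201D;, and the zero initial state are assumptions made purely for illustration, with <monospace>toy_step</monospace> standing in for the OFAB, FAB, and ERS chain.</p>
<code language="python">
```python
from typing import Callable, List, Sequence


def propagate(frames: Sequence[float],
              step: Callable[[float, float], float],
              reverse: bool = False) -> List[float]:
    """Run one recurrent pass over the sequence and return one state per frame.

    `step(frame, prev_state)` is a hypothetical stand-in for OFAB + FAB + ERS.
    """
    order = range(len(frames) - 1, -1, -1) if reverse else range(len(frames))
    states = [0.0] * len(frames)
    prev = 0.0  # zero initial state at the sequence boundary (assumption)
    for i in order:
        prev = step(frames[i], prev)
        states[i] = prev
    return states


def toy_step(frame: float, prev_state: float) -> float:
    # Toy update: mix the current frame with the propagated state.
    return frame + 0.5 * prev_state


frames = [1.0, 2.0, 4.0]
o_f = propagate(frames, toy_step)                # forward pass, Eq. (1)
o_b = propagate(frames, toy_step, reverse=True)  # backward pass, Eq. (2)
```
</code>
<p>Both passes share the same update function and differ only in traversal order, which mirrors the statement that the FPM and BPM are structurally identical.</p>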
<sec id="s2_2_1">
<label>2.2.1</label>
<title>Optical Flow Alignment Block (OFAB)</title>
<p>To achieve precise alignment between consecutive frames for enhanced texture reconstruction, we employ a pre-trained optical flow network. Among the prominent candidates, namely RAFT [<xref ref-type="bibr" rid="ref-25">25</xref>], PWC-Net [<xref ref-type="bibr" rid="ref-26">26</xref>], and SPyNet [<xref ref-type="bibr" rid="ref-27">27</xref>], SPyNet is the most lightweight option, requiring only 1.2 million (1.2M) parameters compared to 5.3M for RAFT and 8.75M for PWC-Net. Furthermore, SPyNet exhibits a significantly smaller memory footprint (9.7 MB) than PWC-Net (41.1 MB). Notably, all three networks achieve inference times below 0.1 s. Based on this efficiency analysis, SPyNet is selected as our optical flow estimation network. Crucially, the parameters of SPyNet are frozen during the training of 3D-ERVSNet. This ensures consistent alignment performance across diverse scenes and reduces the number of parameters that must be trained. The optical flow computation is formally defined as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denote the original LR images of the preceding, current, and subsequent frames, respectively. <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>S</mml:mi></mml:math></inline-formula> represents the pre-trained SPyNet optical flow estimation network. 
<inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represent the forward and backward optical flows used to align neighboring-frame features to the current frame.</p>
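<p>The indexing in Eqs. (3) and (4) can be made concrete with a short Python sketch that lists which frame pairs are passed to the flow estimator. The helper <monospace>flow_inputs</monospace> is hypothetical, and marking the first forward flow and the last backward flow as undefined (<monospace>None</monospace>) is an assumption about boundary handling that the text does not specify.</p>
<code language="python">
```python
from typing import List, Optional, Tuple

Pair = Tuple[int, int]


def flow_inputs(num_frames: int) -> Tuple[List[Optional[Pair]], List[Optional[Pair]]]:
    """Return, per frame i, the index pairs passed to S in Eqs. (3) and (4)."""
    forward: List[Optional[Pair]] = []
    backward: List[Optional[Pair]] = []
    last = num_frames - 1
    for i in range(num_frames):
        # s_i^f = S(x_{i-1}, x_i); undefined for the first frame (assumption).
        forward.append((i - 1, i) if i != 0 else None)
        # s_i^b = S(x_{i+1}, x_i); undefined for the last frame (assumption).
        backward.append((i + 1, i) if i != last else None)
    return forward, backward


fwd_pairs, bwd_pairs = flow_inputs(4)
```
</code>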
</sec>
<sec id="s2_2_2">
<label>2.2.2</label>
<title>Feature Alignment Block (FAB)</title>
<p>Within the FPM, the FAB uses the computed forward optical flow to warp the features of the preceding frame into alignment with the current frame. Conversely, within the BPM, the FAB uses the backward optical flow to warp the features of the subsequent frame into alignment with the current frame. This alignment allows relevant information from adjacent frames to be fused, contributing to enhanced sharpness and clarity in the processed frames. This process is expressed by:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mover><mml:mi>h</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mover><mml:mi>h</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>W</italic> denotes the feature alignment operation, <inline-formula 
id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msubsup><mml:mover><mml:mi>h</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msubsup><mml:mover><mml:mi>h</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represent the aligned features output by the FAB for the current frame in the forward and backward paths, respectively.</p>
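<p>One common way to realize a warping operation such as <italic>W</italic> in Eqs. (5) and (6) is backward warping with bilinear interpolation. The sketch below is a plain-Python illustration for a single-channel feature map. The function name <monospace>warp</monospace>, the flow convention (each output pixel samples the source at its own position plus the flow), and zero padding outside the map are assumptions, not details taken from the paper.</p>
<code language="python">
```python
import math
from typing import List

Grid = List[List[float]]


def warp(feature: Grid, flow_x: Grid, flow_y: Grid) -> Grid:
    """Backward-warp `feature`: out[y][x] samples it at (y + flow_y, x + flow_x)."""
    height, width = len(feature), len(feature[0])

    def sample(y: int, x: int) -> float:
        # Zero padding outside the feature map (assumption).
        inside = (y in range(height)) and (x in range(width))
        return feature[y][x] if inside else 0.0

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y = y + flow_y[y][x]
            src_x = x + flow_x[y][x]
            y0, x0 = math.floor(src_y), math.floor(src_x)
            wy, wx = src_y - y0, src_x - x0
            # Bilinear interpolation over the four neighbouring samples.
            out[y][x] = ((1 - wy) * (1 - wx) * sample(y0, x0)
                         + (1 - wy) * wx * sample(y0, x0 + 1)
                         + wy * (1 - wx) * sample(y0 + 1, x0)
                         + wy * wx * sample(y0 + 1, x0 + 1))
    return out


prev_feature = [[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]]
shift_right = [[1.0] * 3 for _ in range(2)]  # sample one pixel to the right
no_shift = [[0.0] * 3 for _ in range(2)]
aligned = warp(prev_feature, shift_right, no_shift)
```
</code>
<p>With a uniform horizontal flow of one pixel, every output location takes the value of its right-hand neighbour, and the last column falls outside the map and is filled with zeros.</p>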
</sec>
<sec id="s2_2_3">
<label>2.2.3</label>
<title>Enhanced Residual Structure (ERS)</title>
<p>Deep CNNs are susceptible to performance degradation stemming from the vanishing or exploding gradient problem. Residual learning, pioneered by He et al. [<xref ref-type="bibr" rid="ref-28">28</xref>], effectively mitigates this issue by incorporating skip connections that merge the output of a layer with the output of an earlier layer. Building upon this concept, we propose an Enhanced Residual Structure (ERS) incorporating multi-layer skip connections. This design facilitates the effective integration of both shallow and deep features, significantly strengthening the network&#x2019;s capacity to recover intricate image textures. As depicted in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the ERS comprises an initial convolutional layer for channel adjustment, followed by a LeakyReLU activation function, and a series of stacked Enhanced Residual Blocks (ERBs). Each ERB contains three convolutional layers interspersed with two Gaussian Error Linear Unit (GELU) activation functions.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Schematic diagram of the Enhanced Residual Structure (ERS) and the Enhanced Residual Block (ERB)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_69784-fig-2.tif"/>
</fig>
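<p>Within the ERB, we adopt the GELU activation function in place of the conventional ReLU. Whereas ReLU sets every negative input to zero, GELU weights each input by the standard Gaussian cumulative distribution function, so small negative activations are attenuated rather than discarded. Compared to standard residual blocks utilizing ReLU, GELU enhances neuron utilization efficiency, promotes smoother model convergence during training, and offers computational advantages.</p>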
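<p>The difference between the two activations can be checked numerically with the exact GELU definition, GELU(x) = x&#x03A6;(x), where &#x03A6; is the standard normal cumulative distribution function. The sketch below also includes <monospace>toy_erb</monospace>, a scalar stand-in for an ERB in which three scalar weights replace the three convolutional layers. The placement of a single skip connection around the block is an assumption, since the exact wiring is given only in Fig. 2.</p>
<code language="python">
```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def toy_erb(x: float, w1: float, w2: float, w3: float) -> float:
    """Scalar stand-in for an ERB: three 'layers', two GELUs, one skip connection."""
    h = gelu(w1 * x)      # layer 1 + GELU
    h = gelu(w2 * h)      # layer 2 + GELU
    return x + w3 * h     # layer 3, then add the block input (assumed skip)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  gelu={gelu(value):.4f}")
```
</code>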
</sec>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>3D Convolution Module (3DCM)</title>
<p>Motivated by the proven effectiveness of 3D convolutions for capturing spatio-temporal information in various computer vision tasks, such as face restoration using plug-and-play 3D facial priors [<xref ref-type="bibr" rid="ref-29">29</xref>], we incorporate a <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> 3D convolution layer after the BPM. This module performs implicit spatio-temporal alignment and fusion, formulated as:<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>3</mml:mn><mml:mi>D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the output feature map. Since a <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolution kernel reduces the temporal length of the sequence, we replicate the first and last frames of the BPM&#x2019;s output sequence. This results in an extended feature volume of dimensions <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula>, effectively padding the sequence temporally. 
Additionally, padding of size <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, i.e., none along the temporal axis and one pixel along each spatial axis, is applied to preserve the spatial dimensions. The 3DCM mitigates inaccuracies arising from unreliable optical flow estimates and helps handle occlusions. We apply the 3DCM only after the backward propagation path. The rationale is that features processed by the BPM have already aggregated information from future frames and therefore provide a richer temporal context, which is particularly suitable for the implicit spatio-temporal fusion performed by 3D convolutions. This asymmetric design choice [<xref ref-type="bibr" rid="ref-30">30</xref>] is empirically validated to be both effective and computationally more efficient than symmetric application on both paths. Unlike explicit alignment methods, 3D convolution implicitly captures spatio-temporal dependencies between frames. It dynamically leverages reliable information from neighboring frames to compensate for and rectify unreliable regions within the current frame. This implicit mechanism enhances the robustness of the feature representations and consequently improves video restoration quality, particularly in challenging scenarios involving complex motion or occlusions.</p>
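<p>The effect of the temporal replication can be verified with a small Python sketch. The helper names are hypothetical, and the three-tap temporal average merely stands in for the temporal extent of a learned 3D kernel; the point is only that T frames become T + 2 after replication and return to T after a kernel of temporal size 3 with no temporal padding.</p>
<code language="python">
```python
from typing import List

Frame = List[List[float]]


def replicate_pad_time(sequence: List[Frame]) -> List[Frame]:
    """Repeat the first and last frames so T frames become T + 2."""
    return [sequence[0]] + list(sequence) + [sequence[-1]]


def conv_output_size(size: int, kernel: int = 3, padding: int = 0, stride: int = 1) -> int:
    """Standard convolution output-size formula along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def temporal_average(sequence: List[Frame]) -> List[Frame]:
    """Toy 3-tap temporal filter standing in for the time axis of a 3D kernel."""
    height, width = len(sequence[0]), len(sequence[0][0])
    return [[[(sequence[t][y][x] + sequence[t + 1][y][x] + sequence[t + 2][y][x]) / 3.0
              for x in range(width)] for y in range(height)]
            for t in range(len(sequence) - 2)]


bpm_output = [[[float(t)]] for t in range(4)]  # T = 4 frames of size 1 x 1
padded = replicate_pad_time(bpm_output)        # T + 2 = 6 frames
filtered = temporal_average(padded)            # back to T = 4 frames
```
</code>
<p>Under the same formula, a spatial size of 32 with kernel 3 and padding 1 stays at 32, which is why a padding of 1 along height and width keeps the spatial dimensions unchanged.</p>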
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Deep Feature Aggregation Module (DFAM)</title>
<p>To synthesize a more comprehensive feature representation, the DFAM aggregates the deep features generated by the preceding modules. The DFAM consists of a single <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutional layer. Its inputs are the output features <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> from the FPM and <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> from the 3DCM. The module concatenates these two feature maps along the channel dimension and subsequently adjusts the channel dimensionality through the <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution. 
This aggregation process is defined as:<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:math></inline-formula> represents the function of the Deep Feature Aggregation Module, and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the fused output feature map.</p>
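<p>The aggregation in Eq. (8) can be sketched in plain Python as channel concatenation followed by a per-pixel linear map, which is what a 1 &#x00D7; 1 convolution computes. This is a minimal sketch under stated assumptions: the function names, the toy 2 &#x00D7; 2 feature maps, and the averaging weights are hypothetical and are not the trained parameters of the DFAM.</p>
<code language="python"><![CDATA[
```python
def concat_channels(a, b):
    """Concatenate two feature maps shaped [C][H][W] along the channel axis."""
    return a + b


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution, i.e. a per-pixel linear map over channels.

    x: [C_in][H][W], weight: [C_out][C_in], bias: [C_out].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(len(x)))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for o in range(len(weight))
    ]


def fusion(o_f, o_3d, weight, bias):
    """Concatenate O_f and O_3D, then mix channels with a 1x1 convolution."""
    return conv1x1(concat_channels(o_f, o_3d), weight, bias)


# Toy inputs (hypothetical): one channel each, 2 x 2 pixels.
o_f = [[[1.0, 2.0], [3.0, 4.0]]]
o_3d = [[[10.0, 20.0], [30.0, 40.0]]]
weight = [[0.5, 0.5]]  # one output channel that averages the two inputs
bias = [0.0]
print(fusion(o_f, o_3d, weight, bias))  # [[[5.5, 11.0], [16.5, 22.0]]]
```
]]></code>
<p>Because the kernel covers a single pixel, the layer changes only the channel count and leaves the spatial size untouched.</p>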
</sec>
<sec id="s2_5">
<label>2.5</label>
<title>Pixel-Upsampling Module (PUM)</title>
<p>The Pixel-Upsampling Module (PUM) reconstructs the final high-resolution (HR) image from the aggregated deep features <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. It employs a dual-path architecture with a residual connection. The primary feature path, denoted as <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>1</mml:mn></mml:math></inline-formula>, learns the high-frequency residual details. This path processes <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> through a sequence of two <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolution layers, each followed by a pixel shuffle operation with an upscaling factor of 2, so that the two stages together upsample the features by a factor of 4 to the target resolution. A LeakyReLU activation is applied after each convolution. Whereas ReLU outputs zero (and therefore has zero gradient) for negative inputs, LeakyReLU applies a small non-zero slope to them, so gradients continue to flow through the network even for negative inputs, which benefits model convergence. The secondary residual path, denoted as <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula>, provides a coarse, structurally sound baseline image. It directly upsamples the original low-resolution input <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to the target resolution using a single bilinear interpolation operation. 
The full-resolution feature map from <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>1</mml:mn></mml:math></inline-formula> is added element-wise to the full-resolution upsampled image from <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula>. This design allows the network to focus on learning the high-frequency residual information needed to transform the coarse bilinear upsampling into a high-quality HR image. This final fusion is described by the equation:<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>1</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2295;</mml:mo><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>O</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the final high-resolution output frame of the 3D-ERVSNet, <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>1</mml:mn></mml:math></inline-formula> represents the function of the primary up-sampling path processing the DFAM output, <inline-formula id="ieqn-37"><mml:math 
id="mml-ieqn-37"><mml:mi>P</mml:mi><mml:mi>U</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> represents the function of the secondary path up-sampling the original LR input frame, and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mo>&#x2295;</mml:mo></mml:math></inline-formula> denotes element-wise addition.</p>
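<p>The pixel shuffle rearrangement and the residual addition of Eq. (9) can be sketched with nested lists. The code below is illustrative only: the function names are hypothetical, the channel layout follows the common sub-pixel convention (which the text does not spell out), the LeakyReLU slope of 0.1 is an assumed value, and the constant base image merely stands in for the bilinear path.</p>
<code language="python"><![CDATA[
```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] features into [C][H*r][W*r].

    Assumed layout: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w].
    """
    channels = len(x) // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0.0] * (width * r) for _ in range(height * r)] for _ in range(channels)]
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


def leaky_relu(v, negative_slope=0.1):
    """LeakyReLU on a scalar; the slope value is an assumption."""
    return v if v >= 0 else negative_slope * v


def add_elementwise(a, b):
    """Element-wise sum of two [C][H][W] maps of identical shape."""
    return [
        [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


# Four 1 x 1 channels become one 2 x 2 channel after a factor-2 shuffle.
residual = pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], 2)
base = [[[0.5, 0.5], [0.5, 0.5]]]  # stand-in for the bilinear path
print(add_elementwise(residual, base))  # [[[1.5, 2.5], [3.5, 4.5]]]
```
]]></code>
<p>Applying a convolution and a factor-2 shuffle twice in sequence gives the overall factor of 4 in each spatial dimension.</p>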
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Experiments</title>
<sec id="s3_1">
<label>3.1</label>
<title>Datasets and Evaluation Protocol</title>
<p>To ensure experimental fairness, the proposed 3D-ERVSNet model was trained on the REDS dataset [<xref ref-type="bibr" rid="ref-31">31</xref>]. Evaluation was conducted on three benchmark test sets: Vid4 [<xref ref-type="bibr" rid="ref-32">32</xref>], UDM10 [<xref ref-type="bibr" rid="ref-33">33</xref>], and Vim4 (comprising four sequences randomly selected from Vimeo-90K [<xref ref-type="bibr" rid="ref-19">19</xref>], abbreviated here as Vim90k). The REDS dataset is partitioned into a training set (240 videos), a validation set (30 videos), and a test set (30 videos). The Vid4 test set contains four video sequences: &#x2018;calendar&#x2019;, &#x2018;city&#x2019;, &#x2018;foliage&#x2019;, and &#x2018;walk&#x2019;. The UDM10 dataset includes 10 video sequences, each consisting of 32 consecutive frames. Each of the four Vim4 sequences contains 7 consecutive frames. Quantitative performance is evaluated using the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM).</p>
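<p>For reference, PSNR is computed from the mean squared error between a reconstructed frame and its ground truth. The following is a minimal sketch, assuming single-channel images stored as nested lists; the function name <monospace>psnr</monospace> and the peak value are illustrative choices, and the colour space used for evaluation in this work is not specified here.</p>
<code language="python"><![CDATA[
```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2D images."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example with intensities in [0, 1]: every pixel is off by 0.1,
# so the MSE is 0.01 and the PSNR is 10 * log10(1 / 0.01) = 20 dB.
print(round(psnr([[0.0, 0.0]], [[0.1, 0.1]], max_value=1.0), 2))  # 20.0
```
]]></code>
<p>Higher values indicate a smaller pixel-wise error; SSIM complements this by comparing local structure rather than raw differences.</p>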
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Experimental Setting</title>
<p>Following the training methodology of LapSRN [<xref ref-type="bibr" rid="ref-34">34</xref>], we employ the Charbonnier loss [<xref ref-type="bibr" rid="ref-35">35</xref>] to optimize the network parameters. This loss function measures the difference between the ground-truth high-resolution (HR) video frames and the predicted HR frames. For effective training, an input sequence length of 15 frames is utilized. Optimization is performed using the Adam optimizer with a cosine annealing learning rate schedule. The initial learning rates are configured as follows: <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> for the feature extraction components, <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mn>2.5</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> for the optical flow estimation module (SPyNet, kept frozen), and <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> for the remaining modules. Training proceeds for a total of 30,000 iterations. Crucially, the weights of the BPM and FAB sub-modules are frozen during the initial 5000 iterations to stabilize early training. The batch size is set to 8. Within each ERB, Kaiming initialization is applied to the convolutional layers, with the initial weights scaled by a factor of 0.1. 
All experiments focus on the challenging and prevalent case of <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> video super-resolution [<xref ref-type="bibr" rid="ref-36">36</xref>]. The computational platform comprises an AMD EPYC 7502P processor (32 cores, 64 threads), an NVIDIA GeForce RTX 3090 GPU, and 128 GB of RAM.</p>
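<p>The loss and learning rate schedule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' training code: the function names are hypothetical, the Charbonnier constant <monospace>eps</monospace> and the minimum learning rate are assumed values, and the schedule is taken to be a single cosine decay over the 30,000 iterations without restarts.</p>
<code language="python"><![CDATA[
```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over flat pixel sequences.

    eps is an assumed smoothing constant; the paper does not state its value here.
    """
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` for one cosine decay from base_lr to min_lr."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos_term


TOTAL_ITERS = 30000  # total training iterations stated in the text
BASE_LR = 2e-4       # initial rate quoted for the remaining modules

for step in (0, 15000, 30000):
    print(step, cosine_annealing_lr(step, TOTAL_ITERS, BASE_LR))

print(charbonnier_loss([0.2, 0.8], [0.2, 0.5]))
```
]]></code>
<p>The Charbonnier penalty behaves like an L1 loss for large errors but stays differentiable at zero, which is the usual motivation for preferring it over a plain absolute difference.</p>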
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Ablation Study</title>
<p>An ablation study is conducted to evaluate the individual contributions of the Enhanced Residual Structure (ERS) and the 3D Convolution Module (3DCM) within the proposed 3D-ERVSNet framework. <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref> present the quantitative results on the UDM10 and Vid4 (BD degradation) datasets, respectively.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Ablation study results of 3D-ERVSNet on the UDM10 dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Enhanced residual structure</th>
<th>3D convolution</th>
<th>PSNR (dB)</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>&#x2717;</td>
<td>&#x2717;</td>
<td>33.46</td>
<td>0.9308</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2717;</td>
<td>34.86</td>
<td>0.9436</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>35.07</td>
<td>0.9459</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Ablation study results of 3D-ERVSNet on the Vid4 (BD) dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Enhanced residual structure</th>
<th>3D convolution</th>
<th>PSNR (dB)</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>&#x2717;</td>
<td>&#x2717;</td>
<td>24.45</td>
<td>0.7456</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2717;</td>
<td>24.74</td>
<td>0.7714</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>25.20</td>
<td>0.7720</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The baseline model employs a standard residual structure and lacks both the ERS and the 3DCM. As shown in <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref>, integrating the ERS improves both PSNR and SSIM on both datasets (&#x002B;1.40 dB PSNR and &#x002B;0.0128 SSIM on UDM10; &#x002B;0.29 dB and &#x002B;0.0258 on Vid4). These gains indicate that the ERS strengthens feature extraction, facilitating the extraction of richer features while preserving the shallow texture information present in the video frames. Adding the 3DCM further improves performance on both datasets (&#x002B;0.21 dB PSNR on UDM10 and &#x002B;0.46 dB on Vid4), although the accompanying SSIM gain on Vid4 is marginal (&#x002B;0.0006). This improvement supports the efficacy of the 3DCM in refining spatio-temporal details and implicitly capturing inter-frame dependencies. Consequently, combining the ERS and the 3DCM yields the best reconstruction performance among the tested configurations.</p>
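<p>The per-component gains discussed above follow directly from the entries of Tables 1 and 2. The short script below reproduces that arithmetic; the dictionary layout and the helper name <monospace>gain</monospace> are illustrative, while the numbers are copied from the two tables.</p>
<code language="python"><![CDATA[
```python
# (PSNR in dB, SSIM) for each configuration, copied from Tables 1 and 2.
ablation = {
    "UDM10": {
        "baseline": (33.46, 0.9308),
        "ERS": (34.86, 0.9436),
        "ERS+3DCM": (35.07, 0.9459),
    },
    "Vid4 (BD)": {
        "baseline": (24.45, 0.7456),
        "ERS": (24.74, 0.7714),
        "ERS+3DCM": (25.20, 0.7720),
    },
}


def gain(rows, before, after):
    """Return the (PSNR, SSIM) improvement between two configurations."""
    (psnr_0, ssim_0), (psnr_1, ssim_1) = rows[before], rows[after]
    return round(psnr_1 - psnr_0, 2), round(ssim_1 - ssim_0, 4)


for name, rows in ablation.items():
    print(name, "ERS:", gain(rows, "baseline", "ERS"),
          "3DCM:", gain(rows, "ERS", "ERS+3DCM"))
```
]]></code>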

</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Experimental Results</title>
<p><xref ref-type="table" rid="table-3">Table 3</xref> summarizes the quantitative comparison against other methods. The proposed 3D-ERVSNet demonstrates competitive or superior performance across all three benchmark datasets. Notably, it achieves the highest PSNR and SSIM values on the REDS (30.95 dB/0.8822) and Vim4 (32.78 dB/0.8987) datasets. On the challenging Vid4 benchmark with BD degradation, 3D-ERVSNet (27.02 dB/0.8224) trails strong methods such as RBPN (27.12 dB/0.8180) and EDVR-M (27.10 dB/0.8186) by at most 0.10 dB in PSNR while achieving the highest SSIM score.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Quantitative performance comparison on REDS, Vid4, and Vim4 datasets for <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> video super-resolution. (&#x2018;&#x2013;&#x2019; indicates results not reported in the original papers or unavailable, <styled-content style-type="color" style="color: #EE0000;">best</styled-content> and <styled-content style-type="color" style="color: #4472C4;">second-best</styled-content> results are highlighted in <styled-content style-type="color" style="color: #EE0000;">red</styled-content> and <styled-content style-type="color" style="color: #4472C4;">blue</styled-content>)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Methods</th>
<th colspan="2">REDS dataset</th>
<th colspan="2">Vid4 dataset</th>
<th colspan="2">Vim4 dataset</th>
</tr>
<tr>
<th>PSNR (dB)</th>
<th>SSIM</th>
<th>PSNR (dB)</th>
<th>SSIM</th>
<th>PSNR (dB)</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bicubic [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>26.14</td>
<td>0.7292</td>
<td>23.58</td>
<td>0.6363</td>
<td>29.01</td>
<td>0.8408</td>
</tr>
<tr>
<td>TOFlow [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>27.98</td>
<td>0.7990</td>
<td>25.89</td>
<td>0.7651</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>FSTRN [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>24.76</td>
<td>0.7200</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DUF [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>28.63</td>
<td>0.8251</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>RBPN [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>30.09</td>
<td>0.8590</td>
<td><styled-content style-type="color" style="color: #EE0000;">27.12</styled-content></td>
<td>0.8180</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>EDVR-M [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>30.53</td>
<td>0.8699</td>
<td><styled-content style-type="color" style="color: #4472C4;">27.10</styled-content></td>
<td><styled-content style-type="color" style="color: #4472C4;">0.8186</styled-content></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PFNL [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>29.63</td>
<td>0.8502</td>
<td>26.73</td>
<td>0.8029</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>MuCAN [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td><styled-content style-type="color" style="color: #4472C4;">30.88</styled-content></td>
<td><styled-content style-type="color" style="color: #4472C4;">0.8750</styled-content></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td><styled-content style-type="color" style="color: #4472C4;">32.75</styled-content></td>
<td><styled-content style-type="color" style="color: #4472C4;">0.8970</styled-content></td>
</tr>
<tr>
<td>TMNet [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>29.91</td>
<td>0.8633</td>
<td>26.23</td>
<td>0.8041</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>STDAN [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>29.98</td>
<td>0.8613</td>
<td>26.28</td>
<td>0.8041</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DSRNet [<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>26.72</td>
<td>0.8002</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>VESPCN [<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>25.35</td>
<td>0.7557</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SPMC [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>25.52</td>
<td>0.7600</td>
<td>31.92</td>
<td>0.8843</td>
</tr>
<tr>
<td>FRVSR [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>26.10</td>
<td>0.7755</td>
<td>31.59</td>
<td>0.8811</td>
</tr>
<tr>
<td>SOFVSR [<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>26.04</td>
<td>0.7753</td>
<td>32.03</td>
<td>0.8852</td>
</tr>
<tr>
<td>3D-ERVSNet (Ours)</td>
<td><styled-content style-type="color" style="color: #EE0000;">30.95</styled-content></td>
<td><styled-content style-type="color" style="color: #EE0000;">0.8822</styled-content></td>
<td>27.02</td>
<td><styled-content style-type="color" style="color: #EE0000;">0.8224</styled-content></td>
<td><styled-content style-type="color" style="color: #EE0000;">32.78</styled-content></td>
<td><styled-content style-type="color" style="color: #EE0000;">0.8987</styled-content></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-4">Table 4</xref> compares model complexity and inference speed. The proposed 3D-ERVSNet achieves a favorable balance, offering competitive performance (<xref ref-type="table" rid="table-3">Table 3</xref>) with a moderate parameter count (6.3M parameters, inclusive of the frozen SPyNet) and the shortest runtime among the compared methods (77 ms per frame), roughly 20 times faster than RBPN (1507 ms) and 13 times faster than DUF (974 ms). This efficiency makes 3D-ERVSNet practical.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Model complexity comparison: parameters (millions) and runtime (milliseconds) for processing a single frame of size 180 &#x00D7; 320</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Params (M)</th>
<th>Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>RBPN [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>12.2</td>
<td>1507</td>
</tr>
<tr>
<td>EDVR [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>20.6</td>
<td>378</td>
</tr>
<tr>
<td>DUF [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>5.8</td>
<td>974</td>
</tr>
<tr>
<td>FRVSR [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>3.3</td>
<td>118</td>
</tr>
<tr>
<td>3D-ERVSNet (Ours)</td>
<td>6.3</td>
<td>77</td>
</tr>
</tbody>
</table>
</table-wrap>
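<p>The runtimes in Table 4 can be converted into throughput and relative speed with simple arithmetic. The snippet below is a small sketch of that conversion; the helper names are hypothetical, and the runtimes are the per-frame values listed in the table.</p>
<code language="python"><![CDATA[
```python
# Per-frame runtime in milliseconds, copied from Table 4.
runtime_ms = {"RBPN": 1507, "EDVR": 378, "DUF": 974, "FRVSR": 118, "3D-ERVSNet": 77}


def frames_per_second(ms_per_frame):
    """Convert a per-frame latency in milliseconds into frames per second."""
    return 1000.0 / ms_per_frame


def relative_runtime(method, reference="3D-ERVSNet"):
    """How many times longer `method` takes per frame than `reference`."""
    return runtime_ms[method] / runtime_ms[reference]


for name, ms in runtime_ms.items():
    print(f"{name}: {frames_per_second(ms):.1f} fps, "
          f"{relative_runtime(name):.1f}x the runtime of 3D-ERVSNet")
```
]]></code>
<p>At 77 ms per frame the model processes about 13 frames per second on 180 &#x00D7; 320 inputs, so it remains below typical real-time video rates on the reported hardware.</p>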
<p>Visual comparisons further substantiate the effectiveness of 3D-ERVSNet. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates that 3D-ERVSNet reconstructs finer texture details on the Vid4 dataset compared to other methods. In the zoomed upper-left region (marked area), our model reconstructs the structural texture of the mural with clear visibility, while competing methods fail to restore these patterns entirely. Similarly, <xref ref-type="fig" rid="fig-4">Fig. 4</xref> demonstrates superior sharpness and clarity in the marked region for results on the UDM10 dataset. Upon magnification, our result exhibits the closest resemblance to the HR ground truth in texture lines, whereas other methods either omit these lines or produce textures that deviate significantly from the HR reference. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> highlights the realism achieved by 3D-ERVSNet on the Vim4 dataset, particularly evident in the accurate reconstruction of intricate elements such as the door handle within the marked area. The vertical stripes on the truck are faithfully preserved without tilt, unlike the skewed outputs of other models. In addition, our method avoids generating false patterns (e.g., MuCAN erroneously reconstructs the pickup truck&#x2019;s door handle as a vertical line, while ours matches the GT structure). These qualitative observations align with the quantitative metrics and support the conclusion that 3D-ERVSNet recovers authentic spatial details and models temporal information robustly.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Visual comparison of &#x00D7;4 super-resolution results on the Vid4 &#x2018;city&#x2019; sequence. The proposed 3D-ERVSNet reconstructs sharper edges and more authentic textures (see marked region) compared to other methods. Ground Truth (GT) is shown for reference</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_69784-fig-3.tif"/>
</fig><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Visual comparison of &#x00D7;4 super-resolution results on the UDM10 dataset. 3D-ERVSNet produces results with enhanced sharpness and clarity in the marked region relative to competing approaches</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_69784-fig-4.tif"/>
</fig><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Visual comparison of &#x00D7;4 super-resolution results on the Vim4 dataset. The output of 3D-ERVSNet exhibits superior realism and detail, particularly noticeable in the reconstruction of complex structures like the door handle within the marked area</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_69784-fig-5.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Conclusion</title>
<p>This paper presents 3D-ERVSNet, a novel and efficient network for video super-resolution. The core of our approach is a Forward and Backward Bidirectional Propagation Module (FBBPM), designed to achieve robust inter-frame feature alignment. The FBBPM incorporates an Enhanced Residual Structure (ERS) with multi-layer skip connections, enabling effective integration of shallow and deep features to enrich frame representations and enhance texture recovery. Furthermore, a 3D Convolution Module (3DCM) is strategically integrated following the backward propagation path to implicitly capture spatio-temporal dependencies and refine feature expression, effectively complementing the explicit alignment of the FBBPM. The resulting architecture achieves an advantageous balance between performance and efficiency. With its relatively streamlined design, 3D-ERVSNet also converges rapidly. Extensive quantitative and qualitative evaluations on multiple benchmark datasets (REDS, Vid4, UDM10, Vim4) show that 3D-ERVSNet achieves highly competitive performance in reconstructing high-resolution video frames with rich textures and fine details, while maintaining lower computational complexity and faster inference than many existing high-performance VSR models.</p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This project was supported in part by the Basic and Applied Basic Research Foundation of Guangdong Province [2025A1515011566]; in part by the State Key Laboratory for Novel Software Technology, Nanjing University [KFKT2024B08]; in part by Leading Talents in Gusu Innovation and Entrepreneurship [ZXL2023170]; and in part by the Basic Research Programs of Taicang 2024 [TC2024JC32].</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization, Weiqiang Xin and Xi Chen; methodology, Chunwei Tian and Zheng Wang; software, Weiqiang Xin, Xi Chen and Bing Li; validation, Zheng Wang and Bing Li; formal analysis, Yufeng Tang and Zheng Wang; visualization, Weiqiang Xin and Yufeng Tang; supervision, Chunwei Tian. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data are openly available at <ext-link ext-link-type="uri" xlink:href="https://github.com/xwq325/3D-ERVSNet">https://github.com/xwq325/3D-ERVSNet</ext-link> (accessed on 30 June 2025).</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Song</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>A tree-guided CNN for image super-resolution</article-title>. <source>IEEE Trans Consum Electron</source>. <year>2025</year>;<volume>71</volume>(<issue>2</issue>):<fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCE.2025.3572732</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Grouped spatio-temporal alignment network for video super-resolution</article-title>. <source>IEEE Signal Process Lett</source>. <year>2022</year>;<volume>29</volume>:<fpage>2193</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1109/lsp.2022.3210874</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Learning spatial-temporal implicit neural representations for event-guided video super-resolution</article-title>. In: <conf-name>2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2023 Jun 17&#x2013;24</year>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>1557</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00156</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>CW</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Image super-resolution with an enhanced group convolutional neural network</article-title>. <source>Neural Netw</source>. <year>2022</year>;<volume>153</volume>(<issue>6</issue>):<fpage>373</fpage>&#x2013;<lpage>85</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neunet.2022.06.009</pub-id>; <pub-id pub-id-type="pmid">35779445</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Isobe</surname> <given-names>T</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Tai</surname> <given-names>YW</given-names></string-name></person-group>. <article-title>Compression-aware video super-resolution</article-title>. In: <conf-name>2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2023 Jun 17&#x2013;24</year>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>2012</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00200</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lv</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gander</surname> <given-names>AJ</given-names></string-name></person-group>. <chapter-title>Video enhancement and super-resolution</chapter-title>. In: <source>Digital image enhancement and reconstruction</source>. <publisher-loc>Amsterdam, The Netherlands</publisher-loc>: <publisher-name>Elsevier</publisher-name>; <year>2023</year>. p. <fpage>1</fpage>&#x2013;<lpage>28</lpage>. doi:<pub-id pub-id-type="doi">10.1016/b978-0-32-398370-9.00008-1</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Haghighi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Dianati</surname> <given-names>M</given-names></string-name>, <string-name><surname>Donzella</surname> <given-names>V</given-names></string-name>, <string-name><surname>Debattista</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Accelerating stereo image simulation for automotive applications using neural stereo super resolution</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2023</year>;<volume>24</volume>(<issue>11</issue>):<fpage>12627</fpage>&#x2013;<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2023.3287912</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Video super-resolution based on inter-frame information utilization for intelligent transportation</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2023</year>;<volume>24</volume>(<issue>11</issue>):<fpage>13409</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2023.3237708</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>M</given-names></string-name>, <string-name><surname>Li</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Perceptive self-supervised learning network for noisy image watermark removal</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2024</year>;<volume>34</volume>(<issue>8</issue>):<fpage>7069</fpage>&#x2013;<lpage>79</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCSVT.2024.3349678</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>An efficient accelerator based on lightweight deformable 3D-CNN for video super-resolution</article-title>. <source>IEEE Trans Circuits Syst I Regul Pap</source>. <year>2023</year>;<volume>70</volume>(<issue>6</issue>):<fpage>2384</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCSI.2023.3258446</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>CW</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Heterogeneous window transformer for image denoising</article-title>. <source>IEEE Trans Syst Man Cybern Syst</source>. <year>2024</year>;<volume>54</volume>(<issue>11</issue>):<fpage>6621</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TSMC.2024.3429345</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>JK</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>KM</given-names></string-name></person-group>. <article-title>Deeply-recursive convolutional network for image super-resolution</article-title>. In: <conf-name>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2016 Jun 27&#x2013;30</year>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>1637</fpage>&#x2013;<lpage>45</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.181</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>JK</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>KM</given-names></string-name></person-group>. <article-title>Accurate image super-resolution using very deep convolutional networks</article-title>. In: <conf-name>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2016 Jun 27&#x2013;30</year>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>1646</fpage>&#x2013;<lpage>54</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.182</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jiao</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>CW</given-names></string-name></person-group>. <article-title>A self-supervised CNN for image watermark removal</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2024</year>;<volume>34</volume>(<issue>8</issue>):<fpage>7566</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCSVT.2024.3375831</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Active fine-tuning from gMAD examples improves blind image quality assessment</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2022</year>;<volume>44</volume>(<issue>9</issue>):<fpage>4577</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2021.3071759</pub-id>; <pub-id pub-id-type="pmid">33830918</pub-id></mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Troubleshooting blind image quality models in the wild</article-title>. In: <conf-name>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2021 Jun 20&#x2013;25</year>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>16251</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr46437.2021.01599</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Loy</surname> <given-names>CC</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>X</given-names></string-name></person-group>. <chapter-title>Learning a deep convolutional network for image super-resolution</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Fleet</surname> <given-names>D</given-names></string-name>, <string-name><surname>Pajdla</surname> <given-names>T</given-names></string-name>, <string-name><surname>Schiele</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tuytelaars</surname> <given-names>T</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2014; 2014 Sep 6&#x2013;12</source>; <publisher-loc>Zurich, Switzerland. Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2014</year>. p. <fpage>184</fpage>&#x2013;<lpage>99</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-10593-2_13</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kappeler</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yoo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Katsaggelos</surname> <given-names>AK</given-names></string-name></person-group>. <article-title>Video super-resolution with convolutional neural networks</article-title>. <source>IEEE Trans Comput Imag</source>. <year>2016</year>;<volume>2</volume>(<issue>2</issue>):<fpage>109</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCI.2016.2532323</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xue</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>D</given-names></string-name>, <string-name><surname>Freeman</surname> <given-names>WT</given-names></string-name></person-group>. <article-title>Video enhancement with task-oriented flow</article-title>. <source>Int J Comput Vis</source>. <year>2019</year>;<volume>127</volume>(<issue>8</issue>):<fpage>1106</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-018-01144-2</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Enhancing image quality via style transfer for single image super-resolution</article-title>. <source>IEEE Signal Process Lett</source>. <year>2018</year>;<volume>25</volume>(<issue>4</issue>):<fpage>571</fpage>&#x2013;<lpage>5</lpage>. doi:<pub-id pub-id-type="doi">10.1109/LSP.2018.2805809</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Haris</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shakhnarovich</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ukita</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Recurrent back-projection network for video super-resolution</article-title>. In: <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2019 Jun 15&#x2013;20</year>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>3892</fpage>&#x2013;<lpage>901</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr.2019.00402</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chan</surname> <given-names>KCK</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Loy</surname> <given-names>CC</given-names></string-name></person-group>. <article-title>EDVR: video restoration with enhanced deformable convolutional networks</article-title>. In: <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2019 Jun 16&#x2013;17</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>1954</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvprw.2019.00247</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Video super-resolution with non-local alignment network</article-title>. <source>IET Image Process</source>. <year>2021</year>;<volume>15</volume>(<issue>8</issue>):<fpage>1655</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.1049/ipr2.12134</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A novel fuzzy neural network architecture search framework for defect recognition with uncertainties</article-title>. <source>IEEE Trans Fuzzy Syst</source>. <year>2024</year>;<volume>32</volume>(<issue>5</issue>):<fpage>3274</fpage>&#x2013;<lpage>85</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TFUZZ.2024.3373792</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Teed</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>J</given-names></string-name></person-group>. <article-title>RAFT: recurrent all-pairs field transforms for optical flow</article-title>. In: <conf-name>European Conference on Computer Vision; 2020 Aug 23&#x2013;28; Glasgow, UK</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>402</fpage>&#x2013;<lpage>19</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58536-5_24</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>MY</given-names></string-name>, <string-name><surname>Kautz</surname> <given-names>J</given-names></string-name></person-group>. <article-title>PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume</article-title>. In: <conf-name>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>8934</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00931</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ranjan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Black</surname> <given-names>MJ</given-names></string-name></person-group>. <article-title>Optical flow estimation using a spatial pyramid network</article-title>. In: <conf-name>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>2720</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.291</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deep residual learning for image recognition</article-title>. In: <conf-name>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27&#x2013;30</conf-name>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>770</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wipf</surname> <given-names>D</given-names></string-name>, <string-name><surname>Menze</surname> <given-names>B</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Face restoration via plug-and-play 3D facial priors</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2022</year>;<volume>44</volume>(<issue>12</issue>):<fpage>8910</fpage>&#x2013;<lpage>26</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2021.3123085</pub-id>; <pub-id pub-id-type="pmid">34705635</pub-id></mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ding</surname> <given-names>X</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>G</given-names></string-name>, <string-name><surname>Han</surname> <given-names>J</given-names></string-name></person-group>. <article-title>ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks</article-title>. In: <conf-name>2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27&#x2013;Nov 2</conf-name>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>1911</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/iccv.2019.00200</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nah</surname> <given-names>S</given-names></string-name>, <string-name><surname>Baik</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>S</given-names></string-name>, <string-name><surname>Moon</surname> <given-names>G</given-names></string-name>, <string-name><surname>Son</surname> <given-names>S</given-names></string-name>, <string-name><surname>Timofte</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Challenge on video deblurring and super-resolution: dataset and study</article-title>. In: <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2019 Jun 16&#x2013;17</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>1996</fpage>&#x2013;<lpage>2005</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvprw.2019.00251</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>D</given-names></string-name></person-group>. <article-title>On Bayesian adaptive video super resolution</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2014</year>;<volume>36</volume>(<issue>2</issue>):<fpage>346</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2013.127</pub-id>; <pub-id pub-id-type="pmid">24356354</pub-id></mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Schultz</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>L</given-names></string-name>, <string-name><surname>Stevenson</surname> <given-names>RL</given-names></string-name></person-group>. <article-title>Subpixel motion estimation for super-resolution image sequence enhancement</article-title>. <source>J Vis Commun Image Represent</source>. <year>1998</year>;<volume>9</volume>(<issue>1</issue>):<fpage>38</fpage>&#x2013;<lpage>50</lpage>. doi:<pub-id pub-id-type="doi">10.1006/jvci.1997.0370</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lai</surname> <given-names>WS</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>JB</given-names></string-name>, <string-name><surname>Ahuja</surname> <given-names>N</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Deep Laplacian pyramid networks for fast and accurate super-resolution</article-title>. In: <conf-name>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>5835</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.618</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lai</surname> <given-names>WS</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>JB</given-names></string-name>, <string-name><surname>Ahuja</surname> <given-names>N</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Fast and accurate image super-resolution with deep Laplacian pyramid networks</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2019</year>;<volume>41</volume>(<issue>11</issue>):<fpage>2599</fpage>&#x2013;<lpage>613</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2018.2865304</pub-id>; <pub-id pub-id-type="pmid">30106708</pub-id></mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>F</given-names></string-name>, <string-name><surname>Du</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Fast spatio-temporal residual network for video super-resolution</article-title>. In: <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>10514</fpage>&#x2013;<lpage>23</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr.2019.01077</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rifman</surname> <given-names>SS</given-names></string-name></person-group>. <article-title>Digital rectification of ERTS multispectral imagery</article-title>. In: <conf-name>NASA Goddard Space Flight Center Symposium on Significant Results Obtained from the ERTS-1; 1973 Mar 5&#x2013;9</conf-name>; <publisher-loc>Greenbelt, MD, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Oh</surname> <given-names>SW</given-names></string-name>, <string-name><surname>Kang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>SJ</given-names></string-name></person-group>. <article-title>Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation</article-title>. In: <conf-name>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>3224</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00340</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yi</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations</article-title>. In: <conf-name>2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27&#x2013;Nov 2</conf-name>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>3106</fpage>&#x2013;<lpage>15</lpage>. doi:<pub-id pub-id-type="doi">10.1109/iccv.2019.00320</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>T</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <chapter-title>MuCAN: multi-correspondence aggregation network for video super-resolution</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Vedaldi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bischof</surname> <given-names>H</given-names></string-name>, <string-name><surname>Brox</surname> <given-names>T</given-names></string-name>, <string-name><surname>Frahm</surname> <given-names>J-M</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2020; 2020 Aug 23&#x2013;28</source>; <publisher-loc>Glasgow, UK. Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>335</fpage>&#x2013;<lpage>51</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58607-2_20</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>X</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>MM</given-names></string-name></person-group>. <article-title>Temporal modulation network for controllable space-time video super-resolution</article-title>. In: <conf-name>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20&#x2013;25</conf-name>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>6384</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00632</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xiang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>STDAN: deformable attention network for space-time video super-resolution</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2024</year>;<volume>35</volume>(<issue>8</issue>):<fpage>10606</fpage>&#x2013;<lpage>16</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2023.3243029</pub-id>; <pub-id pub-id-type="pmid">37027773</pub-id></mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>DVSRNet: deep video super-resolution based on progressive deformable alignment and temporal-sparse enhancement</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2025</year>;<volume>36</volume>(<issue>2</issue>):<fpage>3258</fpage>&#x2013;<lpage>72</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tnnls.2023.3347450</pub-id>; <pub-id pub-id-type="pmid">38215317</pub-id></mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Caballero</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ledig</surname> <given-names>C</given-names></string-name>, <string-name><surname>Aitken</surname> <given-names>A</given-names></string-name>, <string-name><surname>Acosta</surname> <given-names>A</given-names></string-name>, <string-name><surname>Totz</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Real-time video super-resolution with spatio-temporal networks and motion compensation</article-title>. In: <conf-name>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>2848</fpage>&#x2013;<lpage>57</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.304</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Detail-revealing deep video super-resolution</article-title>. In: <conf-name>2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22&#x2013;29</conf-name>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>4482</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2017.479</pub-id>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sajjadi</surname> <given-names>MSM</given-names></string-name>, <string-name><surname>Vemulapalli</surname> <given-names>R</given-names></string-name>, <string-name><surname>Brown</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Frame-recurrent video super-resolution</article-title>. In: <conf-name>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>6626</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00693</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>X</given-names></string-name>, <string-name><surname>An</surname> <given-names>W</given-names></string-name></person-group>. <chapter-title>Learning for video super-resolution through HR optical flow estimation</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Jawahar</surname> <given-names>CV</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mori</surname> <given-names>G</given-names></string-name>, <string-name><surname>Schindler</surname> <given-names>K</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ACCV 2018; 2018 Dec 2&#x2013;6</source>; <publisher-loc>Perth, Australia</publisher-loc>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2019</year>. p. <fpage>514</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-20887-5_32</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>