<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">66803</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.066803</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>BSDNet: Semantic Information Distillation-Based for Bilateral-Branch Real-Time Semantic Segmentation on Street Scene Image</article-title>
<alt-title alt-title-type="left-running-head">BSDNet: Semantic Information Distillation-Based for Bilateral-Branch Real-Time Semantic Segmentation on Street Scene Image</alt-title>
<alt-title alt-title-type="right-running-head">BSDNet: Semantic Information Distillation-Based for Bilateral-Branch Real-Time Semantic Segmentation on Street Scene Image</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zeng</surname><given-names>Huan</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhang</surname><given-names>Jianxun</given-names></name><email>zjx@cqut.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Hongji</given-names></name></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhu</surname><given-names>Xinwei</given-names></name></contrib>
<aff id="aff-1"><institution>Department of Computer Science and Engineering, Chongqing University of Technology</institution>, <addr-line>Chongqing, 400054</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jianxun Zhang. Email: <email>zjx@cqut.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year></pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>23</day><month>09</month><year>2025</year></pub-date>
<volume>85</volume>
<issue>2</issue>
<fpage>3879</fpage>
<lpage>3896</lpage>
<history>
<date date-type="received">
<day>17</day>
<month>4</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>8</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_66803.pdf"></self-uri>
<abstract>
<p>Semantic segmentation in street scenes is a crucial technology for autonomous driving to analyze the surrounding environment. In street scenes, the high image resolution required for a wide field of view and the large differences in object scale degrade real-time performance and complicate multi-scale feature extraction. To address this, we propose a bilateral-branch real-time semantic segmentation method based on semantic information distillation (BSDNet) for street scene images. BSDNet consists of a Feature Conversion Convolutional Block (FCB), a Semantic Information Distillation Module (SIDM), and a Deep Aggregation Atrous Convolution Pyramid Pooling (DASP). FCB reduces the semantic gap between the backbone and the semantic branch. SIDM extracts high-quality semantic information from the Transformer branch to reduce computational costs. DASP aggregates information lost in atrous convolutions, effectively capturing multi-scale objects. Extensive experiments on Cityscapes, CamVid, and ADE20K, including an accuracy of 81.7<inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> Mean Intersection over Union (mIoU) at 70.6 Frames Per Second (FPS) on Cityscapes, demonstrate that our method achieves a better balance between accuracy and inference speed.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Street scene understanding</kwd>
<kwd>real-time semantic segmentation</kwd>
<kwd>knowledge distillation</kwd>
<kwd>multi-scale feature extraction</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62471075</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Major Science and Technology Project Grant</funding-source>
<award-id>KJZD-M202301901</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Graduate Innovation Fund of Chongqing</funding-source>
<award-id>gzlcx20253235</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Street scene understanding requires precise and comprehensive semantic information within the street environment. Compared to object detection or classification, semantic segmentation offers detailed pixel-level image classification by assigning semantic labels to each pixel. It is essential for the detailed understanding of diverse street scene objects, such as roads, vehicles, and pedestrians, and is widely applied in intelligent transportation systems, including autonomous driving and road monitoring.</p>
<p>Although the performance of semantic segmentation models has steadily improved with the advancement of deep learning technology, real-time semantic segmentation in street scenes remains a significant problem due to high processing costs and long inference times. First, images of street scenes are often high-resolution to achieve a wider field of view. The resolution of each image in the Cityscapes dataset, for example, is <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2048</mml:mn></mml:math></inline-formula>. When self-attention is applied to such images, the computational complexity grows quadratically with the image resolution. Second, there are a wide variety of object types in street scenes, with significant scale differences between objects like cars and pedestrians. Traditional multi-scale feature extraction methods often involve upsampling operations, which further increase computational costs. These issues pose significant challenges to real-time semantic segmentation in street scenes.</p>
<p>Existing real-time semantic segmentation models have made remarkable progress in performance, but at the cost of increased computational complexity. The RTFormer [<xref ref-type="bibr" rid="ref-1">1</xref>] method utilizes self-attention to capture high-quality long-range context. However, self-attention inherently exhibits quadratic complexity with respect to input resolution, limiting its efficiency on high-resolution images. Knowledge distillation has been adopted to enhance the efficiency of semantic segmentation models. Yet, it remains challenging to effectively distill knowledge between CNNs and Transformers due to their fundamentally different architectures. For multi-scale object extraction, DeepLab [<xref ref-type="bibr" rid="ref-2">2</xref>] introduces the Atrous Spatial Pyramid Pooling (ASPP), which employs atrous convolutions to obtain receptive fields of varying sizes. However, the multi-scale features extracted through atrous convolution often lack inter-layer correlation, leading to information loss. To integrate multi-scale contexts and increase the effective receptive field, DDRNet [<xref ref-type="bibr" rid="ref-3">3</xref>] proposed the Deep Aggregation Pyramid Pooling Module (DAPPM). However, the extensive upsampling operations in this module raise computational cost and degrade real-time performance.</p>
<p>To address these challenges, we propose BSDNet, a real-time semantic segmentation network for street scenes. The bilateral-branch structure adopted by BSDNet enables the lightweight CNN backbone to acquire high-quality semantic information while reducing computational complexity. To alleviate the adverse effects of structural differences between branches on semantic information distillation effectiveness, the FCB is designed to reduce feature discrepancies between the two branch models. Additionally, considering that different model architectures may learn distinct predictive distributions due to their inherent inductive biases, the SIDM is introduced. In SIDM, OFA loss [<xref ref-type="bibr" rid="ref-4">4</xref>] is employed to limit the impact of irrelevant information in logits. To address multi-scale objects in street scenes, the DASP is designed to capture features of different scales while enhancing correlations between them. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> compares BSDNet with other methods on the Cityscapes test set. The overall architecture of the proposed BSDNet is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The primary contributions of this paper can be summarized as follows:</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Comparison of speed-accuracy performance on the Cityscapes test set. Our method is marked with red stars, while other methods are marked with green dots</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-1.tif"/>
</fig><fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The overall structure of BSDNet. In the CNN backbone network, basic convolution blocks (blue squares) are used first, followed by FCB (green squares). In the semantic segmentation head, DASP stands for the Deep Aggregation Atrous Convolution Pyramid Pooling. SIDM between the two networks represents the Semantic Information Distillation Module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-2.tif"/>
</fig>
<p><list list-type="simple">
<list-item><label>1)</label><p>FCB is designed to reduce feature differences during knowledge distillation between models with different architectures through a more efficient attention mechanism and feedforward network.</p></list-item>
<list-item><label>2)</label><p>SIDM is intended to reduce the model inference time by allowing the CNN backbone to learn high-quality semantic information from the pre-trained Transformer branch.</p></list-item>
<list-item><label>3)</label><p>DASP is introduced to accurately and efficiently capture objects with significant scale differences in complex street scenes.</p></list-item>
<list-item><label>4)</label><p>Extensive experimental results on the Cityscapes, CamVid, and ADE20K datasets demonstrate that the proposed BSDNet achieves a better balance between accuracy and inference speed than state-of-the-art real-time semantic segmentation methods.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>In this section, the related works are divided into three parts: high-performance semantic segmentation, real-time semantic segmentation, and semantic information knowledge distillation.</p>
<sec id="s2_1">
<label>2.1</label>
<title>High-Performance Semantic Segmentation</title>
<p>In deep learning, the first method specifically designed for semantic segmentation was FCN [<xref ref-type="bibr" rid="ref-5">5</xref>]. Unlike traditional approaches that treat semantic segmentation as a region classification problem, FCN frames it as a pixel-level classification problem. Badrinarayanan et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] improved upon FCN by introducing SegNet. Compared to the FCN model, SegNet uses the corresponding max-pooling layer indices to restore the resolution of feature maps, showing better performance on low-resolution images. DeepLab introduced atrous convolution and atrous spatial pyramid pooling (ASPP) into the segmentation network, effectively expanding the receptive field. PSPNet [<xref ref-type="bibr" rid="ref-7">7</xref>] introduced multi-level feature fusion modules to handle context information at different scales. Lin et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] employed residual connections and multi-scale fusion techniques in the RefineNet network, and used deconvolution to achieve higher resolution in the output segmentation results.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Real-Time Semantic Segmentation</title>
<p>Early real-time semantic segmentation methods explored many lightweight network architectures, mainly reducing inference costs through channel compression and fast downsampling. For instance, Zhao et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] proposed a novel image cascade network in ICNet, which refines segmentation predictions by utilizing low-resolution semantic information and high-resolution image details. Yu et al. proposed BiSeNetV1 [<xref ref-type="bibr" rid="ref-10">10</xref>] and BiSeNetV2 [<xref ref-type="bibr" rid="ref-11">11</xref>], which combine shallow feature details with deep feature semantics. To reduce time-consuming auxiliary paths, Fan et al. presented STDC [<xref ref-type="bibr" rid="ref-12">12</xref>], a bilateral-branch network based on BiSeNet that encodes spatial information using a guidance module. DDRNet extracts high- and low-resolution features independently using a multi-resolution network in order to balance context information during the quick downsampling process. Some real-time semantic segmentation methods also employ transformers to enhance performance. Zhang et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] introduced the self-attention mechanism in TopFormer, but using transformers on low-resolution feature maps led to lower accuracy. RTFormer proposed a more GPU-friendly attention mechanism. SeaFormer [<xref ref-type="bibr" rid="ref-14">14</xref>] is a lightweight transformer model that compresses the spatial dimensions of the input feature map to lower computing cost.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Knowledge Distillation</title>
<p>Knowledge distillation is a model compression technique that lets a lightweight model retain high performance, which helps achieve high-performance real-time semantic segmentation. Some logit-based knowledge distillation methods have been improved through model ensembles, contrastive learning, and other techniques. Touvron et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] introduced a novel distillation process based on distillation tokens when training transformer students. To narrow the large capacity gap between teacher and student models, Huang et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed relaxing the precise matching based on KL divergence. Furthermore, Romero et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] first proposed a hint-based distillation method, in which the student features are projected into the teacher&#x2019;s feature space through convolutional layers. To adapt to the characteristics of dense prediction tasks, Xie et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] computed the Euclidean distance between the central pixel and its 8-neighboring pixels, constructing a local similarity graph. Liu et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] suggested a method to capture structured information between pixels and global correlation. To focus the learning process on intra-class feature variations, Wang et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] employed the cosine distance between pixel features and corresponding class prototypes to learn structural knowledge. Shu et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] proposed a distillation loss function that pays more attention to the most salient regions across channels. Currently, knowledge distillation is widely applied in semantic segmentation tasks [<xref ref-type="bibr" rid="ref-22">22</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>].
TriKD [<xref ref-type="bibr" rid="ref-26">26</xref>] offers a three-view knowledge distillation framework for semi-supervised semantic segmentation. For few-shot unsupervised semantic segmentation tasks, Li et al. [<xref ref-type="bibr" rid="ref-27">27</xref>] designed a semi-supervised semantic segmentation framework. Xu et al. [<xref ref-type="bibr" rid="ref-28">28</xref>] developed a single-branch real-time semantic segmentation model.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Method</title>
<p>In this section, the overall framework of BSDNet is first introduced, followed by a detailed description of the proposed FCB, SIDM, and DASP.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Framework Overview</title>
<p>As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the overall architecture comprises a CNN backbone for efficient inference and a Transformer branch for semantic feature extraction. Within the CNN backbone, the attention mechanism embedded in the FCB module mitigates the semantic gap between the two branches. SIDM serves as a bridge between the two branches, enabling the backbone network to extract semantic information from the pre-trained Transformer. The decoder of the backbone network incorporates a DASP module for multi-scale feature extraction, followed by a segmentation head. Specifically, features from the fourth stage are first fed into the DASP to capture multi-scale features, which are then fused with features from the second stage to obtain rich contextual information. The resulting features are subsequently passed to the segmentation head, and finally processed by a <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutional classifier to obtain accurate segmentation results.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Feature Conversion Convolutional Block</title>
<p>In the designed network architecture, both an efficient CNN and a Transformer capable of extracting high-quality contextual information are used. However, structural differences between the networks can hinder knowledge distillation, especially when the teacher&#x2019;s features exceed the processing capacity of the student [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
<p>As seen in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> (right), an FCB is designed to reduce the discrepancy between the information learned by the Transformer branch and the CNN branch. In FCB, the DEC Attention module employs convolution operations to compute attention, enabling the CNN branch to capture Transformer-like features. The structure of FCB is derived from the typical Transformer encoder structure, and it can be described as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Norm</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mtext>DECAttention</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>x</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Norm</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FFN</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Structure of DEC Attention method (left) and design of Feature Conversion Convolutional Block (right). The DEC Attention method consists of a detail enhancement branch and the process of attention computation. <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>k</mml:mi></mml:math></inline-formula> denotes the convolution kernel size</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-3.tif"/>
</fig>
<p>where <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mrow><mml:mi mathvariant="normal">N</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes batch normalization, and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the input features, intermediate features, and output features, respectively.</p>
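<p>The two-step structure of Eq. (1) can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the tensor layout (batch, channels, height, width), the function names <monospace>batch_norm</monospace> and <monospace>fcb_block</monospace>, and the two placeholder callables standing in for DEC Attention and the FFN are hypothetical and are not taken from the authors&#x2019; implementation; the normalization also omits the learned scale and shift.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Per-channel normalization over batch and spatial axes.

    Simplified: no learned scale/shift and no running statistics.
    x has the assumed layout (batch, channels, height, width).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def fcb_block(x_i, attention, ffn):
    """Eq. (1): residual + norm around attention, then around the FFN."""
    x_m = batch_norm(x_i + attention(x_i))  # intermediate features
    x_o = batch_norm(x_m + ffn(x_m))        # output features
    return x_o


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))

# Hypothetical stand-ins for DEC Attention and the FFN.
out = fcb_block(x, attention=lambda t: 0.5 * t, ffn=np.tanh)
```
]]></code>
<p>In this sketch the output keeps the input shape, and each channel is re-centered after both residual additions, which mirrors the post-normalization ordering written in Eq. (1).</p>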
<p>The proposed Detail Enhancing Convolutional Attention (DEC) method is designed for real-time semantic segmentation, which requires low latency and efficient feature extraction. Its specific structure is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> (left). It linearly combines the channel values of each pixel using convolution kernels, operating only along the channel dimension without involving spatial position interactions. For the input feature map <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and convolution kernel <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>W</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>Y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is the output of pixel-by-pixel convolution. The detailed calculation process can be described as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> denotes the value of the input feature at position <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with channel index <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <italic>H</italic> and <italic>W</italic> represent the height and width of the input feature, and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denote the numbers of channels in the input and output feature maps, respectively.</p>
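<p>As a small worked illustration of Eq. (2), the pixel-by-pixel convolution can be written as a contraction over the input-channel axis. The sketch below assumes the height-width-channel layout used in Eq. (2) and drops the two singleton axes of the kernel; the function name <monospace>pointwise_conv</monospace> and the toy sizes are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def pointwise_conv(x, w):
    """Pixel-by-pixel (1x1) convolution as in Eq. (2).

    x: (H, W, C_in) input feature map.
    w: (C_in, C_out) kernel, i.e. the 1 x 1 x C_in x C_out kernel
       with its two singleton spatial axes removed.
    Returns an (H, W, C_out) feature map.
    """
    return np.einsum("ijc,co->ijo", x, w)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 4))   # H=3, W=5, C_in=4
w = rng.standard_normal((4, 6))      # C_in=4, C_out=6
y = pointwise_conv(x, w)

# One output value computed by the explicit sum over input channels.
manual = sum(x[1, 2, c] * w[c, 0] for c in range(4))
```
]]></code>
<p>Each output value depends only on the channel vector at the same spatial position, so no spatial interaction takes place and the operation is equivalent to a matrix product applied independently at every pixel.</p>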
<p>The calculation of attention is highly sensitive to the size of the input feature map. To address this, Grouped Double Normalization (GDN) is adopted to compute attention weights in different ways along two dimensions. Specifically, softmax is applied across the spatial dimension to generate spatial attention weights, and grouped L2 normalization is employed along the channel dimension. This approach processes parameters in groups to improve efficiency while balancing the weights across different groups, preventing instability caused by excessively large weights in a single group. It also increases the diversity among the attention maps of different query points and thus captures richer semantic representations.</p>
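<p>One possible reading of the two-dimensional normalization described above is sketched below. The paper does not give the exact formulation, so several details are assumptions: the score layout (spatial positions by channels), the number of groups, the small constant <monospace>eps</monospace>, and the interpretation of the grouped L2 step as division by the L2 norm within each channel group. The function name <monospace>grouped_double_norm</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def grouped_double_norm(scores, groups, eps=1e-6):
    """Softmax over the spatial axis, then per-group L2 scaling over channels.

    scores: (N, C) array, N spatial positions and C channels (assumed layout).
    groups: number of channel groups; C must be divisible by groups.
    """
    n, c = scores.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")

    # Step 1: softmax across spatial positions (axis 0), numerically stabilised.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Step 2: divide each channel group by its L2 norm at every position.
    grouped = weights.reshape(n, groups, c // groups)
    norms = np.sqrt((grouped ** 2).sum(axis=2, keepdims=True))
    grouped = grouped / (norms + eps)
    return grouped.reshape(n, c)


rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 8))          # 6 positions, 8 channels
attn = grouped_double_norm(scores, groups=2)  # 2 groups of 4 channels
```
]]></code>
<p>Under these assumptions every channel group has unit L2 norm at each position, so no single group can dominate the attention weights.</p>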
<p>To reduce computational complexity, convolution kernels of <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> are used in the horizontal and vertical directions to replace the standard <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> convolution. Attention is then computed separately in both directions. While this approach is efficient, it lacks local information. Therefore, a detail enhancement branch is introduced in the attention mechanism. The initially computed <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>q</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>v</mml:mi></mml:math></inline-formula> are concatenated along the channel dimension, and a <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> depthwise convolution is applied to aggregate auxiliary local details. The output is then processed by a linear projection with an activation function and batch normalization. This process compresses the channel dimension and produces the detail enhancement weights. Finally, the detail-enhanced features are fused with the attention features, and the process can be described as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mi>h</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mi>v</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>K</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msup><mml:mi>K</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> represent the input image, query, and key, respectively, and <italic>C</italic>, <italic>H</italic>, <italic>W</italic> denote the number of channels, height, and width of the feature map. <italic>N</italic> represents the number of learnable parameters, and <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>k</mml:mi></mml:math></inline-formula> denotes the size of the convolution kernel. <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> refers to the grouped double normalization. <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>F</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> represents the detailed features obtained from the detail enhancement branch.</p>
<p>In the FCB, computationally expensive matrix operations for attention are replaced by per-pixel convolutions, which preserve the original spatial structure and are beneficial for feature extraction. DEC Attention computes attention using stripe convolutions in both horizontal and vertical directions, which reduces computational cost compared to standard convolutions. By adding local information to the extracted attention features, the detail enhancement branch may enhance the model&#x2019;s overall performance. A more efficient FFN (<xref ref-type="fig" rid="fig-3">Fig. 3</xref>, right) is employed, in which depthwise separable convolutions are used to perform convolution independently on each channel, significantly reducing computational cost and better preserving channel-wise independence. In FFN, residual connections are incorporated to mitigate the vanishing gradient problem [<xref ref-type="bibr" rid="ref-29">29</xref>].</p>
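<p>As a rough illustration of the savings described above, the following Python sketch counts convolution weights for three layer types: a standard <italic>k</italic> &#x00D7; <italic>k</italic> convolution, a pair of horizontal and vertical stripe convolutions, and a depthwise separable convolution. The channel width of 64, the kernel size of 7, and all function names are illustrative assumptions rather than the configuration used in BSDNet, and the stripe convolutions are assumed to be dense (non-grouped) for simplicity.</p>
<preformat preformat-type="code"><![CDATA[
```python
def conv_weights(c_in, c_out, k_h, k_w, groups=1):
    """Number of weights in a 2-D convolution layer (bias ignored)."""
    return (c_in // groups) * c_out * k_h * k_w


def standard_conv(channels, k):
    # Dense k x k convolution.
    return conv_weights(channels, channels, k, k)


def stripe_pair(channels, k):
    # One horizontal 1 x k stripe plus one vertical k x 1 stripe.
    return conv_weights(channels, channels, 1, k) + conv_weights(channels, channels, k, 1)


def depthwise_separable(channels, k):
    # Depthwise k x k (one filter per channel) followed by a pointwise 1 x 1.
    depthwise = conv_weights(channels, channels, k, k, groups=channels)
    pointwise = conv_weights(channels, channels, 1, 1)
    return depthwise + pointwise


CHANNELS, KERNEL = 64, 7  # hypothetical sizes, chosen only for illustration
costs = {
    "standard": standard_conv(CHANNELS, KERNEL),
    "stripe_pair": stripe_pair(CHANNELS, KERNEL),
    "depthwise_separable": depthwise_separable(CHANNELS, KERNEL),
}
print(costs)
```
]]></preformat>
<p>Under these assumptions, the stripe pair needs 2/<italic>k</italic> of the weights of the standard convolution (57,344 vs. 200,704 for <italic>k</italic> = 7), and at stride 1 the per-pixel multiply-accumulate count scales in the same proportion.</p>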
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Semantic Information Distillation Module</title>
<p>Previous knowledge distillation methods [<xref ref-type="bibr" rid="ref-30">30</xref>] have mostly been used for learning between similar models, whereas our model needs to learn from two different types of models. To enable the lightweight CNN backbone to efficiently extract features, SIDM is designed to extract semantic information from a pre-trained Transformer branch. As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, this module can be divided into intermediate feature alignment and logits alignment.</p>
<p>Intermediate Feature Alignment: During knowledge distillation, differences in model structure are reflected in the feature spaces, so that features from different architectures lie in different latent spaces [<xref ref-type="bibr" rid="ref-4">4</xref>]. Using a basic similarity measure, such as the Mean Squared Error (MSE) loss [<xref ref-type="bibr" rid="ref-15">15</xref>], for information extraction does not ensure effective alignment of the learned features and may negatively impact model performance. In the designed Feature Conversion Block (FCB), attention is computed via convolution operations, which yields intermediate features whose structure is similar to that of the Transformer features and allows features from the two models to be represented similarly. During feature alignment, the student features derived from the CNN are first projected onto the teacher feature dimensions of the Transformer. The resolution is then adjusted by upsampling or downsampling rather than aligning the raw features directly. Next, the CNN student features are adjusted so that their statistical properties are consistent with those of the teacher features. Finally, the semantic loss is computed between the adjusted CNN student features and the Transformer teacher features. To ensure that the student model focuses on semantic rather than spatial information, CWD Loss is employed as the alignment loss, and its computation can be summarized as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>H</mml:mi></mml:mrow></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula> is the channel index, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo></mml:math></inline-formula> indexes the spatial locations within a channel, and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>x</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>x</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:math></inline-formula> represent the feature maps of the Transformer teacher and CNN student, respectively. <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow></mml:math></inline-formula> is a hyperparameter called temperature; the larger its value, the softer the probability distribution, allowing a wider region of each channel to be considered. <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> computes a channel-wise probability distribution from the feature activation map, mitigating the impact of scale differences between the two models.</p>
<p>The process of computing the channel discrepancy between the student and teacher models can be formulated as:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>&#x03C6;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>S</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:msup><mml:mrow><mml:mi>&#x1D4AF;</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mi>C</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:munderover><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>&#x03C6;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is used to evaluate the difference between the two networks. To minimize <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>&#x03C6;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, as the value of <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> increases, the value of <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> must also increase; conversely, when <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> decreases, the influence of <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> on the KL divergence becomes smaller.
This asymmetric learning mechanism makes the KL divergence focus more on the foreground salient regions with high probabilities predicted by the teacher model, while reducing attention to the background areas.</p>
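<p>Eqs. (4) and (5) can be sketched in a few lines of plain Python. The sketch below is a minimal illustration under stated assumptions, not the training code of BSDNet: each feature map is represented as a list of <italic>C</italic> channels, each channel is a flat list of <italic>H</italic>&#x22C5;<italic>W</italic> activations, the student features are assumed to be already projected to the teacher shape, and the temperature value of 4.0 and the function names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def channel_softmax(channel, tau):
    """Eq. (4): softmax over the H*W spatial positions of one channel."""
    peak = max(channel)  # subtracting the maximum leaves the softmax unchanged
    exps = [math.exp((value - peak) / tau) for value in channel]
    total = sum(exps)
    return [e / total for e in exps]


def cwd_loss(teacher, student, tau=4.0):
    """Eq. (5): temperature-scaled KL(teacher || student), averaged over channels.

    teacher, student: lists of C channels, each a flat list of H*W activations.
    tau=4.0 is a hypothetical default, not a value reported in the paper.
    """
    num_channels = len(teacher)
    total = 0.0
    for t_channel, s_channel in zip(teacher, student):
        p_t = channel_softmax(t_channel, tau)
        p_s = channel_softmax(s_channel, tau)
        # Positions where the teacher probability is zero contribute nothing.
        total += sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return (tau ** 2) / num_channels * total


teacher_map = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 5.0, 0.0]]
student_map = [[4.0, 3.0, 2.0, 1.0], [5.0, 0.0, 0.0, 0.0]]
print(cwd_loss(teacher_map, teacher_map))  # identical maps give zero loss
print(cwd_loss(teacher_map, student_map))  # mismatched maps give a positive loss
```
]]></preformat>
<p>Because every channel is normalized into a distribution over spatial positions before the comparison, multiplying a channel by a constant changes the loss only through the sharpness of that distribution, which is the scale-robustness property mentioned above.</p>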
<p>Logits Alignment: When features are passed through the segmentation head to obtain logits, they no longer contain model-specific information as intermediate features do. Therefore, different models can be aligned directly in the logits space. However, although the models share the same learning objective in the logits space, their different inductive biases often lead to different prediction distributions. For example, CNNs excel at capturing local information shared across categories, while Transformers are better at learning global features through attention mechanisms. Given these learning biases, OFA Loss is used to align different models in the logits space. It employs an adaptive target information enhancement method, which adds a term related only to the target class, guiding the student model to learn from a more confident teacher. This process can be formulated as:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi 
mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>c</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> represent the predicted class and the target class, respectively, and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>S</mml:mi></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup></mml:math></inline-formula> denote the probabilities assigned to the target class by the student model and teacher model, respectively.</p>
<p>To regulate the relationship between the teacher and student model distributions and encourage better alignment of the target class in the student model, a modulating exponent <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> is applied to the term <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup></mml:math></inline-formula> to enhance the target class. The distillation loss function can be described as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>O</mml:mi><mml:mi>F</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msup><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd 
/><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">(</mml:mo></mml:mrow></mml:mstyle><mml:mfrac linethickness="0"><mml:mi>&#x03B8;</mml:mi><mml:mi>k</mml:mi></mml:mfrac><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi>k</mml:mi></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>S</mml:mi></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>where the term added with the parameter <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> in <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>O</mml:mi><mml:mi>F</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a positive term that is only related to the target class. If the teacher model is confident in the target class, the higher-order term with the parameter <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> decays slowly. Conversely, if the teacher model is less confident, the decay accelerates to hinder the learning of the target class.</p>
<p>By adjusting this parameter, adaptive enhancement of the target class information learning is achieved, mitigating the influence of soft labels when the teacher model provides suboptimal predictions. This enables models with different structures to discount biases in learning capabilities, thereby enhancing the overall distillation effect.</p>
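<p>The relationship between Eqs. (6) and (7) can be checked numerically with a small Python sketch. It assumes that the expectation terms are read as plain sums over classes (the reading under which the two lines of Eq. (6) are equal), that both models output already-normalized class probabilities for a single pixel, and that the example probabilities, exponent values, and function names are purely illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def kd_loss(p_teacher, p_student, target):
    """First line of Eq. (6): hard-label term plus teacher-weighted soft term."""
    hard_term = -math.log(p_student[target])
    soft_term = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return hard_term + soft_term


def ofa_loss(p_teacher, p_student, target, theta=1.0):
    """First line of Eq. (7): the target-class weight (1 + p_T) is raised to theta."""
    target_term = -((1.0 + p_teacher[target]) ** theta) * math.log(p_student[target])
    other_terms = -sum(
        pt * math.log(ps)
        for cls, (pt, ps) in enumerate(zip(p_teacher, p_student))
        if cls != target
    )
    return target_term + other_terms


# Hypothetical three-class probabilities for one pixel; class 0 is the target.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]

print(kd_loss(teacher_probs, student_probs, target=0))
print(ofa_loss(teacher_probs, student_probs, target=0, theta=1.0))  # same as kd_loss
print(ofa_loss(teacher_probs, student_probs, target=0, theta=2.0))  # larger target weight
```
]]></preformat>
<p>With an exponent of 1 the two losses coincide, and for exponents above 1 the extra weight on the target class grows with the teacher confidence, which matches the adaptive behavior described above.</p>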
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Deep Aggregation Atrous Convolution Pyramid Pooling</title>
<p>Accurate segmentation of street scene images requires balancing features across multiple scales. This allows the model to simultaneously recognize both large and small objects while enhancing its ability to perceive objects of varying sizes. To achieve this, a novel DASP is proposed to efficiently and accurately extract features at different scales. Its structure is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The detailed structure of Deep Aggregation Atrous Convolution Pyramid Pooling (DASP). DConv represents an atrous convolution. The shortcut means a <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-4.tif"/>
</fig>
<p>To address the conflict between multi-scale inference and full-resolution dense prediction, a common approach [<xref ref-type="bibr" rid="ref-3">3</xref>] is to obtain a global view through downsampling layers and use repeated upsampling to restore the lost resolution. However, features at different scales must then be upsampled before they can be concatenated, and the extensive use of upsampling operations significantly increases the computational cost. Moreover, the information about objects lost during downsampling cannot be fully recovered through upsampling, which causes suboptimal performance in semantic segmentation tasks. Atrous convolutions can expand the receptive field without reducing resolution, thereby eliminating the need for costly upsampling operations. Therefore, atrous convolutions with different dilation rates are used instead of standard convolutions to obtain multi-scale features in this method. Three dilated convolution layers with kernel size <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> are assigned dilation rates of <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>6</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mn>12</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mn>18</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, following the empirically effective configuration in prior works such as DeepLab [<xref ref-type="bibr" rid="ref-4">4</xref>]. This design also draws on the principle of Hybrid Dilated Convolution [<xref ref-type="bibr" rid="ref-31">31</xref>], which avoids using identical dilation rates across layers to ensure a complete receptive field.
When several atrous convolution layers are used, neighboring output pixels are computed from mutually independent subsets of input pixels, so they lack dependency on one another. To mitigate this, a deep aggregation approach is adopted, which can be formulated as:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x003C;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>n</mml:mi><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>U</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>x</mml:mi></mml:math></inline-formula> is the input and <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the features of the <italic>i</italic>-th layer. <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents a <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution, <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents a <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolution, <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denotes a <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> atrous convolution, <italic>U</italic> represents the upsampling operation, and <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> stands for global adaptive average pooling. <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>n</mml:mi></mml:math></inline-formula> represents the number of feature extraction layers. To enable the network to capture the overall feature of each channel from a global perspective, a <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> adaptive average pooling layer is also employed.</p>
<p>After multi-scale feature extraction, adjacent atrous convolution layers that lack correlation are concatenated and passed through a <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> standard convolution layer to extract more local information. This process produces multiple feature maps that are both correlated and of different scales. These feature maps are then concatenated and passed through a <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution layer to compress the channels to the expected dimension. Additionally, to mitigate gradient vanishing and explosion, a skip connection is introduced to retain the semantic information from the input before entering the module. Although our module employs multiple convolution layers and complex feature fusion methods, the input resolution of the DASP is only <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>32</mml:mn></mml:math></inline-formula> of the original image resolution. Even at an input size of <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1024</mml:mn></mml:math></inline-formula>, the largest feature map resolution is only <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mn>32</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>32</mml:mn></mml:math></inline-formula>, indicating that DASP imposes a limited impact on inference speed.</p>
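<p>The geometry of the atrous branches can be worked through with the standard convolution size formulas. The Python sketch below computes, for the three dilation rates used in DASP, the spatial extent covered by a 3 &#x00D7; 3 atrous kernel, the padding that keeps the resolution unchanged at stride 1, and the resulting output size for a 1024 &#x00D7; 1024 image reduced by a factor of 32. The function names are hypothetical, and the use of size-preserving padding at stride 1 is an assumption made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that preserves the spatial size at stride 1 (odd kernel sizes)."""
    return dilation * (kernel_size - 1) // 2


def conv_output_size(size, kernel_size, dilation, padding, stride=1):
    """Standard output-size formula for a (dilated) convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


RATES = [6, 12, 18]         # dilation rates stated for DASP
FEATURE_SIZE = 1024 // 32   # DASP input side length for a 1024 x 1024 image

extents = [effective_kernel(3, rate) for rate in RATES]
paddings = [same_padding(3, rate) for rate in RATES]
outputs = [
    conv_output_size(FEATURE_SIZE, 3, rate, pad)
    for rate, pad in zip(RATES, paddings)
]

print(FEATURE_SIZE)  # 32
print(extents)       # [13, 25, 37]
print(paddings)      # [6, 12, 18]
print(outputs)       # [32, 32, 32]
```
]]></preformat>
<p>Each branch therefore keeps the 32 &#x00D7; 32 resolution while sampling from a progressively wider neighborhood, so the branch outputs can be combined without any resizing step.</p>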
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>In this section, experiments are conducted on Cityscapes [<xref ref-type="bibr" rid="ref-32">32</xref>], CamVid [<xref ref-type="bibr" rid="ref-33">33</xref>], and ADE20K [<xref ref-type="bibr" rid="ref-34">34</xref>]. First, the datasets and implementation details of the experiments are introduced, followed by a comparison with state-of-the-art models [<xref ref-type="bibr" rid="ref-35">35</xref>&#x2013;<xref ref-type="bibr" rid="ref-37">37</xref>]. Finally, ablation studies are performed on Cityscapes.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets and Implementation Details</title>
<p>Cityscapes is a dataset that focuses on analyzing street scenes. It is divided into three parts: training, validation, and test sets, containing 2975, 500, and 1525 images, respectively. We adopt 19 common categories (such as roads, cars, and pedestrians) for the semantic segmentation task. For model training, AdamW is chosen as the optimizer, with an initial learning rate set to 0.0004 and a weight decay of 0.0125. A poly learning rate strategy with a power of 0.9 is used to reduce the learning rate, and linear warm-up is applied at the beginning of training. The random scaling range is set between 0.25 and 1.5, and random cropping of sizes <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>512</mml:mn></mml:math></inline-formula> or <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1024</mml:mn></mml:math></inline-formula> is applied to both the images and their corresponding ground truth annotations. Additionally, images are randomly flipped horizontally with a probability of 0.5.</p>
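<p>The learning rate schedule described above can be sketched as a single Python function. The initial learning rate of 0.0004 and the power of 0.9 are taken from the text; the total number of steps, the warm-up length, the warm-up start ratio, and the choice to measure poly progress from the end of warm-up are assumptions made only for this illustration, since they are not specified here.</p>
<preformat preformat-type="code"><![CDATA[
```python
def poly_lr(step, max_steps, base_lr=0.0004, power=0.9,
            warmup_steps=0, warmup_ratio=1e-6):
    """Poly decay with an optional linear warm-up.

    warmup_steps and warmup_ratio are hypothetical; the text only states that
    linear warm-up is applied at the beginning of training.
    """
    if step < warmup_steps:
        # Linear ramp from base_lr * warmup_ratio up to base_lr.
        alpha = step / warmup_steps
        return base_lr * (warmup_ratio + (1.0 - warmup_ratio) * alpha)
    # Poly decay, with progress measured from the end of warm-up.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * (1.0 - min(progress, 1.0)) ** power


MAX_STEPS, WARMUP = 1000, 100  # hypothetical schedule length
for step in (0, 50, 100, 550, 1000):
    print(step, poly_lr(step, MAX_STEPS, warmup_steps=WARMUP))
```
]]></preformat>
<p>In this sketch the rate rises linearly to 0.0004 at the end of warm-up and then decays to zero at the final step.</p>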
<p>CamVid is a dataset designed for street scene understanding, including categories such as roads, cars, bicycles, and others. It contains 701 densely annotated frames with a resolution of <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mn>960</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>720</mml:mn></mml:math></inline-formula>. The images are split into 367 training, 101 validation, and 233 test images. The training is conducted on the training and validation datasets. During training, the initial learning rate is set to 0.001, and the images are randomly cropped to <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mn>960</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>720</mml:mn></mml:math></inline-formula>. All other training configurations are kept consistent with those for Cityscapes.</p>
<p>ADE20K is a scene parsing dataset containing 150 semantic categories across a wide range of indoor and outdoor environments, such as buildings, furniture, animals, and roads. It is divided into 20K, 2K, and 3K images for training, validation, and testing, respectively. The initial learning rate is set to 0.0005 and the weight decay to 0.01. Images are randomly cropped to <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mn>512</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>512</mml:mn></mml:math></inline-formula> for data augmentation. The remaining training settings follow those used for Cityscapes.</p>
<p>To validate the effectiveness of the method, a base model of comparable size to RTFormer-B/DDRNet-23 called BSDNet-B is constructed, along with a smaller variant called BSDNet-S. The CNN backbone is first pre-trained on ImageNet, with SegFormer chosen as the transformer branch, followed by fine-tuning the model on semantic segmentation datasets. All our models use Cross-Entropy Loss (CE Loss) [<xref ref-type="bibr" rid="ref-38">38</xref>] to compute the loss between predictions and ground truth.</p>
<p>For performance evaluation, mean Intersection over Union (mIoU) and Frames Per Second (FPS) are adopted to measure accuracy and inference speed, respectively. Experiments are conducted on an NVIDIA A6000 with 48 GB of memory and an Intel<inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>&#x00AE;</mml:mi></mml:math></inline-formula> Xeon<inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>&#x00AE;</mml:mi></mml:math></inline-formula> Gold 6226R CPU @ 2.9 GHz, using Python 3.8 and PyTorch 1.11.0. To ensure fair comparisons, the inference speed of all methods is measured on the same NVIDIA A6000, and FPS values are compared at identical input resolutions.</p>
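<p>As a reference for the accuracy metric, the following minimal sketch computes mIoU from a confusion matrix over flattened label maps. The function name, the toy inputs, and the ignore label of 255 (a common convention for unlabeled pixels) are assumptions for illustration and are not taken from the evaluation code used in this work.</p>
<code language="python">
```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """mIoU over flat sequences of predicted and ground-truth class ids."""
    # conf[t][p] counts pixels of true class t that were predicted as class p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t != ignore_index:  # unlabeled pixels do not contribute
            conf[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(row[c] for row in conf) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]  # the last pixel is ignored
    print(round(mean_iou(pred, target, num_classes=3), 4))
```
</code>
<p>In the toy example the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18. In practice the confusion matrix is accumulated over the whole validation set before the per-class ratios are taken.</p>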
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Comparison with State-of-the-Art Methods</title>
<p><bold>Results on Cityscapes:</bold> In <xref ref-type="table" rid="table-1">Table 1</xref>, underlining indicates the best mIoU among the compared methods, while bold highlights our best mIoU and FPS scores. BSDNet-B-Seg100 achieves <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mn>81.7</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU at 70.6 FPS, representing our best accuracy on Cityscapes. The FPS improvement primarily stems from the dilated convolutions in DASP, which effectively reduce computational complexity. Meanwhile, the semantic information captured by the Transformer branch significantly enhances BSDNet&#x2019;s accuracy. Additionally, <xref ref-type="fig" rid="fig-5">Fig. 5</xref> presents visualization results on the Cityscapes dataset. Compared to DDRNet and SCTNet, BSDNet not only provides more accurate predictions for large-area categories such as road and vegetation (in yellow boxes) but also preserves finer details for small objects like traffic lights and traffic signs (in white boxes). This demonstrates that BSDNet effectively captures high-quality long-range context while retaining fine details of small-scale objects.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison with other state-of-the-art methods on Cityscapes. The suffixes Seg50, Seg75, and Seg100 after the method names indicate input sizes of <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>512</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mn>1536</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>768</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mn>2048</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1024</mml:mn></mml:math></inline-formula>, respectively</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Method</th>
<th align="center">Reference</th>
<th align="center">Params</th>
<th align="center">Resolution</th>
<th align="center">FPS</th>
<th align="center">mIoU (%)</th>
<th align="center">FLOPs (G)</th>
</tr>
</thead>
<tbody>
<tr>
<td>STDC2-Seg75</td>
<td>CVPR-2021</td>
<td>22.2 M</td>
<td>1536 <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>94.3</td>
<td>77.0</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>STDC2-Seg50</td>
<td>CVPR-2021</td>
<td>22.2 M</td>
<td>1024 <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 512</td>
<td>108.6</td>
<td>74.2</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SegNext-T-Seg100</td>
<td>NeurIPS-2022</td>
<td>4.3 M</td>
<td>2048 <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>31.1</td>
<td>79.8</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SegNext-T-Seg75</td>
<td>NeurIPS-2022</td>
<td>4.3 M</td>
<td>1536 <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>48.6</td>
<td>78.0</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DDRNet-23-S</td>
<td>TIP-2022</td>
<td>5.7 M</td>
<td>2048 <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>116.7</td>
<td>77.8</td>
<td>31.6</td>
</tr>
<tr>
<td>DDRNet-23</td>
<td>TIP-2022</td>
<td>20.1 M</td>
<td>2048 <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>66.7</td>
<td>79.5</td>
<td>143.1</td>
</tr>
<tr>
<td>RTFormer-S</td>
<td>NeurIPS-2022</td>
<td>4.8 M</td>
<td>2048 <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>99.6</td>
<td>76.3</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>RTFormer-B</td>
<td>NeurIPS-2022</td>
<td>16.8 M</td>
<td>2048 <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>58.2</td>
<td>79.3</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SeaFormer-B-Seg100</td>
<td>ICLR-2023</td>
<td>8.6 M</td>
<td>2048 <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>42.5</td>
<td>77.7</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SeaFormer-B-Seg50</td>
<td>ICLR-2023</td>
<td>8.6 M</td>
<td>1024 <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 512</td>
<td>52.2</td>
<td>72.2</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>AFFormer-B-Seg100</td>
<td>AAAI-2023</td>
<td>3.0 M</td>
<td>2048 <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>31.4</td>
<td>78.7</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>AFFormer-B-Seg75</td>
<td>AAAI-2023</td>
<td>3.0 M</td>
<td>1536 <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>41.6</td>
<td>76.5</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>AFFormer-B-Seg50</td>
<td>AAAI-2023</td>
<td>3.0 M</td>
<td>1024 <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 512</td>
<td>52.5</td>
<td>73.5</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PIDNet-S</td>
<td>CVPR-2023</td>
<td>7.6 M</td>
<td>2048 <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>103.2</td>
<td>78.8</td>
<td>47.7</td>
</tr>
<tr>
<td>PIDNet-M</td>
<td>CVPR-2023</td>
<td>34.4 M</td>
<td>2048 <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>42.8</td>
<td>80.1</td>
<td>197.4</td>
</tr>
<tr>
<td>SCTNet-S-Seg75</td>
<td>AAAI-2024</td>
<td>4.7 M</td>
<td>1536 <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>153.6</td>
<td>76.1</td>
<td>33.7</td>
</tr>
<tr>
<td>SCTNet-B-Seg100</td>
<td>AAAI-2024</td>
<td>17.4 M</td>
<td>2048 <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>68.3</td>
<td><underline>80.5</underline></td>
<td>48.3</td>
</tr>
<tr>
<td>BSDNet-S-Seg50</td>
<td>Ours</td>
<td>4.4 M</td>
<td>1024 <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 512</td>
<td><bold>178.4</bold></td>
<td>73.5</td>
<td>31.6</td>
</tr>
<tr>
<td>BSDNet-S-Seg75</td>
<td>Ours</td>
<td>4.4 M</td>
<td>1536 <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>175.8</td>
<td>77.0</td>
<td>31.6</td>
</tr>
<tr>
<td>BSDNet-B-Seg50</td>
<td>Ours</td>
<td>16.1 M</td>
<td>1024 <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 512</td>
<td>168.0</td>
<td>77.4</td>
<td>48.1</td>
</tr>
<tr>
<td>BSDNet-B-Seg75</td>
<td>Ours</td>
<td>16.1 M</td>
<td>1536 <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 768</td>
<td>116.3</td>
<td>80.3</td>
<td>48.1</td>
</tr>
<tr>
<td>BSDNet-B-Seg100</td>
<td>Ours</td>
<td>16.1 M</td>
<td>2048 <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1024</td>
<td>70.6</td>
<td><bold>81.7</bold></td>
<td>48.1 G</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Visualization results on Cityscapes. The five columns from left to right are the input image, ground truth, output of DDRNet-23, output of SCTNet-B, and output of BSDNet-B</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-5.tif"/>
</fig>
<p><bold>Results on CamVid:</bold> Due to the lower pixel resolution in CamVid, the inference speed is generally higher than on Cityscapes. The results on the dataset are shown in <xref ref-type="table" rid="table-2">Table 2</xref>. With an input resolution of <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mn>720</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>960</mml:mn></mml:math></inline-formula>, BSDNet-B achieves the highest mIoU of <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mn>84.7</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> at 135.1 FPS, outperforming RTFormer-B by <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mn>2.2</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>. Although DDRNet-23-S achieves the second-highest FPS at 253.0 FPS, it sacrifices segmentation accuracy by omitting the pretraining process to achieve higher inference speed. This further demonstrates that our method strikes a better balance between speed and accuracy.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison with other state-of-the-art methods on CamVid. The FPS is measured with an input resolution of <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mn>720</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>960</mml:mn></mml:math></inline-formula></title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Params <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mo mathvariant="bold" stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>FPS <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>mIoU <bold>(%)</bold> <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>STDC1</td>
<td>14.2 M</td>
<td>155.8</td>
<td>73.0</td>
</tr>
<tr>
<td>STDC2</td>
<td>22.2 M</td>
<td>123.5</td>
<td>73.9</td>
</tr>
<tr>
<td>DDRNet-23-S</td>
<td>5.7 M</td>
<td>253.0</td>
<td>74.7</td>
</tr>
<tr>
<td>DDRNet-23</td>
<td>20.1 M</td>
<td>126.5</td>
<td>76.3</td>
</tr>
<tr>
<td>RTFormer-S</td>
<td>4.8 M</td>
<td>241.1</td>
<td>81.4</td>
</tr>
<tr>
<td>RTFormer-B</td>
<td>16.8 M</td>
<td>127.0</td>
<td>82.5</td>
</tr>
<tr>
<td>BSDNet-S</td>
<td>4.4 M</td>
<td>254.6</td>
<td>83.4</td>
</tr>
<tr>
<td>BSDNet-B</td>
<td>16.1 M</td>
<td>135.1</td>
<td>84.7</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Results on ADE20K:</bold> To further demonstrate the generalization ability and effectiveness of BSDNet, experiments are conducted on ADE20K. As shown in <xref ref-type="table" rid="table-3">Table 3</xref>, BSDNet-B achieves the best mIoU of <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mn>44.6</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> at 180.7 FPS, while BSDNet-S reaches the highest speed of 189.4 FPS. These results indicate that BSDNet not only performs well in street scenes but also generalizes to broader scene types. Furthermore, while the mIoU of BSDNet-S is close to that of SeaFormer-B, its 189.4 FPS is more than three times the 50.6 FPS of SeaFormer-B, which demonstrates the real-time performance of BSDNet. Visualization results on ADE20K are presented in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. Compared to SCTNet-B, BSDNet-B yields more accurate segmentation along object boundaries and more coherent predictions for large-area objects, further confirming its generalization capability in complex scenes.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison with other state-of-the-art methods on ADE20K. The FPS is measured with an input resolution of <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mn>512</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>512</mml:mn></mml:math></inline-formula></title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Params<inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mo mathvariant="bold" stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>FPS<inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>mIoU (%)<inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>PSPNet</td>
<td>13.7 M</td>
<td>62.7</td>
<td>29.6</td>
</tr>
<tr>
<td>SegFormer-B0</td>
<td>3.8 M</td>
<td>94.3</td>
<td>37.4</td>
</tr>
<tr>
<td>TopFormer-B</td>
<td>5.1 M</td>
<td>106.4</td>
<td>39.2</td>
</tr>
<tr>
<td>SeaFormer-B</td>
<td>8.6 M</td>
<td>50.6</td>
<td>41.0</td>
</tr>
<tr>
<td>SegNext-T</td>
<td>4.3 M</td>
<td>68.3</td>
<td>41.1</td>
</tr>
<tr>
<td>AFFormer-B</td>
<td>3.0 M</td>
<td>55.5</td>
<td>41.8</td>
</tr>
<tr>
<td>RTFormer-S</td>
<td>4.8 M</td>
<td>105.6</td>
<td>36.7</td>
</tr>
<tr>
<td>RTFormer-B</td>
<td>16.8 M</td>
<td>104.7</td>
<td>42.1</td>
</tr>
<tr>
<td>SCTNet-S</td>
<td>4.7 M</td>
<td>174.7</td>
<td>37.7</td>
</tr>
<tr>
<td>SCTNet-B</td>
<td>17.4 M</td>
<td>170.4</td>
<td>43.0</td>
</tr>
<tr>
<td>BSDNet-S</td>
<td>4.4 M</td>
<td>189.4</td>
<td>40.3</td>
</tr>
<tr>
<td>BSDNet-B</td>
<td>16.1 M</td>
<td>180.7</td>
<td>44.6</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Visualization results on ADE20K. The four columns from left to right are the input image, ground truth, output of SCTNet-B, and output of BSDNet-B</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-6.tif"/>
</fig>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Ablation Study</title>
<p>This part verifies the effectiveness of FCB, SIDM, and DASP, and conducts ablation experiments on the proposed modules.</p>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Comparison on Different Types of Blocks</title>
<p>Five different types of blocks were used to replace the proposed FCB in the model, and all variants were trained without ImageNet pretraining to shorten the experiments. As shown in <xref ref-type="table" rid="table-4">Table 4</xref>, the proposed FCB outperforms the traditional ResBlock and surpasses the closest competitor, CFBlock, by <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mn>0.2</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU. Even when compared to the lightweight SegFormer Block, our method with FCB still runs about 10 FPS faster. This improvement primarily results from the more efficient convolution operations in FCB, which replace the dot-product operation in the squeezing axial attention of the SegFormer Block and thereby reduce computational complexity.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comparison of different blocks</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Block</th>
<th>FPS<inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>mIoU<inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>Params<inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mo mathvariant="bold" stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>ResBlock</td>
<td>69.2</td>
<td>78.2</td>
<td>15.1 M</td>
</tr>
<tr>
<td>SegFormerBlock</td>
<td>60.3</td>
<td>78.5</td>
<td>20.3 M</td>
</tr>
<tr>
<td>GFABlock</td>
<td>68.5</td>
<td>79.8</td>
<td>15.7 M</td>
</tr>
<tr>
<td>MSCANBlock</td>
<td>64.1</td>
<td>80.1</td>
<td>19.0 M</td>
</tr>
<tr>
<td>CFBlock</td>
<td>69.3</td>
<td>80.6</td>
<td>16.9 M</td>
</tr>
<tr>
<td>FCB (Ours)</td>
<td>70.6</td>
<td>80.8</td>
<td>16.1 M</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Comparison between Different Multi-Scale Feature Extraction Modules</title>
<p>Two multi-scale feature extraction modules are selected to compare with our DASP. As shown in <xref ref-type="table" rid="table-5">Table 5</xref>, the model achieves 71.2 FPS with ASPP, slightly higher than DASP. However, DASP captures the correlations between different feature layers that ASPP lacks, resulting in a <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mn>3.4</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> improvement in mIoU. Compared to DAPPM, DASP improves speed by 5.3 FPS. By eliminating the extensive upsampling operations during the feature concatenation process, the atrous convolutions introduced in DASP effectively reduce computational complexity. Additionally, the two different types of pooling operations used in the module are evaluated to select the most effective configuration.</p>
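<p>To make the role of atrous (dilated) convolution concrete, the following sketch computes the spatial extent covered by a dilated kernel, the padding that preserves the feature-map size at stride 1, and the number of weights, which does not depend on the dilation rate. This illustrates a general property of dilated convolution rather than the exact DASP configuration; the dilation rates and channel widths below are hypothetical.</p>
<code language="python">
```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that keeps the output size equal to the input size at stride 1."""
    return (effective_kernel(kernel_size, dilation) - 1) // 2


def conv_weights(kernel_size, in_channels, out_channels):
    """Weight count of a 2D convolution (bias omitted); independent of dilation."""
    return kernel_size * kernel_size * in_channels * out_channels


if __name__ == "__main__":
    # Hypothetical dilation rates and channel widths, for illustration only.
    for d in (1, 2, 4, 8):
        print(d, effective_kernel(3, d), same_padding(3, d),
              conv_weights(3, 128, 128))
```
</code>
<p>For a 3&#x00D7;3 kernel, a dilation rate of 4 covers a 9&#x00D7;9 window with the same nine weights per channel pair, so larger context can be gathered on a single feature map without resizing it first.</p>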
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Comparison between DASP and other multi-scale feature extraction modules</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Module</th>
<th>FPS<inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>mIoU<inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mo mathvariant="bold" stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>Params<inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mo mathvariant="bold" stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>ASPP</td>
<td>71.2</td>
<td>78.3</td>
<td>17.3 M</td>
</tr>
<tr>
<td>DAPPM</td>
<td>65.3</td>
<td>80.7</td>
<td>20.2 M</td>
</tr>
<tr>
<td>DASP&#x002B;Max-Pooling</td>
<td>67.4</td>
<td>79.3</td>
<td>19.0 M</td>
</tr>
<tr>
<td>DASP&#x002B;Avg-Pooling (Ours)</td>
<td>70.6</td>
<td>81.7</td>
<td>16.1 M</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Validation of the Effectiveness of SIDM</title>
<p>As shown in <xref ref-type="table" rid="table-6">Table 6</xref>, applying either the CWD loss or the OFA loss individually within SIDM brings only limited performance improvement. When both losses are combined, the performance is further improved to 81.7<inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU, demonstrating the complementarity of these two distillation strategies. Incorporating SIDM into BSDNet leads to a <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mn>0.6</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> increase in mIoU, whereas its integration with DDRNet yields a smaller gain of <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mn>0.3</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>. This is primarily because DDRNet has already obtained sufficient semantic information from multiple bilateral fusions. This also validates that our SIDM module effectively promotes the extraction of more semantic information from the branches.</p>
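<p>For readers unfamiliar with channel-wise distillation, the sketch below shows one common formulation of a CWD-style loss, assuming that CWD here refers to channel-wise knowledge distillation: each channel of the student and teacher feature maps is turned into a distribution over spatial positions with a temperature-scaled softmax, and the KL divergence from teacher to student is averaged over channels. It is not the SIDM implementation; the function names, the nested-list layout ([channels][positions]), and the temperature value are assumptions for illustration.</p>
<code language="python">
```python
import math


def log_softmax(values, temperature):
    """Log of the softmax over the spatial positions of one channel."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def cwd_loss(student, teacher, temperature=4.0):
    """Channel-wise KL(teacher || student) for [channels][positions] maps.

    The temperature default is a hypothetical choice for this sketch.
    """
    loss = 0.0
    for s_chan, t_chan in zip(student, teacher):
        log_s = log_softmax(s_chan, temperature)
        log_t = log_softmax(t_chan, temperature)
        # KL divergence between the two spatial distributions of this channel.
        loss += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Scale by temperature squared and average over channels.
    return temperature ** 2 * loss / len(student)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, 0.1, 0.0], [0.0, 0.0, 3.0, 1.0]]
    student = [[1.0, 1.0, 0.2, 0.1], [0.5, 0.0, 1.0, 1.0]]
    print(cwd_loss(student, teacher))
    print(cwd_loss(teacher, teacher))  # identical maps give zero loss
```
</code>
<p>Because the softmax is taken over positions within a channel, the loss compares where each channel is most active rather than the absolute activation values, which is one reason such a term can be combined with a logit-level objective.</p>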
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Validation of the effectiveness of SIDM and comparison across different models</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Seg100 <bold>(%)</bold></th>
<th>Seg75 <bold>(%)</bold></th>
<th>Seg50 <bold>(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td>SegNext-T</td>
<td>79.8</td>
<td>78.0</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SegNext-T&#x002B;SIDM</td>
<td>80.0</td>
<td>78.5</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DDRNet-23</td>
<td>79.5</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DDRNet-23&#x002B;SIDM</td>
<td>79.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SeaFormer-B</td>
<td>77.7</td>
<td>&#x2013;</td>
<td>72.2</td>
</tr>
<tr>
<td>SeaFormer-B&#x002B;SIDM</td>
<td>80.1</td>
<td>&#x2013;</td>
<td>73.0</td>
</tr>
<tr>
<td>SCTNet-B</td>
<td>80.5</td>
<td>79.8</td>
<td>76.5</td>
</tr>
<tr>
<td>SCTNet-B&#x002B;SIDM</td>
<td>80.9</td>
<td>80.3</td>
<td>77.4</td>
</tr>
<tr>
<td>BSDNet w/o SIDM</td>
<td>81.1</td>
<td>79.5</td>
<td>76.8</td>
</tr>
<tr>
<td>BSDNet&#x002B;CWD</td>
<td>81.4</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>BSDNet&#x002B;OFA</td>
<td>81.2</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>BSDNet&#x002B;SIDM (CWD&#x002B;OFA)</td>
<td>81.7</td>
<td>80.3</td>
<td>77.4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A fine-grained sensitivity analysis was conducted on the hyperparameter <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula>. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the model performance gradually improves as <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> increases, reaching the highest 81.7<inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU at <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>&#x03B8;</mml:mi><mml:mo>=</mml:mo><mml:mn>1.5</mml:mn></mml:math></inline-formula>. However, further increasing <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> to 1.6 results in a slight decline to 81.4<inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU, potentially due to over-augmentation leading to over-distillation. These results demonstrate that a reasonable setting of <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> facilitates effective alignment and learning of target class features, thereby enhancing the final semantic segmentation performance.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Sensitivity analysis of <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66803-fig-7.tif"/>
</fig>
</sec>
<sec id="s4_3_4">
<label>4.3.4</label>
<title>Ablation Study of the Components on BSDNet</title>
<p>As shown in <xref ref-type="table" rid="table-7">Table 7</xref>, replacing the Res block with FCB improves the segmentation accuracy across all input resolutions. This enhancement is primarily attributed to the attention mechanisms in FCB, which effectively capture details. Incorporating DASP results in a further <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mn>0.9</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU gain under Seg100, mainly due to its superior capability in capturing multi-scale object information, particularly with higher-resolution inputs. SIDM brings stable improvements at all input sizes, demonstrating the contribution of the semantic information learned from the Transformer branch.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Ablation study of components on Cityscapes. The FPS is measured with an input resolution of <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2048</mml:mn></mml:math></inline-formula>, denoted as Seg100</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Components</th>
<th>Seg100 <bold>(%)</bold></th>
<th>Seg75 <bold>(%)</bold></th>
<th>Seg50 <bold>(%)</bold></th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>80.2</td>
<td>78.9</td>
<td>75.7</td>
<td>68.3</td>
</tr>
<tr>
<td>&#x002B;FCB</td>
<td>80.4 (&#x002B;0.2)</td>
<td>79.1 (&#x002B;0.2)</td>
<td>76.1 (&#x002B;0.4)</td>
<td>68.1</td>
</tr>
<tr>
<td>&#x002B;DASP</td>
<td>81.3 (&#x002B;0.9)</td>
<td>79.7 (&#x002B;0.6)</td>
<td>76.8 (&#x002B;0.7)</td>
<td>70.5</td>
</tr>
<tr>
<td>&#x002B;SIDM</td>
<td>81.7 (&#x002B;0.4)</td>
<td>80.3 (&#x002B;0.6)</td>
<td>77.4 (&#x002B;0.6)</td>
<td>70.6</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>To efficiently perform semantic segmentation in complex street scenes, a bilateral-branch real-time semantic segmentation method based on semantic information distillation is proposed. It achieves 81.7<inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> mIoU at 70.6 FPS on Cityscapes. Our model introduces three key improvements. First, FCB effectively reduces the feature discrepancy between the two branches and aligns features with a greater focus on detail. Second, SIDM extracts high-quality semantic information from the Transformer branch at a lower cost, improving segmentation accuracy in street scenes. Third, the proposed DASP effectively captures multi-scale objects in complex street scenes, achieving more refined segmentation details at a lower cost. Extensive experiments show that BSDNet performs well on three datasets: Cityscapes, CamVid, and ADE20K. BSDNet enables real-time segmentation of street scenes, which can benefit the responsiveness and safety of autonomous driving systems. In future work, BSDNet will be transferred as a new baseline to other downstream tasks.</p>
</sec>
</body>
<back>
<ack>
<p>Thanks for the support from my teachers and friends during the writing of this thesis.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work is supported in part by the National Natural Science Foundation of China [Grant number 62471075], the Major Science and Technology Project Grant of the Chongqing Municipal Education Commission [Grant number KJZD-M202301901], and the Graduate Innovation Fund of Chongqing [Grant number gzlcx20253235].</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Methodologies, coding, and thesis writing, Huan Zeng; experimental guidance, thesis writing revision, Jianxun Zhang; dataset processing, Hongji Chen; experimental data organization, Xinwei Zhu. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Han</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>E</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Rtformer: efficient design for real-time semantic segmentation with transformer</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>7423</fpage>&#x2013;<lpage>36</lpage>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>LC</given-names></string-name>, <string-name><surname>Papandreou</surname> <given-names>G</given-names></string-name>, <string-name><surname>Schroff</surname> <given-names>F</given-names></string-name>, <string-name><surname>Adam</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Rethinking atrous convolution for semantic image segmentation</article-title>. <comment>arXiv:1706.05587. 2017</comment>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>W</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2022</year>;<volume>24</volume>(<issue>3</issue>):<fpage>3448</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2022.3228042</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Han</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>One-for-all: bridge the gap between heterogeneous architectures in knowledge distillation</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2023</year>;<volume>36</volume>:<fpage>79570</fpage>&#x2013;<lpage>82</lpage>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Long</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shelhamer</surname> <given-names>E</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Fully convolutional networks for semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Boston, MA, USA</publisher-loc>; <year>2015</year>. p. <fpage>3431</fpage>&#x2013;<lpage>40</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Badrinarayanan</surname> <given-names>V</given-names></string-name>, <string-name><surname>Kendall</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cipolla</surname> <given-names>R</given-names></string-name></person-group>. <article-title>SegNet: a deep convolutional encoder-decoder architecture for image segmentation</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2017</year>;<volume>39</volume>(<issue>12</issue>):<fpage>2481</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2016.2644615</pub-id>; <pub-id pub-id-type="pmid">28060704</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Pyramid scene parsing network</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Honolulu, HI, USA</publisher-loc>; <year>2017</year>. p. <fpage>2881</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>G</given-names></string-name>, <string-name><surname>Milan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Reid</surname> <given-names>I</given-names></string-name></person-group>. <article-title>RefineNet: multi-path refinement networks for high-resolution semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Honolulu, HI, USA</publisher-loc>; <year>2017</year>. p. <fpage>1925</fpage>&#x2013;<lpage>34</lpage>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>ICNet for real-time semantic segmentation on high-resolution images</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2018</year>. p. <fpage>405</fpage>&#x2013;<lpage>20</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Sang</surname> <given-names>N</given-names></string-name></person-group>. <article-title>BiSeNet: bilateral segmentation network for real-time semantic segmentation</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2018</year>. p. <fpage>325</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Sang</surname> <given-names>N</given-names></string-name></person-group>. <article-title>BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation</article-title>. <source>Int J Comput Vis</source>. <year>2021</year>;<volume>129</volume>(<issue>11</issue>):<fpage>3051</fpage>&#x2013;<lpage>68</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-021-01515-2</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>S</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chai</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Rethinking BiSeNet for real-time semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Nashville, TN, USA</publisher-loc>; <year>2021</year>. p. <fpage>9716</fpage>&#x2013;<lpage>25</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>TopFormer: token pyramid transformer for mobile semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>New Orleans, LA, USA</publisher-loc>; <year>2022</year>. p. <fpage>12083</fpage>&#x2013;<lpage>93</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wan</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>SeaFormer: squeeze-enhanced axial transformer for mobile semantic segmentation</article-title>. In: <conf-name>The Eleventh International Conference on Learning Representations</conf-name>; <year>2023 May 1&#x2013;5</year>; <comment>Kigali, Rwanda</comment>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cord</surname> <given-names>M</given-names></string-name>, <string-name><surname>Douze</surname> <given-names>M</given-names></string-name>, <string-name><surname>Massa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Sablayrolles</surname> <given-names>A</given-names></string-name>, <string-name><surname>J&#x00E9;gou</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Training data-efficient image transformers &#x0026; distillation through attention</article-title>. In: <conf-name>International Conference on Machine Learning</conf-name>. <publisher-name>PMLR</publisher-name>; <year>2021</year>. p. <fpage>10347</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>T</given-names></string-name>, <string-name><surname>You</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Knowledge distillation from a stronger teacher</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>33716</fpage>&#x2013;<lpage>27</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Romero</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ballas</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kahou</surname> <given-names>SE</given-names></string-name>, <string-name><surname>Chassang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Gatta</surname> <given-names>C</given-names></string-name>, <string-name><surname>Bengio</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>FitNets: hints for thin deep nets</article-title>. <comment>arXiv:1412.6550. 2014</comment>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Xie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shuai</surname> <given-names>B</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>JF</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>WS</given-names></string-name></person-group>. <article-title>Improving fast segmentation with teacher-student learning</article-title>. <comment>arXiv:1810.08476. 2018</comment>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Structured knowledge distillation for semantic segmentation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Long Beach, CA, USA</publisher-loc>; <year>2019</year>. p. <fpage>2604</fpage>&#x2013;<lpage>13</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Intra-class feature variation distillation for semantic segmentation</article-title>. In: <conf-name>Computer Vision&#x2013;ECCV 2020: 16th European Conference</conf-name>; <comment>2020 Aug 23&#x2013;28</comment>; <publisher-loc>Glasgow, UK</publisher-loc>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>346</fpage>&#x2013;<lpage>62</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Channel-wise knowledge distillation for dense prediction</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Montreal, QC, Canada</publisher-loc>; <year>2021</year>. p. <fpage>5311</fpage>&#x2013;<lpage>20</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yuan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Ed-ged: nighttime image semantic segmentation based on enhanced detail and bidirectional guidance</article-title>. <source>Comput Mater Contin</source>. <year>2024</year>;<volume>80</volume>(<issue>2</issue>):<fpage>2443</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.052285</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xiang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Bilateral dual-residual real-time semantic segmentation network</article-title>. <source>Comput Mater Contin</source>. <year>2025</year>;<volume>83</volume>(<issue>1</issue>):<fpage>497</fpage>&#x2013;<lpage>515</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2025.060244</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>W</given-names></string-name></person-group>. <article-title>MMSMCNet: modal memory sharing and morphological complementary networks for RGB-T urban scene semantic segmentation</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2023</year>;<volume>33</volume>(<issue>12</issue>):<fpage>7096</fpage>&#x2013;<lpage>108</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcsvt.2023.3275314</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Jian</surname> <given-names>B</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Multiattentive perception and multilayer transfer network using knowledge distillation for RGB-D indoor scene parsing</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2025</year>:<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tnnls.2025.3575088</pub-id>; <pub-id pub-id-type="pmid">40493458</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Song</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Triple-view knowledge distillation for semi-supervised semantic segmentation</article-title>. <comment>arXiv:2309.12557. 2023</comment>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Bridging knowledge distillation gap for few-sample unsupervised semantic segmentation</article-title>. <source>Inf Sci</source>. <year>2024</year>;<volume>673</volume>(<issue>4</issue>):<fpage>120714</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2024.120714</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Sang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>C</given-names></string-name></person-group>. <article-title>SCTNet: single-branch CNN with transformer semantic information for real-time segmentation</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2024</year>;<volume>38</volume>(<issue>6</issue>):<fpage>6378</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i6.28457</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abuqaddom</surname> <given-names>I</given-names></string-name>, <string-name><surname>Mahafzah</surname> <given-names>BA</given-names></string-name>, <string-name><surname>Faris</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Oriented stochastic loss descent algorithm to train very deep multi-layer neural networks without vanishing gradients</article-title>. <source>Knowl Based Syst</source>. <year>2021</year>;<volume>230</volume>:<fpage>107391</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.knosys.2021.107391</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>D</given-names></string-name>, <string-name><surname>Han</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Learning efficient vision transformers via fine-grained manifold distillation</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>9164</fpage>&#x2013;<lpage>75</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>P</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Understanding convolution for semantic segmentation</article-title>. In: <conf-name>2018 IEEE Winter Conference on Applications of Computer Vision (WACV)</conf-name>. <publisher-loc>Lake Tahoe, NV, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2018</year>. p. <fpage>1451</fpage>&#x2013;<lpage>60</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cordts</surname> <given-names>M</given-names></string-name>, <string-name><surname>Omran</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ramos</surname> <given-names>S</given-names></string-name>, <string-name><surname>Rehfeld</surname> <given-names>T</given-names></string-name>, <string-name><surname>Enzweiler</surname> <given-names>M</given-names></string-name>, <string-name><surname>Benenson</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>The Cityscapes dataset for semantic urban scene understanding</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Las Vegas, NV, USA</publisher-loc>; <year>2016</year>. p. <fpage>3213</fpage>&#x2013;<lpage>23</lpage>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Brostow</surname> <given-names>GJ</given-names></string-name>, <string-name><surname>Shotton</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fauqueur</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cipolla</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Segmentation and recognition using structure from motion point clouds</article-title>. In: <conf-name>Computer Vision&#x2013;ECCV 2008: 10th European Conference on Computer Vision</conf-name>; <comment>2008 Oct 12&#x2013;18; Marseille, France</comment>. <publisher-loc>Berlin/Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2008</year>. p. <fpage>44</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Puig</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>T</given-names></string-name>, <string-name><surname>Fidler</surname> <given-names>S</given-names></string-name>, <string-name><surname>Barriuso</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Semantic understanding of scenes through the ADE20K dataset</article-title>. <source>Int J Comput Vis</source>. <year>2019</year>;<volume>127</volume>(<issue>3</issue>):<fpage>302</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-018-1140-0</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Head-free lightweight semantic segmentation with linear transformer</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2023</year>;<volume>37</volume>(<issue>1</issue>):<fpage>516</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v37i1.25126</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>MH</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>CZ</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>SM</given-names></string-name></person-group>. <article-title>SegNeXt: rethinking convolutional attention design for semantic segmentation</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>1140</fpage>&#x2013;<lpage>56</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Bhattacharyya</surname> <given-names>SP</given-names></string-name></person-group>. <article-title>PIDNet: a real-time semantic segmentation network inspired by PID controllers</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Vancouver, BC, Canada</publisher-loc>; <year>2023</year>. p. <fpage>19529</fpage>&#x2013;<lpage>39</lpage>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Milletari</surname> <given-names>F</given-names></string-name>, <string-name><surname>Navab</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ahmadi</surname> <given-names>SA</given-names></string-name></person-group>. <article-title>V-Net: fully convolutional neural networks for volumetric medical image segmentation</article-title>. In: <conf-name>2016 Fourth International Conference on 3D Vision (3DV)</conf-name>. <publisher-loc>Stanford, CA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>565</fpage>&#x2013;<lpage>71</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>