<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">59733</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.059733</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>DMHFR: Decoder with Multi-Head Feature Receptors for Tract Image Segmentation</article-title>
<alt-title alt-title-type="left-running-head">DMHFR: Decoder with Multi-Head Feature Receptors for Tract Image Segmentation</alt-title>
<alt-title alt-title-type="right-running-head">DMHFR: Decoder with Multi-Head Feature Receptors for Tract Image Segmentation</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Huang</surname><given-names>Jianuo</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Lai</surname><given-names>Bohan</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Qiu</surname><given-names>Weiye</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Xu</surname><given-names>Caixu</given-names></name><xref ref-type="aff" rid="aff-4">4</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>He</surname><given-names>Jie</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-5">5</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>he.jie@zsxmhospital.com</email></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Endoscopy Center, Zhongshan Hospital (Xiamen), Fudan University</institution>, <addr-line>Xiamen, 361015</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Computing and Data Science, Xiamen University Malaysia</institution>, <addr-line>Sepang, 43900</addr-line>, <country>Malaysia</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Computer Science and Technology, Tongji University</institution>, <addr-line>Shanghai, 200092</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University</institution>, <addr-line>Wuzhou, 543002</addr-line>, <country>China</country></aff>
<aff id="aff-5"><label>5</label><institution>Xiamen Clinical Research Center for Cancer Therapy</institution>, <addr-line>Xiamen, 361015</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jie He. Email: <email>he.jie@zsxmhospital.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>06</day><month>03</month><year>2025</year>
</pub-date>
<volume>82</volume>
<issue>3</issue>
<fpage>4841</fpage>
<lpage>4862</lpage>
<history>
<date date-type="received">
<day>15</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>12</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_59733.pdf"></self-uri>
<abstract>
<p>The self-attention mechanism of Transformers, which captures long-range contextual information, has demonstrated significant potential in image segmentation. However, the ability of Transformers to learn local contextual relationships between pixels requires further improvement. Previous methods face challenges in efficiently managing multi-scale features of different granularities from the encoder backbone, leaving room for improvement in their global representation and feature extraction capabilities. To address these challenges, we propose a novel Decoder with Multi-Head Feature Receptors (DMHFR), which receives multi-scale features from the encoder backbone and organizes them into three feature groups with different granularities: coarse, fine-grained, and full set. After feature capture and modeling operations, these groups are processed by Multi-Head Feature Receptors (MHFRs), which comprise two Three-Head Feature Receptors (THFRs) and one Four-Head Feature Receptor (FHFR). Each group of features is passed through its MHFR and then fed into axial transformers, which help the model capture long-range dependencies within the features. The three MHFRs produce three distinct feature outputs. The output from the FHFR additionally serves as an auxiliary feature in the prediction head, and the prediction outputs and their losses are aggregated. Experimental results show that the Transformer using DMHFR outperforms 15 state-of-the-art (SOTA) methods on five public datasets. Specifically, it achieved significant improvements in mean DICE scores over the classic Parallel Reverse Attention Network (PraNet) method, with gains of 4.1%, 2.2%, 1.4%, 8.9%, and 16.3% on the CVC-ClinicDB, Kvasir-SEG, CVC-T, CVC-ColonDB, and ETIS-LaribPolypDB datasets, respectively.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Medical image segmentation</kwd>
<kwd>feature exploration</kwd>
<kwd>feature aggregation</kwd>
<kwd>deep learning</kwd>
<kwd>multi-head feature receptor</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Colorectal cancer (CRC) is one of the leading causes of mortality worldwide, with early detection and removal of precursors, such as polyps, being crucial for improving survival rates. Timely and accurate localization of polyps during colonoscopy can significantly reduce the incidence of CRC. However, manual inspection of colonoscopy images is often subjective, tedious, time-consuming, and prone to errors. Therefore, automatic and accurate polyp detection systems are urgently needed to assist physicians and reduce diagnostic mistakes. This problem is typically framed as a dense prediction task, which creates segmentation maps of organs or lesions through pixel-wise classification. Methods based on convolutional neural networks (CNNs) have seen significant success in computer vision tasks [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>], largely due to their inductive bias and ability to maintain translation invariance. One of the most notable methods is U-Net [<xref ref-type="bibr" rid="ref-1">1</xref>], which has made significant contributions to image detection and segmentation tasks through its encoder-decoder architecture. However, the receptive field of a CNN may restrict the model&#x2019;s focus to a localized area [<xref ref-type="bibr" rid="ref-4">4</xref>]. To address this issue, some works integrate attention mechanisms into their models [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>] to enhance pixel-level features for better classification. While these methods have led to performance improvements, their ability to capture long-range dependencies can still be insufficient.</p>
<p>In addition to the aforementioned studies, the vision transformer [<xref ref-type="bibr" rid="ref-8">8</xref>], a novel transformer model, has made significant breakthroughs in medical image segmentation. By leveraging self-attention mechanisms to learn relationships among all tokens, it addresses previous challenges in capturing long-range dependencies in medical images through its multi-head self-attention and multi-head perception fusion modules. Numerous vision-related tasks have adopted transformers in place of CNNs for feature extraction, achieving impressive performance [<xref ref-type="bibr" rid="ref-9">9</xref>&#x2013;<xref ref-type="bibr" rid="ref-11">11</xref>]. MedCLIP [<xref ref-type="bibr" rid="ref-12">12</xref>] demonstrates impressive performance on small-scale pre-training data by separating text and images for multimodal contrastive learning. Moreover, recent advancements in hierarchical vision transformers have helped reduce computational costs. Notable examples include the Swin transformer [<xref ref-type="bibr" rid="ref-9">9</xref>], which employs window attention mechanisms, and the pyramid vision transformer [<xref ref-type="bibr" rid="ref-13">13</xref>], which utilizes spatial reduction attention mechanisms. However, transformers may lose some fine-grained details and local pixel relationships during encoding because of downsampling, and they model these local details less effectively than CNNs, leading to suboptimal performance in some segmentation tasks. Although models like PVTv2 [<xref ref-type="bibr" rid="ref-11">11</xref>] and SegFormer [<xref ref-type="bibr" rid="ref-14">14</xref>] embed convolutional layers in an attempt to overcome this limitation, the discriminative capability of these methods is restricted by the positioning of the convolutional layers, which may limit their ability to model features effectively. Additionally, there is still room for improvement in processing multi-scale features from the encoder.</p>
<p>To address these challenges, we propose a novel Decoder with Multi-Head Feature Receptors (DMHFR), which receives pyramid features from the encoder backbone. We integrate the Frequency Channel Attention Network (FcaNet) [<xref ref-type="bibr" rid="ref-15">15</xref>] into the DMHFR. Before the features pass through the MHFRs, FcaNet models them in the frequency domain, emphasizing and learning the most important frequency components through a frequency channel attention module. This enables the network to better capture image details and texture information. The Multi-Head Feature Receptors (MHFRs) consist of two Three-Head Feature Receptors (THFRs) and one Four-Head Feature Receptor (FHFR). The two THFRs process the fine-grained and coarse-grained features separately, while the FHFR receives all the features from the encoder backbone. These receptors perceive and fuse multiple sets of features of varying granularity in parallel. Each MHFR produces an output, and the features passing through the FHFR are additionally used to generate an auxiliary prediction map in the prediction head. Axial transformers are integrated after the MHFRs to capture long-range dependencies in the feature map. This enhances overall feature representation, helping the model to focus on both the boundaries and internal textures of polyps. This capability is crucial for detecting polyps of various shapes and sizes and significantly contributes to the generalization ability of DMHFR. The superiority and effectiveness of DMHFR were validated through experiments on five public colorectal polyp datasets, where it achieved SOTA results in polyp segmentation. The proposed method contributes to the development of medical image segmentation and to the early diagnosis and treatment of colorectal cancer. Our contributions can be summarized as follows:
<list list-type="bullet">
<list-item>
<p>A novel network architecture. We propose a Decoder with Multi-Head Feature Receptors (DMHFR) for 2D medical image segmentation that offers high accuracy and robustness and can be integrated with other hierarchical visual encoders to enhance network performance.</p></list-item>
<list-item>
<p>A novel method to process multi-scale features. Our DMHFR groups four multi-scale features into three feature groups with different granularities: coarse, fine-grained, and full set. Handling multi-scale features in this way helps global and local features to be better perceived and integrated.</p></list-item>
<list-item>
<p>Two novel modules. The THFR and FHFR modules, collectively referred to as MHFRs, enhance feature representation by perceiving and aggregating feature maps at multiple resolutions, which offers great potential for improving deep learning in medical image segmentation.</p></list-item>
<list-item>
<p>Experimental results show that the proposed method exhibits outstanding learning ability and generalization ability compared with the SOTA methods on five public polyp datasets.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Traditional Methods for Polyp Segmentation</title>
<p>Early polyp segmentation techniques primarily relied on low-level feature processing, such as texture, color, and geometric characteristics [<xref ref-type="bibr" rid="ref-1">1</xref>]. Methods that focus on these features include region growing, watershed, and active contour analysis. The method of Sasmal et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] combines principal component pursuit with an active contour model. The region-based method [<xref ref-type="bibr" rid="ref-17">17</xref>] divides the image into multiple regions and evaluates the features within each region separately. A commonly used morphology-based method [<xref ref-type="bibr" rid="ref-17">17</xref>] applies pre-processing and post-processing to images to enhance edge features. The polyp segmentation method proposed by Gross et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] combines the Canny operator, nonlinear diffusion filtering, and other techniques. However, due to the high similarity between polyps and the surrounding tissue, these traditional methods often struggle with accuracy, leading to an increased likelihood of missed or incorrect detections.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>CNNs for Polyp Segmentation</title>
<p>CNN-based methods [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-23">23</xref>] have made significant contributions to the development of polyp segmentation and have outperformed traditional methods in feature modeling, noise suppression, inference speed, and generalization ability. Akbari et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] proposed a polyp segmentation model based on a fully convolutional neural network, which used a new image patch selection method in the training phase and post-processed the probability map generated by the network in the test phase. Brandao et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] obtained shape through a shading strategy, used it to recover depth, and passed the result to an RGB model to enrich the feature representation. Encoder-decoder-based models have demonstrated impressive performance in image segmentation. UNet [<xref ref-type="bibr" rid="ref-1">1</xref>] aggregates the encoder features with the upsampled features of the decoder through skip connections to generate high-resolution segmentation maps. UNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>] links the encoder and decoder through nested and dense skip connections. The skip connections of UNet 3&#x002B; [<xref ref-type="bibr" rid="ref-5">5</xref>] include full-size internal connections between decoder blocks. Dilated convolutions improve the encoder network by extracting and aggregating high-level semantic features while preserving resolution. With the advancement of computer vision, ResNet [<xref ref-type="bibr" rid="ref-4">4</xref>] has become a widely adopted backbone in medical image segmentation methods. In our proposed PVT-DMHFR, the PreActBottleneck of ResNet is employed to optimize feature perception in MHFRs. 
Mask R-CNN [<xref ref-type="bibr" rid="ref-26">26</xref>] was adapted with a deeper feature extractor [<xref ref-type="bibr" rid="ref-27">27</xref>] for polyp segmentation. PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>] generated a global attention feature map by inverting attention and used it to derive boundary information. PNS-Net [<xref ref-type="bibr" rid="ref-28">28</xref>] incorporated temporal and spatial cues based on self-attention for video polyp segmentation, while Spatial-Temporal Feature Transformation [<xref ref-type="bibr" rid="ref-29">29</xref>] aggregated features from adjacent frames to achieve notable performance in video polyp segmentation. U-KAN [<xref ref-type="bibr" rid="ref-30">30</xref>] redesigned the UNet by integrating dedicated Kolmogorov-Arnold Network (KAN) layers on the tokenized intermediate representation to improve segmentation performance. GCN-DE [<xref ref-type="bibr" rid="ref-31">31</xref>] projects both support and query images into a feature space, computes long-range and short-range dependencies within a global correlation module that processes the embeddings to reduce complexity, and applies discriminative regularization to draw features of the same foreground class closer together, enhancing the accuracy of few-shot medical image segmentation. Polyp-Net [<xref ref-type="bibr" rid="ref-32">32</xref>] utilized a local gradient weighting-embedded level-set method, effectively reducing false-positive instances caused by high-intensity regions during prediction. UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>] modified the U-Net-shaped network: in each prediction module, the foreground, background, and uncertainty region maps are aggregated with the features, and the saliency map that supports this computation is calculated and propagated to the next prediction module. 
SANet [<xref ref-type="bibr" rid="ref-34">34</xref>] minimized the impact of irrelevant features on predictions through color transformation. MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>] reduced complementary and redundant information across multi-scale features via a multi-scale subtraction network. While these CNN-based networks have achieved good results, they share a common limitation: an inability to efficiently capture both global information and fine details simultaneously. The decoder struggles to effectively aggregate global features for enhanced information supplementation, the overall generalization performance is not satisfactory, and the results in polyp segmentation also require further improvement.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Transformer for Polyp Segmentation</title>
<p>Transformer-based methods outperform CNN-based approaches due to their advantages in integrating global context, capturing long-range dependencies, and modeling features more effectively. DC-Net [<xref ref-type="bibr" rid="ref-36">36</xref>] is a dual context network that enhances segmentation performance by reshaping the original image and multiscale feature maps and integrating global and multiscale contextual information. Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>] combined CNN and transformer architectures through sequential and parallel connections in the encoder, performing well but limited in scalability due to the resulting increase in network size. Many methods [<xref ref-type="bibr" rid="ref-37">37</xref>&#x2013;<xref ref-type="bibr" rid="ref-40">40</xref>] use PVTv2 as the encoder backbone. SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>] sought to simplify the structure of the transformer encoder&#x2019;s backend, aiming to reduce model parameters while enhancing local information. However, this simplification led to a decline in performance. HSNet [<xref ref-type="bibr" rid="ref-38">38</xref>] combined CNN and Transformer in parallel within the decoder, though it did not fully address the differences in representation between the two architectures. Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>] utilized the Cascaded Fusion Module (CFM) to merge features and the Camouflage Identification Module (CIM) to extract intricate details directly from low-level features. Additionally, the Similarity Aggregation Module (SAM) was implemented to investigate higher-order relationships between the low-level local features from CIM and the high-level cues from CFM. Nevertheless, Polyp-PVT did not thoroughly explore and augment the encoder&#x2019;s output information, resulting in suboptimal feature aggregation by CFM. 
Furthermore, CIM&#x2019;s direct extraction of detailed information from low-level features introduced noise. PVT-CASCADE [<xref ref-type="bibr" rid="ref-40">40</xref>] accurately identified the most critical local features through its CASCADE module, which included an Attention Gate (AG), a multi-stage loss and feature aggregation component, as well as an upconv module. However, the upconv module risked losing fine-grained details, and the CASCADE module could result in redundant features, especially when dealing with objects with indistinct edges.</p>
<p>In conclusion, although previous methods have yielded impressive results in addressing the polyp segmentation challenge, they often fall short of effectively bridging the semantic gap between Transformer and CNN architectures. This unresolved gap can negatively affect network performance. Additionally, many Transformer-based polyp segmentation models struggle to adequately process the four pyramid features of varying granularities generated by the encoder backbone, which encapsulate rich spatial details and semantic information. To address these issues, we propose a novel Decoder with Multi-Head Feature Receptors.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Method</title>
<p>This section first presents the overall architecture of PVT-DMHFR. The proposed MHFRs (THFR and FHFR) are then described in detail.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Overall Architecture</title>
<p>The network architecture of PVT-DMHFR is illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. It consists primarily of a Transformer encoder and DMHFR. DMHFR mainly consists of MHFRs, FcaNet, and axial transformers. PVTv2 [<xref ref-type="bibr" rid="ref-11">11</xref>] functions as the Transformer encoder to capture features that represent long-range dependencies across multiple scales from the input image. As shown in <xref ref-type="fig" rid="fig-1">Fig. 1a</xref>,<xref ref-type="fig" rid="fig-1">b</xref>, DMHFR receives four pyramid features from the PVTv2 encoder (<inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>), which are then modeled in the frequency domain by FcaNet [<xref ref-type="bibr" rid="ref-15">15</xref>] to effectively capture the details and texture information of the image. 
These features are subsequently divided into two sets of three-element features: <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, as well as one set of four-element features: <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, they are respectively coarse-grained feature group, fine-grained feature group, and full-set feature group. These feature sets are fed into THFR32, THFR64, and FHFR128, which receive groups of features with the channel unified as 32, 64, and 128 respectively, then output after passing through the axial transformer [<xref ref-type="bibr" rid="ref-41">41</xref>].</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Architecture of PVT-DMHFR network. (a) Backbone: PVTv2-b2 Encoder; (b) DMHFR decoder</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-1.tif"/>
</fig>
<p>DMHFR perceives and fuses several combinations of features at different scales in parallel, which preserves much of the information in the input image. Most of the context output by the MHFRs can then be processed in parallel by the axial transformers to express the global dependencies among features. This architecture achieves SOTA performance on several polyp segmentation benchmarks; details are presented in the experimental section.</p>
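<p>A rough count illustrates why axial attention suits dense feature maps. Full self-attention over an H &#x00D7; W map compares every position with every other position, whereas axial attention, as it is generally described, restricts each position to its own row and its own column. The sketch below uses an 88 &#x00D7; 88 map purely for illustration; the resolution at which the axial transformers in DMHFR operate is not specified here, and the function name is hypothetical.</p>
<code language="python" xml:space="preserve">
```python
from typing import Tuple


def attention_pair_counts(height: int, width: int) -> Tuple[int, int]:
    """Count query-key pairs for full and for axial attention on an H x W map."""
    tokens = height * width
    full = tokens * tokens             # every position attends to every position
    axial = tokens * (height + width)  # one column pass plus one row pass
    return full, axial


full_pairs, axial_pairs = attention_pair_counts(88, 88)
print(full_pairs, axial_pairs, full_pairs // axial_pairs)
```
</code>
<p>For a square map of side 88, the axial variant evaluates 44 times fewer pairs than full attention in this count.</p>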
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Transformer Encoder</title>
<p>Recent studies [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-42">42</xref>] on vision tasks have demonstrated that transformer-based pyramid structures are better than CNNs in terms of generalization, robustness, and capturing multi-scale and multi-level features. In this proposed method, PVTv2 [<xref ref-type="bibr" rid="ref-11">11</xref>] is employed to extract multi-scale features. Unlike traditional Transformers, PVTv2 does not use a patch embedding module but uses convolution operations to consistently capture spatial information, delivering state-of-the-art performance across various dense prediction applications. PVTv2 generates pyramid features <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>88</mml:mn><mml:mo>,</mml:mo><mml:mn>88</mml:mn><mml:mo>,</mml:mo><mml:mn>64</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>44</mml:mn><mml:mo>,</mml:mo><mml:mn>44</mml:mn><mml:mo>,</mml:mo><mml:mn>128</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>22</mml:mn><mml:mo>,</mml:mo><mml:mn>22</mml:mn><mml:mo>,</mml:mo><mml:mn>320</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>11</mml:mn><mml:mo>,</mml:mo><mml:mn>11</mml:mn><mml:mo>,</mml:mo><mml:mn>512</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> according to input image <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>I</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi 
mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. These features are subsequently fed into the DMHFR.</p>
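<p>The listed shapes are consistent with a four-stage pyramid whose stages reduce the spatial resolution by factors of 4, 8, 16, and 32. The sketch below assumes a 352 &#x00D7; 352 input, which is inferred from the 88 &#x00D7; 88 first-stage size rather than stated in this section; the function name and default arguments are hypothetical.</p>
<code language="python" xml:space="preserve">
```python
from typing import List, Sequence, Tuple


def pyramid_shapes(
    height: int,
    width: int,
    strides: Sequence[int] = (4, 8, 16, 32),
    channels: Sequence[int] = (64, 128, 320, 512),
) -> List[Tuple[int, int, int]]:
    """Return (H_i, W_i, C_i) for each encoder stage, given the input size."""
    return [(height // s, width // s, c) for s, c in zip(strides, channels)]


# Assumed input resolution; 352 / 4 = 88 matches the first listed feature map.
shapes = pyramid_shapes(352, 352)
print(shapes)
```
</code>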
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Decoder with Multi-Head Feature Receptors (DMHFR)</title>
<p>Because polyps closely resemble their backgrounds and existing Transformer-based models have limited ability to process local contextual information among pixels, localizing highly discriminative local features in the segmentation task is challenging. To address this challenge, we propose a novel Decoder with Multi-Head Feature Receptors (DMHFR) for pyramid features.</p>
<p>As <xref ref-type="fig" rid="fig-1">Fig. 1b</xref> shows, DMHFR includes FcaNet [<xref ref-type="bibr" rid="ref-15">15</xref>] to model features in the frequency domain and perceive the details and texture information of images, our proposed MHFRs (two THFRs and one FHFR) to perceive and fuse pyramid features in parallel, and an axial transformer [<xref ref-type="bibr" rid="ref-41">41</xref>] to keep the full expressiveness of the joint distribution over features. The features of four different dimensions from the encoder backbone are passed through four FcaNet blocks, and the four features are then grouped into two three-element feature groups and one four-element feature group. These three groups are the coarse-grained, fine-grained, and full-set feature groups. The channels of the features in the groups <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> are adjusted to 32, 64, and 128, respectively, and they are passed to THFR32, THFR64, and FHFR128 accordingly. In MHFRs, the features within each group are perceived and fused in parallel, generating three output features. Next, these features are processed in parallel by the axial transformer, allowing the model to capture the global dependencies among features while preserving the full expressiveness of their joint distribution. Then, the channel of the output features will be adjusted to 32. Notably, the feature processed by FHFR128 is copied as an auxiliary prediction. This is because FHFR128 perceives and fuses all pyramid features in parallel, so its output is assumed to have a higher priority and designed to carry greater weight in the prediction output. 
Finally, the three output features <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and an auxiliary output feature <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are sent to the prediction head, and the four different predictions are fused to generate the final segmentation map.</p>
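<p>The routing described above can be summarized in a short sketch. It is illustrative only: the function name, the dictionary layout, and the string labels are hypothetical, and only the group membership and the channel widths (32, 64, and 128) are taken from the text.</p>
<code language="python"><![CDATA[
```python
# Hypothetical labels for the four FcaNet outputs (assumed to follow encoder
# stages 1 to 4); only the grouping and channel widths come from the text.
PYRAMID = ["X1'", "X2'", "X3'", "X4'"]


def build_feature_groups(features):
    """Route four pyramid features into the three groups fed to the MHFRs."""
    x1, x2, x3, x4 = features
    return {
        # Coarse-grained group, channels adjusted to 32.
        "THFR32": {"channels": 32, "inputs": [x4, x3, x2]},
        # Fine-grained group, channels adjusted to 64.
        "THFR64": {"channels": 64, "inputs": [x3, x2, x1]},
        # Full-set group, channels adjusted to 128.
        "FHFR128": {"channels": 128, "inputs": [x4, x3, x2, x1]},
    }


groups = build_feature_groups(PYRAMID)
for name, spec in groups.items():
    print(name, spec["channels"], spec["inputs"])
```
]]></code>
<p>Each receptor then produces one fused output, so the three groups yield the three features that are subsequently passed to the axial transformer.</p>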
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Integration of Frequency Channel Attention Network (FcaNet)</title>
<p>In our proposed method, we integrate FcaNet [<xref ref-type="bibr" rid="ref-15">15</xref>] into our network architecture by feeding the features from the PVTv2 encoder into FcaNet, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. This integration enhances the model&#x2019;s feature extraction capabilities: by leveraging frequency-domain attention, the network becomes more sensitive to subtle texture and boundary details of polyps. Despite the additional frequency processing, the network remains computationally efficient, because FcaNet applies attention only to the most important frequency components, which limits the overhead. The flexibility of FcaNet&#x2019;s frequency-based attention mechanism also improves the model&#x2019;s generalization ability across diverse polyp datasets and imaging conditions. Overall, FcaNet improves the accuracy and robustness of polyp segmentation by providing more detailed and contextually aware feature extraction. This process is formulated as <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula></p>
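<p>To make the idea of frequency-domain channel attention more concrete, the following is a minimal sketch in plain Python. It is not the FcaNet implementation used in this work: the function names are hypothetical, each channel is assigned a single 2D DCT frequency chosen by hand, and the learned fully connected layers that normally sit between the frequency descriptor and the gate are omitted.</p>
<code language="python"><![CDATA[
```python
import math


def dct_basis(u, v, height, width):
    """2D DCT-II basis function for the frequency indices (u, v)."""
    return [[math.cos(math.pi * u * (h + 0.5) / height)
             * math.cos(math.pi * v * (w + 0.5) / width)
             for w in range(width)] for h in range(height)]


def frequency_descriptor(channel, u, v):
    """Project one H x W channel onto a single DCT frequency component."""
    height, width = len(channel), len(channel[0])
    basis = dct_basis(u, v, height, width)
    return sum(channel[h][w] * basis[h][w]
               for h in range(height) for w in range(width))


def frequency_channel_attention(feature, freqs):
    """Rescale each channel by a sigmoid gate of its frequency descriptor.

    Assumption: the learned layers between descriptor and gate are left out,
    so the gate is computed directly from the raw descriptor.
    """
    out = []
    for channel, (u, v) in zip(feature, freqs):
        gate = 1.0 / (1.0 + math.exp(-frequency_descriptor(channel, u, v)))
        out.append([[gate * value for value in row] for row in channel])
    return out


# Toy feature with two channels of size 2 x 2, indexed as feature[c][h][w].
feature = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.5, -0.5], [0.5, -0.5]]]
attended = frequency_channel_attention(feature, freqs=[(0, 0), (0, 1)])
```
]]></code>
<p>With the frequency (0, 0) the basis is constant, so the descriptor reduces to the sum of the channel, which is proportional to global average pooling. Higher frequencies respond to horizontal or vertical variation such as edges and texture.</p>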
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Multi-Head Feature Receptor (MHFR)</title>
<p>To effectively perceive and fuse multi-scale features, we propose the three-head feature receptor (THFR) and the four-head feature receptor (FHFR), collectively referred to as Multi-Head Feature Receptors (MHFRs). MHFRs are primarily composed of PreActBottleneck blocks [<xref ref-type="bibr" rid="ref-43">43</xref>], which are an enhanced version of the bottleneck structure originally developed in ResNet [<xref ref-type="bibr" rid="ref-4">4</xref>]. These blocks are combined with simple operations between features: element-wise addition, element-wise multiplication, and channel-wise concatenation. Because these operations are simple, the design adds only a slight increase in resource demands while enhancing segmentation accuracy, and it bridges spatial and semantic information. The PreActBottleneck blocks improve representation capability and allow efficient training through better gradient flow. Together, these components allow the model to perceive and aggregate feature maps at multiple resolutions, thereby enhancing feature representation. Additionally, MHFRs can be adapted to process pyramid features in other models by adjusting their channel settings, which offers potential to enhance performance in various medical image segmentation tasks.</p>
<p>Three-Head Feature Receptor (THFR)</p>
<p>In PVT-DMHFR, we assume that adjacent features within the pyramid structure exhibit higher correlation. Therefore, the two THFRs receive one set of features each: three adjacent finer-grained features <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> and three adjacent coarser-grained features <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. This design allows THFR to capture more detailed and original information from the image.</p>
<p>As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, THFR receives three features of different sizes, with the height and width of each feature being double those of the preceding one. The channels of all input features need to be uniformly set to the same multiple of 32. The lowest-resolution feature, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, is processed through PreActBottleneck blocks, and all features are then fused through various element-wise operations before being output. This process can be expressed as <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>:</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Details of the introduced THFR. X1, X2, and X3 are multi-scale features, each with spatial dimensions that are double those of the preceding feature</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-2.tif"/>
</fig>
<p><disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="2em" /><mml:mspace 
width="1em" /><mml:mo>&#x2295;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>O</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2295;</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> refers to the feature generated by the initial fusion of <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> in THFR, &#x201C;<inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mo>&#x2295;</mml:mo></mml:math></inline-formula>&#x201D; denotes the element-wise addition, &#x201C;<inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mo>&#x2297;</mml:mo></mml:math></inline-formula>&#x201D; denotes the element-wise multiplication, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the bilinear interpolation quadruple upsampling operation, <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the bilinear interpolation double upsampling operation, <inline-formula id="ieqn-25"><mml:math 
id="mml-ieqn-25"><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> indicates the concatenation operation on the channel dimension, and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:mi>B</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> indicates the PreActBottleneck block.</p>
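<p>The recurring fusion pattern in Eq. (2), in which a smaller feature is upsampled, added to a larger feature, and then multiplied by that larger feature, can be illustrated with a small sketch. It rests on several assumptions: the helper names are hypothetical, the features are single-channel 2D maps, and nearest-neighbour repetition is used as a simple stand-in for the bilinear upsampling named in the text.</p>
<code language="python"><![CDATA[
```python
def upsample2x_nearest(feature):
    """Double height and width by repeating pixels.

    Stand-in for the bilinear U_2x operation; nearest-neighbour repetition
    is used here only to keep the example short.
    """
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise addition of two equally sized 2D maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def multiply(a, b):
    """Element-wise multiplication of two equally sized 2D maps."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_pair(small, large):
    """Compute (U_2x(small) + large) * large element-wise."""
    return multiply(add(upsample2x_nearest(small), large), large)


x1 = [[1.0]]                    # 1 x 1 map
x2 = [[1.0, 2.0], [3.0, 4.0]]   # 2 x 2 map, double the size of x1
fused = fuse_pair(x1, x2)
print(fused)  # [[2.0, 6.0], [12.0, 20.0]]
```
]]></code>
<p>The final row of Eq. (2) has the same add-then-multiply shape, with the larger feature acting both as an additive residual and as a multiplicative gate on the fused result.</p>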
<p>Four-Head Feature Receptor (FHFR)</p>
<p>As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, FHFR extends this approach to four encoder layers and receives four features of different sizes, so that all pyramid features are perceived and fused within it. Therefore, we assume that the FHFR output holds a higher priority and should carry more weight in the prediction output. To achieve this, we duplicate the output prediction to create an auxiliary prediction, which is then fused into the final prediction. As in THFR, the height and width of each feature are double those of the preceding one, and the channels of all input features need to be uniformly set to the same multiple of 32. The lowest-resolution feature, <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, is processed through PreActBottleneck blocks, and all features are then fused through various element-wise operations before being output. This process can be expressed as <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>:</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Details of the introduced FHFR. X1, X2, X3, and X4 are multi-scale features, each with spatial dimensions that are double those of the preceding feature</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-3.tif"/>
</fig>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2295;</mml:mo><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo 
stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="1em" /><mml:mspace width="2em" /><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mn>8</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>O</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2295;</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2295;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> refers to the feature generated by the initial fusion of <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> in FHFR, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1234</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> refers to the feature generated by the initial fusion of <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo 
stretchy="false">)</mml:mo></mml:math></inline-formula> denotes a convolutional layer with a kernel size of 3.</p>

</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Integration of Axial Transformer</title>
<p>In our proposed method, we integrate axial transformers into our network architecture by feeding the features processed by MHFRs into the axial transformers to capture long-range dependencies of the feature maps, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, thereby enhancing the overall feature representation. This integration allows the model to focus on both the boundaries and internal textures of polyps. By applying axial attention across the feature maps, the model efficiently captures the long-range dependencies crucial for detecting polyps in various shapes and sizes, even in cases where polyps are small or occluded by other tissues. This process can be written as <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>3</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mn>0.5</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>3</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> refers to axial transformer, <inline-formula id="ieqn-38"><mml:math 
id="mml-ieqn-38"><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> refers to the convolutional layer with a kernel size of 1, <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mn>0.5</mml:mn><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes a bilinear interpolation half-scale downsampling operation.</p>
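<p>As an illustration of how axial attention captures long-range dependencies at low cost, the sketch below applies self-attention along the height axis and then along the width axis of a small feature map. It is a simplified stand-in rather than the axial transformer used in this work: the function names are hypothetical, there are no learned query, key, or value projections (the tokens themselves play all three roles), and the convolution and downsampling steps of Eq. (4) are not included.</p>
<code language="python"><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Scaled dot-product self-attention over a 1D sequence of tokens.

    Assumption: queries, keys, and values are all the raw tokens.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[c] for w, tok in zip(weights, tokens))
                    for c in range(dim)])
    return out


def axial_attention(feature):
    """Attend along the height axis, then along the width axis.

    feature is a nested list indexed as feature[h][w][c].
    """
    height, width = len(feature), len(feature[0])
    # Height axis: every column is treated as a sequence of `height` tokens.
    cols = [attend([feature[h][w] for h in range(height)])
            for w in range(width)]
    mixed = [[cols[w][h] for w in range(width)] for h in range(height)]
    # Width axis: every row is treated as a sequence of `width` tokens.
    return [attend(row) for row in mixed]


# Toy 2 x 3 feature map with 2 channels per position.
feature = [[[1.0, 2.0] for _ in range(3)] for _ in range(2)]
result = axial_attention(feature)
```
]]></code>
<p>After the two passes every position has received information from every other position, while each attention call only processes a sequence of length H or W instead of the full H &#x00D7; W grid.</p>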
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Loss Function and Feature Fusion</title>
<p>We utilize additive aggregation with four prediction heads to compute the final prediction map, as expressed in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent feature maps from four prediction 
heads, and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denote the weights assigned to each feature map in the final prediction map. In PVT-DMHFR, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>p</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1.0</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>.</p>
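<p>The additive aggregation in Eq. (5) amounts to a weighted element-wise sum of the four prediction maps. The following sketch shows this with toy values; the function name and the example maps are hypothetical, and the default weights of 1.0 follow the setting stated for PVT-DMHFR.</p>
<code language="python"><![CDATA[
```python
def aggregate_predictions(maps, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted element-wise sum of equally sized 2D prediction maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps))
             for c in range(width)] for r in range(height)]


# Toy 2 x 2 maps standing in for P1, P2, P3, and P3_aux.
p1 = [[1.0, 1.0], [1.0, 1.0]]
p2 = [[2.0, 2.0], [2.0, 2.0]]
p3 = [[3.0, 3.0], [3.0, 3.0]]
p3_aux = [[4.0, 4.0], [4.0, 4.0]]

output = aggregate_predictions([p1, p2, p3, p3_aux])
print(output)  # [[10.0, 10.0], [10.0, 10.0]]
```
]]></code>
<p>Because the FHFR branch contributes two of the four terms, it has a larger share of the sum than either THFR branch even when all weights are equal.</p>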
<p>The loss function for feature maps of each prediction head can be expressed as <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>B</mml:mi><mml:mi>C</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the loss of a single prediction head, <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the weighted intersection over union (IoU) loss, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>B</mml:mi><mml:mi>C</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the weighted binary cross entropy (BCE) loss. This combination of loss functions constrains the prediction map with respect to both local details (pixel level) and global structure (object level).
The loss of each prediction head is computed separately, and the head losses are then aggregated as in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>where the weight of each head loss is set to <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>l</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1.0</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>.</p>
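<p>As a minimal, non-authoritative sketch of how <xref ref-type="disp-formula" rid="eqn-5">Eqs. (5)</xref> to <xref ref-type="disp-formula" rid="eqn-7">(7)</xref> fit together, the following pure-Python example fuses four head maps and sums the four head losses. The function names, the flat-list representation of maps, the uniform pixel weights, the smoothing term of 1, and the clamping constant are illustrative assumptions. The text does not specify how the pixel weights inside the weighted IoU and weighted BCE terms are computed, and the sketch assumes that the losses receive probabilities rather than logits.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math

EPS = 1e-7  # numerical guard for log(); an illustrative choice


def weighted_bce(pred, gt, weights):
    """Weighted binary cross entropy over flat lists of pixels."""
    total = 0.0
    for p, g, w in zip(pred, gt, weights):
        p = min(max(p, EPS), 1.0 - EPS)  # keep log() finite
        total += -w * (g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / sum(weights)


def weighted_iou(pred, gt, weights):
    """Weighted soft IoU loss: 1 - weighted intersection / weighted union."""
    inter = sum(w * p * g for p, g, w in zip(pred, gt, weights))
    union = sum(w * (p + g - p * g) for p, g, w in zip(pred, gt, weights))
    return 1.0 - (inter + 1.0) / (union + 1.0)  # +1 smoothing avoids 0/0


def head_loss(pred, gt, weights=None):
    """Eq. (6): loss of one prediction head."""
    if weights is None:
        weights = [1.0] * len(pred)  # assumption: uniform pixel weights
    return weighted_iou(pred, gt, weights) + weighted_bce(pred, gt, weights)


def total_loss(head_preds, gt, loss_weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (7): weighted sum of the four head losses."""
    return sum(lw * head_loss(p, gt) for lw, p in zip(loss_weights, head_preds))


def fuse_heads(head_maps, pred_weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (5): pixel-wise weighted sum of the four head maps."""
    n_pixels = len(head_maps[0])
    return [sum(pw * m[i] for pw, m in zip(pred_weights, head_maps))
            for i in range(n_pixels)]


if __name__ == "__main__":
    gt = [1.0, 1.0, 0.0, 0.0]
    heads = [[0.9, 0.8, 0.1, 0.2], [0.7, 0.9, 0.2, 0.1],
             [0.8, 0.8, 0.3, 0.1], [0.6, 0.7, 0.2, 0.3]]
    print("fused map:", fuse_heads(heads))
    print("total loss:", total_loss(heads, gt))
```
]]></code>
<p>With all weights equal to 1.0, as in PVT-DMHFR, the total loss reduces to the plain sum of the four head losses, so every head receives the same supervision strength.</p>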
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>In this section, we first experimentally evaluate the performance of our proposed DMHFR decoder by comparing its results with those of state-of-the-art methods. Additionally, we conduct ablation studies to assess the effectiveness of the DMHFR decoder.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>We validate the proposed method on five public polyp datasets. CVC-ClinicDB [<xref ref-type="bibr" rid="ref-44">44</xref>] contains 612 polyp images extracted from 31 colonoscopy videos. Kvasir-SEG [<xref ref-type="bibr" rid="ref-45">45</xref>] contains 1000 polyp images collected from the polyp class of the Kvasir dataset. CVC-T [<xref ref-type="bibr" rid="ref-46">46</xref>], a subset of the EndoScene dataset, contains 60 polyp images. CVC-ColonDB [<xref ref-type="bibr" rid="ref-47">47</xref>] contains 380 polyp images, and ETIS-LaribPolypDB [<xref ref-type="bibr" rid="ref-48">48</xref>] contains 196 polyp images.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluation Metrics</title>
<p>Following PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>], we use several metrics to assess the performance of the proposed method comprehensively. Mean Dice (mDic) [<xref ref-type="bibr" rid="ref-49">49</xref>] quantifies the overlap between the predicted and ground truth segmentations, producing a score between 0 and 1, with higher values indicating better accuracy. Mean intersection over union (mIoU) also measures overlap but is more stringent, as it is the ratio of the intersection to the union of the predicted and actual regions. Mean absolute error (MAE) is the average absolute difference between the prediction and the ground truth over all pixels, so lower values are better. Weighted F-measure (<inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>) [<xref ref-type="bibr" rid="ref-50">50</xref>] balances precision and recall and is particularly useful in scenarios with class imbalance.
S-measure (<inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) [<xref ref-type="bibr" rid="ref-51">51</xref>] evaluates structural similarity by considering region and boundary adherence, while E-measure (<inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) [<xref ref-type="bibr" rid="ref-52">52</xref>,<xref ref-type="bibr" rid="ref-53">53</xref>] extends this by incorporating boundary precision and region consistency. We report both the mean value of the E-measure (m<inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and the maximum value of the E-measure (max<inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>). The equation of mDic is given in <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>, and the equation of mIoU is given in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>m</mml:mi><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>G</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mi>G</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mi>m</mml:mi><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>G</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mo>&#x222A;</mml:mo><mml:mi>G</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>P</mml:mi></mml:math></inline-formula> refers to the prediction map, <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>G</mml:mi></mml:math></inline-formula> refers to the ground truth, <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:math></inline-formula> denotes the number of true positive pixels, <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:math></inline-formula> denotes the number of false positive pixels, and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:math></inline-formula> denotes the number of false negative pixels. The equation of <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is given in <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>:</p>
<p><disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mstyle></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mi>P</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mi>P</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mstyle></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-64"><mml:math 
id="mml-ieqn-64"><mml:mrow><mml:mi mathvariant="italic">R</mml:mi></mml:mrow></mml:math></inline-formula> refers to recall, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>P</mml:mi></mml:math></inline-formula> refers to precision (distinct from the prediction map in Eqs. (8) and (9)), and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> is a parameter that trades off <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mrow><mml:mi mathvariant="italic">R</mml:mi></mml:mrow></mml:math></inline-formula> against <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>P</mml:mi></mml:math></inline-formula>. The MAE score is computed by <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>:</p>
<p><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>|</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>H</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>W</mml:mi></mml:math></inline-formula> refer to the height and width of the image, respectively. The <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> score is computed by <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>:</p>
<p><disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> refer to the region-aware and object-aware similarity measures, respectively, and the trade-off coefficient, <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow></mml:math></inline-formula>, is set to 0.5 by default. The equation of <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is given in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>:</p>
<p><disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:munderover><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the enhanced alignment matrix, which captures both pixel-level matching and image-level statistics.</p>
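<p>The overlap and error metrics above can be illustrated with a short sketch. It covers only the confusion-count forms of <xref ref-type="disp-formula" rid="eqn-8">Eqs. (8)</xref> and <xref ref-type="disp-formula" rid="eqn-9">(9)</xref>, an unweighted form of <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>, and <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>. The function names, the flat-list mask representation, the handling of empty masks, and the default value of beta are assumptions made for illustration. The sketch also assumes that the "mean" in mDic and mIoU is taken over test images, which the text does not spell out.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
def confusion_counts(pred_mask, gt_mask):
    """Count TP, FP, FN over flat binary masks (lists of 0/1)."""
    tp = sum(1 for p, g in zip(pred_mask, gt_mask) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred_mask, gt_mask) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred_mask, gt_mask) if p == 0 and g == 1)
    return tp, fp, fn


def dice(tp, fp, fn):
    """Eq. (8) for one image: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumption: empty vs empty = 1


def iou(tp, fp, fn):
    """Eq. (9) for one image: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumption: empty vs empty = 1


def f_beta(tp, fp, fn, beta=1.0):
    """Unweighted F-measure; Eq. (10) uses weighted precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def mae(pred_map, gt_map):
    """Eq. (11) over flat maps with values in [0, 1]."""
    return sum(abs(p - g) for p, g in zip(pred_map, gt_map)) / len(pred_map)


def mean(values):
    """Average a per-image metric over a test set."""
    return sum(values) / len(values)


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    gt = [1, 0, 0, 0, 1, 1]
    tp, fp, fn = confusion_counts(pred, gt)
    print("Dice:", dice(tp, fp, fn))
    print("IoU:", iou(tp, fp, fn))
    print("F-measure:", f_beta(tp, fp, fn))
    print("MAE:", mae(pred, gt))
```
]]></code>
<p>For any single image the two overlap scores are linked by Dice = 2 IoU / (1 + IoU), so the Dice score is never lower than the IoU score, which is why mIoU is the more stringent of the two.</p>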
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Implementation Details</title>
<p>Our proposed method is implemented using the PyTorch 2.0.0 framework. The model is trained using an NVIDIA RTX 3090 GPU with 24 GB of memory. We use the Adam optimizer [<xref ref-type="bibr" rid="ref-54">54</xref>] and set the learning rate to 5 &#x00D7; 10<sup>&#x2212;5</sup> without decay. Following Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>], we use a multi-scale {0.75, 1.0, 1.25} training strategy with gradient clipping set to 0.5, configure the batch size to 16, set the maximum number of epochs to 100, and resize input images to 352 &#x00D7; 352 pixels. To ensure fairness in our comparative experiments, we adopt the same data division method as used in PraNet. A total of 900 images from Kvasir-SEG and 548 images from CVC-ClinicDB are used as training sets, while the remaining 100 images from Kvasir-SEG and 64 images from CVC-ClinicDB are reserved as test sets for evaluating the model&#x2019;s learning ability. We further assess the model&#x2019;s generalization ability on three additional datasets: CVC-ColonDB, CVC-T, and ETIS-Larib. 
In this paper, we compare the proposed method with 15 state-of-the-art image segmentation models, including UNet [<xref ref-type="bibr" rid="ref-1">1</xref>], UNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>], PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>], MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>], SANet [<xref ref-type="bibr" rid="ref-34">34</xref>], Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>], UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>], TMF-Net [<xref ref-type="bibr" rid="ref-55">55</xref>], C2F-Net [<xref ref-type="bibr" rid="ref-21">21</xref>], Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>], SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>], ColonFormer [<xref ref-type="bibr" rid="ref-56">56</xref>], ESFPNet [<xref ref-type="bibr" rid="ref-57">57</xref>], FCBFormer [<xref ref-type="bibr" rid="ref-58">58</xref>], and PVT-CASCADE [<xref ref-type="bibr" rid="ref-40">40</xref>].</p>
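<p>The training settings listed above can be collected in a small configuration sketch. The constant names and helper functions are hypothetical. Rounding each rescaled input size to a multiple of 32 and clipping gradients element-wise are assumptions borrowed from common PraNet-style training scripts; the text states only the scale set and the clipping threshold of 0.5.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math

# Values stated in the text
BASE_SIZE = 352            # input resolution in pixels
SCALES = (0.75, 1.0, 1.25)  # multi-scale training rates
GRAD_CLIP = 0.5
BATCH_SIZE = 16
LEARNING_RATE = 5e-5       # Adam, no decay
MAX_EPOCHS = 100

TRAIN_IMAGES = {"Kvasir-SEG": 900, "CVC-ClinicDB": 548}
TEST_IMAGES = {"Kvasir-SEG": 100, "CVC-ClinicDB": 64}


def scaled_size(base, rate, multiple=32):
    """Rescale the input size and snap it to a multiple (assumed: 32)."""
    return int(round(base * rate / multiple) * multiple)


def clip_value(grad, clip=GRAD_CLIP):
    """Element-wise gradient clipping to [-clip, clip] (assumed variant)."""
    return max(-clip, min(clip, grad))


def steps_per_epoch(n_images, batch_size=BATCH_SIZE):
    """Batches per epoch, assuming the last partial batch is kept."""
    return math.ceil(n_images / batch_size)


if __name__ == "__main__":
    n_train = sum(TRAIN_IMAGES.values())
    print("training images:", n_train)
    print("test images:", sum(TEST_IMAGES.values()))
    print("sizes per scale:", [scaled_size(BASE_SIZE, s) for s in SCALES])
    print("steps per epoch:", steps_per_epoch(n_train))
```
]]></code>
<p>Under these assumptions, each batch would be presented at three resolutions around the base size of 352 pixels, which exposes the decoder to polyps at different apparent scales without changing the dataset itself.</p>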
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Quantitative Analysis of Learning Ability</title>
<p><xref ref-type="table" rid="table-1">Table 1</xref> compares the feature modeling capability of PVT-DMHFR with that of 15 SOTA methods trained on the CVC-ClinicDB and Kvasir-SEG datasets. PVT-DMHFR exhibits superior feature modeling performance compared to the other methods. Compared to the transformer-based method PVT-Cascade, our approach improves the mDic and mIoU on the CVC-ClinicDB dataset by 1.2% and 2.1% in absolute terms, respectively. Additionally, it increases the <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> by 1.4%, the <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> by 1.2%, the m<inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> by 1.3%, and the max<inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> by 1.5%, and reduces the MAE by 0.6% (from 0.012 to 0.006).</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Quantitative results on CVC-ClinicDB and Kvasir-SEG datasets. The best result of each evaluation metric is bolded</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="7">CVC-ClinicDB</th>
<th colspan="7">Kvasir-SEG</th>
</tr>
<tr>
<th>mDic</th>
<th>mIoU</th>
<th><inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>m<inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>max<inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>MAE</th>
<th>mDic</th>
<th>mIoU</th>
<th><inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>m<inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>max<inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>MAE</th>
</tr>
</thead>
<tbody>
<tr>
<td>U-Net [<xref ref-type="bibr" rid="ref-1">1</xref>]</td>
<td>0.833</td>
<td>0.754</td>
<td>0.812</td>
<td>0.903</td>
<td>0.924</td>
<td>0.939</td>
<td>0.019</td>
<td>0.816</td>
<td>0.742</td>
<td>0.797</td>
<td>0.845</td>
<td>0.863</td>
<td>0.877</td>
<td>0.051</td>
</tr>
<tr>
<td>U-Net&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>0.901</td>
<td>0.843</td>
<td>0.898</td>
<td>0.929</td>
<td>0.958</td>
<td>0.973</td>
<td>0.015</td>
<td>0.823</td>
<td>0.744</td>
<td>0.815</td>
<td>0.856</td>
<td>0.891</td>
<td>0.898</td>
<td>0.044</td>
</tr>
<tr>
<td>PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>0.897</td>
<td>0.859</td>
<td>0.895</td>
<td>0.938</td>
<td>0.961</td>
<td>0.976</td>
<td>0.010</td>
<td>0.897</td>
<td>0.846</td>
<td>0.883</td>
<td>0.910</td>
<td>0.944</td>
<td>0.948</td>
<td>0.029</td>
</tr>
<tr>
<td>MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>0.922</td>
<td>0.862</td>
<td>0.904</td>
<td>0.943</td>
<td>0.973</td>
<td>0.986</td>
<td>0.009</td>
<td>0.889</td>
<td>0.834</td>
<td>0.881</td>
<td>0.907</td>
<td>0.937</td>
<td>0.942</td>
<td>0.034</td>
</tr>
<tr>
<td>SANet [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>0.914</td>
<td>0.855</td>
<td>0.902</td>
<td>0.937</td>
<td>0.971</td>
<td>0.975</td>
<td>0.011</td>
<td>0.902</td>
<td>0.849</td>
<td>0.892</td>
<td>0.913</td>
<td>0.945</td>
<td>0.950</td>
<td>0.032</td>
</tr>
<tr>
<td>Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>0.900</td>
<td>0.839</td>
<td>0.893</td>
<td>0.936</td>
<td>0.964</td>
<td>0.968</td>
<td>0.010</td>
<td>0.907</td>
<td>0.855</td>
<td>0.898</td>
<td>0.911</td>
<td>0.948</td>
<td>0.954</td>
<td>0.025</td>
</tr>
<tr>
<td>UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>0.911</td>
<td>0.854</td>
<td>0.912</td>
<td>0.948</td>
<td>0.970</td>
<td>0.979</td>
<td>0.009</td>
<td>0.913</td>
<td>0.862</td>
<td>0.897</td>
<td>0.914</td>
<td>0.949</td>
<td>0.958</td>
<td>0.027</td>
</tr>
<tr>
<td>TMF-Net [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>0.899</td>
<td>0.842</td>
<td>0.885</td>
<td>0.937</td>
<td>0.952</td>
<td>0.959</td>
<td>0.011</td>
<td>0.877</td>
<td>0.814</td>
<td>0.849</td>
<td>0.897</td>
<td>0.922</td>
<td>0.929</td>
<td>0.034</td>
</tr>
<tr>
<td>C2F-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>0.922</td>
<td>0.865</td>
<td>0.929</td>
<td>0.941</td>
<td>0.975</td>
<td>0.981</td>
<td>0.008</td>
<td>0.901</td>
<td>0.839</td>
<td>0.896</td>
<td>0.911</td>
<td>0.938</td>
<td>0.944</td>
<td>0.029</td>
</tr>
<tr>
<td>Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>0.935</td>
<td>0.887</td>
<td>0.933</td>
<td>0.949</td>
<td>0.982</td>
<td>0.986</td>
<td>0.006</td>
<td>0.907</td>
<td>0.863</td>
<td>0.903</td>
<td>0.914</td>
<td>0.956</td>
<td>0.961</td>
<td>0.027</td>
</tr>
<tr>
<td>SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>0.925</td>
<td>0.869</td>
<td>0.919</td>
<td>0.940</td>
<td>0.965</td>
<td>0.971</td>
<td>0.015</td>
<td>0.910</td>
<td>0.854</td>
<td>0.907</td>
<td>0.918</td>
<td>0.952</td>
<td>0.955</td>
<td>0.024</td>
</tr>
<tr>
<td>ColonFormer [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>0.924</td>
<td>0.866</td>
<td>0.918</td>
<td>0.946</td>
<td>0.974</td>
<td>0.978</td>
<td>0.009</td>
<td>0.914</td>
<td>0.858</td>
<td>0.910</td>
<td>0.919</td>
<td>0.958</td>
<td><bold>0.962</bold></td>
<td>0.026</td>
</tr>
<tr>
<td>ESFPNet [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>0.913</td>
<td>0.855</td>
<td>0.902</td>
<td>0.931</td>
<td>0.957</td>
<td>0.963</td>
<td>0.010</td>
<td>0.881</td>
<td>0.813</td>
<td>0.872</td>
<td>0.886</td>
<td>0.927</td>
<td>0.934</td>
<td>0.038</td>
</tr>
<tr>
<td>FCBFormer [<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>0.901</td>
<td>0.847</td>
<td>0.887</td>
<td>0.917</td>
<td>0.956</td>
<td>0.961</td>
<td>0.013</td>
<td>0.912</td>
<td>0.857</td>
<td>0.905</td>
<td>0.915</td>
<td>0.951</td>
<td>0.956</td>
<td>0.024</td>
</tr>
<tr>
<td>PVT-Cascade [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>0.926</td>
<td>0.871</td>
<td>0.921</td>
<td>0.937</td>
<td>0.972</td>
<td>0.974</td>
<td>0.012</td>
<td>0.916</td>
<td>0.862</td>
<td>0.908</td>
<td>0.921</td>
<td>0.957</td>
<td>0.960</td>
<td>0.024</td>
</tr>
<tr>
<td>PVT-DMHFR (Ours)</td>
<td><bold>0.938</bold></td>
<td><bold>0.892</bold></td>
<td><bold>0.935</bold></td>
<td><bold>0.949</bold></td>
<td><bold>0.985</bold></td>
<td><bold>0.989</bold></td>
<td><bold>0.006</bold></td>
<td><bold>0.919</bold></td>
<td><bold>0.866</bold></td>
<td><bold>0.910</bold></td>
<td><bold>0.924</bold></td>
<td><bold>0.958</bold></td>
<td>0.960</td>
<td><bold>0.023</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On the Kvasir-SEG dataset, compared to the best CNN-based method, UACANet, our method achieves a 0.6% improvement in the mDic, a 0.4% increase in the mIoU, a 1.3% rise in the <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, a 1.0% gain in the <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, a 0.9% boost in the m<inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, a 0.2% improvement in the max<inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and a 0.4% reduction in the MAE. Overall, our approach matches or exceeds every other method on the Kvasir-SEG dataset on all metrics except the max<inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where ColonFormer scores highest (0.962 vs. 0.960). On the CVC-ClinicDB dataset, our method matches or exceeds every other method on all evaluation metrics.</p>
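<p>The improvements quoted in this subsection are absolute differences between rows of <xref ref-type="table" rid="table-1">Table 1</xref>, expressed in percentage points. As a small check, the following sketch recomputes them from the table values. The dictionary layout, the metric labels, and the function name are illustrative assumptions; the numbers are copied from the table.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
# Column order follows Table 1; "wF" stands for the weighted F-measure.
METRICS = ("mDic", "mIoU", "wF", "S", "mE", "maxE", "MAE")

TABLE1 = {
    "CVC-ClinicDB": {
        "PVT-Cascade": (0.926, 0.871, 0.921, 0.937, 0.972, 0.974, 0.012),
        "PVT-DMHFR": (0.938, 0.892, 0.935, 0.949, 0.985, 0.989, 0.006),
    },
    "Kvasir-SEG": {
        "UACANet": (0.913, 0.862, 0.897, 0.914, 0.949, 0.958, 0.027),
        "PVT-DMHFR": (0.919, 0.866, 0.910, 0.924, 0.958, 0.960, 0.023),
    },
}


def gaps_in_points(dataset, baseline, ours="PVT-DMHFR"):
    """Absolute difference (ours - baseline) in percentage points."""
    rows = TABLE1[dataset]
    return {
        name: round((o - b) * 100, 1)
        for name, o, b in zip(METRICS, rows[ours], rows[baseline])
    }


if __name__ == "__main__":
    print(gaps_in_points("CVC-ClinicDB", "PVT-Cascade"))
    print(gaps_in_points("Kvasir-SEG", "UACANet"))
```
]]></code>
<p>A negative value for MAE indicates a reduction in error, so it counts as an improvement, whereas positive values are improvements for the other six metrics.</p>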
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Quantitative Analysis of Generalization Ability</title>
<p><xref ref-type="table" rid="table-2">Tables 2</xref> and <xref ref-type="table" rid="table-3">3</xref> compare the performance of our PVT-DMHFR with that of 15 methods on three unseen datasets: CVC-T, CVC-ColonDB, and ETIS-Larib. On the CVC-T dataset, our method achieves an mDic score 0.8% higher than the best CNN-based method, SANet, and 0.7% higher than the best transformer-based method, PVT-Cascade. Our method performs best on all metrics except max<inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and MAE, where it lags behind ColonFormer by 0.9% in max<inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and behind SSFormer by 0.1% in MAE. On the CVC-ColonDB dataset, our method surpasses the best CNN-based method, SANet, by 5.7% in mDic, and the transformer-based method, PVT-Cascade, by 0.9%. Our method performs best on all metrics except <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and MAE, where it lags behind Polyp-PVT by 0.2% and 0.3%, respectively. On the ETIS-Larib dataset, our method achieves an mDic score 4.4% higher than the best CNN-based method, SANet, and 0.6% higher than the best transformer-based method, PVT-Cascade. Our method performs best on all metrics except MAE, where it lags behind PVT-Cascade by 0.3%.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Quantitative results on CVC-T and CVC-ColonDB datasets. The best result of each evaluation metric is bolded</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="7">CVC-T</th>
<th colspan="7">CVC-ColonDB</th>
</tr>
<tr>
<th>mDic</th>
<th>mIoU</th>
<th><inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>m<inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>max<inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>MAE</th>
<th>mDic</th>
<th>mIoU</th>
<th><inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>m<inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>max<inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th align="center">MAE</th>
</tr>
</thead>
<tbody>
<tr>
<td>U-Net [<xref ref-type="bibr" rid="ref-1">1</xref>]</td>
<td>0.758</td>
<td>0.685</td>
<td>0.742</td>
<td>0.864</td>
<td>0.893</td>
<td>0.908</td>
<td>0.017</td>
<td>0.637</td>
<td>0.549</td>
<td>0.592</td>
<td>0.746</td>
<td>0.776</td>
<td>0.819</td>
<td>0.053</td>
</tr>
<tr>
<td>U-Net&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>0.794</td>
<td>0.732</td>
<td>0.767</td>
<td>0.869</td>
<td>0.899</td>
<td>0.915</td>
<td>0.011</td>
<td>0.631</td>
<td>0.553</td>
<td>0.614</td>
<td>0.765</td>
<td>0.774</td>
<td>0.812</td>
<td>0.048</td>
</tr>
<tr>
<td>PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>0.883</td>
<td>0.819</td>
<td>0.868</td>
<td>0.923</td>
<td>0.953</td>
<td>0.973</td>
<td>0.007</td>
<td>0.722</td>
<td>0.649</td>
<td>0.716</td>
<td>0.824</td>
<td>0.852</td>
<td>0.877</td>
<td>0.041</td>
</tr>
<tr>
<td>MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>0.873</td>
<td>0.801</td>
<td>0.852</td>
<td>0.930</td>
<td>0.952</td>
<td>0.968</td>
<td>0.008</td>
<td>0.748</td>
<td>0.682</td>
<td>0.728</td>
<td>0.836</td>
<td>0.861</td>
<td>0.868</td>
<td>0.043</td>
</tr>
<tr>
<td>SANet [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>0.889</td>
<td>0.814</td>
<td>0.843</td>
<td>0.927</td>
<td>0.956</td>
<td>0.974</td>
<td>0.008</td>
<td>0.754</td>
<td>0.680</td>
<td>0.734</td>
<td>0.837</td>
<td>0.867</td>
<td>0.876</td>
<td>0.040</td>
</tr>
<tr>
<td>Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>0.881</td>
<td>0.803</td>
<td>0.857</td>
<td>0.928</td>
<td>0.954</td>
<td>0.971</td>
<td>0.007</td>
<td>0.762</td>
<td>0.683</td>
<td>0.742</td>
<td>0.839</td>
<td>0.873</td>
<td>0.879</td>
<td>0.037</td>
</tr>
<tr>
<td>UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>0.876</td>
<td>0.799</td>
<td>0.852</td>
<td>0.931</td>
<td>0.951</td>
<td>0.962</td>
<td>0.008</td>
<td>0.750</td>
<td>0.674</td>
<td>0.739</td>
<td>0.831</td>
<td>0.864</td>
<td>0.898</td>
<td>0.039</td>
</tr>
<tr>
<td>TMF-Net [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>0.882</td>
<td>0.803</td>
<td>0.844</td>
<td>0.927</td>
<td>0.955</td>
<td>0.967</td>
<td>0.008</td>
<td>0.715</td>
<td>0.629</td>
<td>0.697</td>
<td>0.821</td>
<td>0.845</td>
<td>0.864</td>
<td>0.041</td>
</tr>
<tr>
<td>C2F-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>0.871</td>
<td>0.809</td>
<td>0.860</td>
<td>0.917</td>
<td>0.961</td>
<td>0.969</td>
<td>0.009</td>
<td>0.724</td>
<td>0.657</td>
<td>0.713</td>
<td>0.822</td>
<td>0.838</td>
<td>0.867</td>
<td>0.045</td>
</tr>
<tr>
<td>Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>0.886</td>
<td>0.816</td>
<td>0.863</td>
<td>0.930</td>
<td>0.958</td>
<td>0.966</td>
<td>0.009</td>
<td>0.809</td>
<td>0.728</td>
<td>0.794</td>
<td><bold>0.863</bold></td>
<td>0.909</td>
<td>0.914</td>
<td><bold>0.024</bold></td>
</tr>
<tr>
<td>SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>0.888</td>
<td>0.818</td>
<td>0.867</td>
<td>0.931</td>
<td>0.956</td>
<td>0.965</td>
<td><bold>0.007</bold></td>
<td>0.773</td>
<td>0.702</td>
<td>0.765</td>
<td>0.854</td>
<td>0.846</td>
<td>0.855</td>
<td>0.036</td>
</tr>
<tr>
<td>ColonFormer [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>0.891</td>
<td>0.829</td>
<td>0.879</td>
<td>0.929</td>
<td>0.958</td>
<td><bold>0.976</bold></td>
<td>0.008</td>
<td>0.799</td>
<td>0.721</td>
<td>0.786</td>
<td>0.848</td>
<td>0.897</td>
<td>0.901</td>
<td>0.032</td>
</tr>
<tr>
<td>ESFPNet [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>0.883</td>
<td>0.812</td>
<td>0.863</td>
<td>0.925</td>
<td>0.945</td>
<td>0.953</td>
<td>0.009</td>
<td>0.787</td>
<td>0.709</td>
<td>0.748</td>
<td>0.840</td>
<td>0.871</td>
<td>0.883</td>
<td>0.037</td>
</tr>
<tr>
<td>FCBFormer [<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>0.884</td>
<td>0.816</td>
<td>0.860</td>
<td>0.924</td>
<td>0.953</td>
<td>0.961</td>
<td>0.009</td>
<td>0.793</td>
<td>0.716</td>
<td>0.754</td>
<td>0.846</td>
<td>0.882</td>
<td>0.892</td>
<td>0.033</td>
</tr>
<tr>
<td>PVT-Cascade [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>0.892</td>
<td>0.826</td>
<td>0.874</td>
<td>0.931</td>
<td>0.957</td>
<td>0.964</td>
<td>0.008</td>
<td>0.802</td>
<td>0.726</td>
<td>0.791</td>
<td>0.853</td>
<td>0.901</td>
<td>0.906</td>
<td>0.030</td>
</tr>
<tr>
<td>PVT-DMHFR (Ours)</td>
<td><bold>0.897</bold></td>
<td><bold>0.833</bold></td>
<td><bold>0.880</bold></td>
<td><bold>0.934</bold></td>
<td><bold>0.963</bold></td>
<td>0.967</td>
<td>0.008</td>
<td><bold>0.811</bold></td>
<td><bold>0.731</bold></td>
<td><bold>0.798</bold></td>
<td>0.861</td>
<td><bold>0.911</bold></td>
<td><bold>0.915</bold></td>
<td>0.027</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Quantitative results on ETIS-LaribPolypDB dataset. The best result of each evaluation metric is bolded</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="7">ETIS-LaribPolypDB</th>
</tr>
<tr>
<th>mDic</th>
<th>mIoU</th>
<th><inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>m<inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>max<inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th align="center">MAE</th>
</tr>
</thead>
<tbody>
<tr>
<td>U-Net [<xref ref-type="bibr" rid="ref-1">1</xref>]</td>
<td>0.496</td>
<td>0.417</td>
<td>0.452</td>
<td>0.733</td>
<td>0.726</td>
<td>0.762</td>
<td>0.033</td>
</tr>
<tr>
<td>U-Net&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>0.536</td>
<td>0.476</td>
<td>0.503</td>
<td>0.748</td>
<td>0.717</td>
<td>0.773</td>
<td>0.031</td>
</tr>
<tr>
<td>PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>0.631</td>
<td>0.567</td>
<td>0.601</td>
<td>0.786</td>
<td>0.788</td>
<td>0.816</td>
<td>0.029</td>
</tr>
<tr>
<td>MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>0.642</td>
<td>0.579</td>
<td>0.608</td>
<td>0.807</td>
<td>0.804</td>
<td>0.832</td>
<td>0.055</td>
</tr>
<tr>
<td>SANet [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>0.747</td>
<td>0.655</td>
<td>0.687</td>
<td>0.852</td>
<td>0.883</td>
<td>0.899</td>
<td>0.017</td>
</tr>
<tr>
<td>Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>0.675</td>
<td>0.590</td>
<td>0.614</td>
<td>0.807</td>
<td>0.832</td>
<td>0.867</td>
<td>0.033</td>
</tr>
<tr>
<td>UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>0.684</td>
<td>0.602</td>
<td>0.638</td>
<td>0.813</td>
<td>0.856</td>
<td>0.890</td>
<td>0.017</td>
</tr>
<tr>
<td>TMF-Net [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>0.643</td>
<td>0.584</td>
<td>0.616</td>
<td>0.783</td>
<td>0.837</td>
<td>0.869</td>
<td>0.024</td>
</tr>
<tr>
<td>C2F-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>0.679</td>
<td>0.614</td>
<td>0.653</td>
<td>0.814</td>
<td>0.829</td>
<td>0.884</td>
<td>0.031</td>
</tr>
<tr>
<td>Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>0.782</td>
<td>0.709</td>
<td>0.746</td>
<td>0.873</td>
<td>0.894</td>
<td>0.901</td>
<td>0.016</td>
</tr>
<tr>
<td>SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>0.771</td>
<td>0.713</td>
<td>0.742</td>
<td>0.879</td>
<td>0.891</td>
<td>0.899</td>
<td>0.019</td>
</tr>
<tr>
<td>ColonFormer [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>0.783</td>
<td>0.707</td>
<td>0.740</td>
<td>0.871</td>
<td>0.886</td>
<td>0.894</td>
<td>0.020</td>
</tr>
<tr>
<td>ESFPNet [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>0.768</td>
<td>0.672</td>
<td>0.719</td>
<td>0.848</td>
<td>0.870</td>
<td>0.885</td>
<td>0.017</td>
</tr>
<tr>
<td>FCBFormer [<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>0.756</td>
<td>0.669</td>
<td>0.708</td>
<td>0.842</td>
<td>0.867</td>
<td>0.883</td>
<td>0.018</td>
</tr>
<tr>
<td>PVT-Cascade [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>0.785</td>
<td>0.711</td>
<td>0.754</td>
<td>0.870</td>
<td>0.895</td>
<td>0.900</td>
<td><bold>0.013</bold></td>
</tr>
<tr>
<td>PVT-DMHFR (Ours)</td>
<td><bold>0.791</bold></td>
<td><bold>0.718</bold></td>
<td><bold>0.756</bold></td>
<td><bold>0.870</bold></td>
<td><bold>0.901</bold></td>
<td><bold>0.904</bold></td>
<td>0.016</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-4">Table 4</xref> reports the performance of DMHFR combined with different backbones, including hierarchical transformer-based backbones (PVTv2b1 [<xref ref-type="bibr" rid="ref-11">11</xref>], PVTv2b2 [<xref ref-type="bibr" rid="ref-11">11</xref>], PVTv2b3 [<xref ref-type="bibr" rid="ref-11">11</xref>], PVTv2b4 [<xref ref-type="bibr" rid="ref-11">11</xref>], PVTv2b5 [<xref ref-type="bibr" rid="ref-11">11</xref>], mitb5 [<xref ref-type="bibr" rid="ref-14">14</xref>]), a transformer-based backbone (R50&#x002B;ViT-B_16 [<xref ref-type="bibr" rid="ref-8">8</xref>]), and a CNN-based backbone (ResNetV2 [<xref ref-type="bibr" rid="ref-43">43</xref>]). The results show that hierarchical transformer-based backbones, particularly PVTv2b3 and PVTv2b2, generally outperform both the transformer-based (R50&#x002B;ViT-B_16) and CNN-based (ResNetV2) backbones across the five datasets.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Quantitative results using different backbones for DMHFR. The top two results are highlighted in red and blue, respectively</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th colspan="2"></th>
<th colspan="6">Hierarchical transformer-based backbone</th>
<th align="center">Transformer-based backbone</th>
<th align="center">CNN-based backbone</th>
</tr>
<tr>
<th>Dataset</th>
<th>Metric</th>
<th align="center">PVTv2b1<break/> [<xref ref-type="bibr" rid="ref-11">11</xref>]</th>
<th align="center">PVTv2b2<break/> [<xref ref-type="bibr" rid="ref-11">11</xref>]</th>
<th align="center">PVTv2b3<break/> [<xref ref-type="bibr" rid="ref-11">11</xref>]</th>
<th align="center">PVTv2b4<break/> [<xref ref-type="bibr" rid="ref-11">11</xref>]</th>
<th align="center">PVTv2b5<break/> [<xref ref-type="bibr" rid="ref-11">11</xref>]</th>
<th align="center">mitb5<break/> [<xref ref-type="bibr" rid="ref-14">14</xref>]</th>
<th>R50&#x002B;ViT-B_16 [<xref ref-type="bibr" rid="ref-8">8</xref>]</th>
<th>ResNetV2 [<xref ref-type="bibr" rid="ref-43">43</xref>]</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">CVC-ClinicDB</td>
<td>mDic</td>
<td>0.931</td>
<td><bold><styled-content style="color:#0000FF">0.938</styled-content></bold></td>
<td><bold><styled-content style="color:#FF0000">0.943</styled-content></bold></td>
<td>0.924</td>
<td>0.929</td>
<td>0.930</td>
<td>0.931</td>
<td>0.905</td>
</tr>
<tr>
<td>mIoU</td>
<td>0.885</td>
<td><bold><styled-content style="color:#0000FF">0.892</styled-content></bold></td>
<td><bold><styled-content style="color:#FF0000">0.899</styled-content></bold></td>
<td>0.876</td>
<td>0.884</td>
<td>0.887</td>
<td>0.881</td>
<td>0.851</td>
</tr>
<tr>
<td rowspan="2">Kvasir-SEG</td>
<td>mDic</td>
<td>0.916</td>
<td><bold><styled-content style="color:#0000FF">0.919</styled-content></bold></td>
<td>0.913</td>
<td>0.906</td>
<td><bold><styled-content style="color:#FF0000">0.922</styled-content></bold></td>
<td>0.909</td>
<td>0.886</td>
<td>0.865</td>
</tr>
<tr>
<td>mIoU</td>
<td>0.865</td>
<td><bold><styled-content style="color:#0000FF">0.866</styled-content></bold></td>
<td>0.864</td>
<td>0.858</td>
<td><bold><styled-content style="color:#FF0000">0.876</styled-content></bold></td>
<td>0.859</td>
<td>0.824</td>
<td>0.798</td>
</tr>
<tr>
<td rowspan="2">CVC-T</td>
<td>mDic</td>
<td><bold><styled-content style="color:#0000FF">0.902</styled-content></bold></td>
<td>0.897</td>
<td>0.896</td>
<td><bold><styled-content style="color:#FF0000">0.909</styled-content></bold></td>
<td>0.889</td>
<td>0.890</td>
<td>0.825</td>
<td>0.840</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold><styled-content style="color:#0000FF">0.837</styled-content></bold></td>
<td>0.833</td>
<td>0.830</td>
<td><bold><styled-content style="color:#FF0000">0.843</styled-content></bold></td>
<td>0.824</td>
<td>0.821</td>
<td>0.741</td>
<td>0.759</td>
</tr>
<tr>
<td rowspan="2">CVC-ColonDB</td>
<td>mDic</td>
<td>0.770</td>
<td><bold><styled-content style="color:#0000FF">0.811</styled-content></bold></td>
<td>0.793</td>
<td>0.811</td>
<td><bold><styled-content style="color:#FF0000">0.823</styled-content></bold></td>
<td>0.802</td>
<td>0.720</td>
<td>0.704</td>
</tr>
<tr>
<td>mIoU</td>
<td>0.695</td>
<td><bold><styled-content style="color:#0000FF">0.731</styled-content></bold></td>
<td>0.714</td>
<td>0.731</td>
<td><bold><styled-content style="color:#FF0000">0.747</styled-content></bold></td>
<td>0.724</td>
<td>0.643</td>
<td>0.623</td>
</tr>
<tr>
<td rowspan="2">ETIS-LaribPolypDB</td>
<td>mDic</td>
<td>0.757</td>
<td><bold><styled-content style="color:#0000FF">0.791</styled-content></bold></td>
<td><bold><styled-content style="color:#FF0000">0.794</styled-content></bold></td>
<td>0.787</td>
<td>0.783</td>
<td>0.789</td>
<td>0.582</td>
<td>0.566</td>
</tr>
<tr>
<td>mIoU</td>
<td>0.688</td>
<td><bold><styled-content style="color:#FF0000">0.718</styled-content></bold></td>
<td>0.715</td>
<td><bold><styled-content style="color:#0000FF">0.716</styled-content></bold></td>
<td>0.714</td>
<td>0.714</td>
<td>0.508</td>
<td>0.491</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For example, PVTv2b3 achieves the highest mDic (0.943) and mIoU (0.899) on the CVC-ClinicDB dataset, while PVTv2b5 excels on Kvasir-SEG with an mDic of 0.922 and an mIoU of 0.876. Additionally, mitb5 delivers well-balanced, strong performance across all five datasets. In contrast, the transformer-based backbone typically performs significantly worse than its hierarchical counterparts, remaining competitive in only one case (an mDic of 0.931 on CVC-ClinicDB). Meanwhile, the CNN-based backbone underperforms the hierarchical backbones on every dataset, underscoring its limited effectiveness in these tasks when working in conjunction with DMHFR. This performance disparity can likely be attributed to the alignment between the hierarchical transformer-based backbones and DMHFR&#x2019;s input requirements. The hierarchical transformer-based backbones naturally produce outputs with dimensions that align with DMHFR&#x2019;s architecture, minimizing the need for extensive preprocessing or additional layers. Conversely, both the transformer-based and CNN-based backbones require supplementary adjustments, such as additional transformations or layers, to ensure compatibility with DMHFR. These modifications not only introduce computational overhead but may also disrupt the original feature distributions, leading to reduced performance.</p>
<p>By contrast, the seamless integration of hierarchical transformer-based backbones with DMHFR enhances efficiency and preserves the integrity of feature representations. This synergy allows for streamlined processing and optimized information flow, resulting in superior performance across diverse datasets.</p>
<p>Overall, the combination of DMHFR and hierarchical transformer-based backbones demonstrates remarkable adaptability and effectiveness across diverse datasets.</p>
<p>Based on the above analysis, our PVT-DMHFR shows impressive learning and generalization capabilities on the challenging task of polyp segmentation, as well as performance that is superior to other SOTA methods.</p>
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Analysis of Visual Results</title>
<p>To thoroughly evaluate the performance of our proposed method, we compared our PVT-DMHFR with SOTA methods in terms of visual results. As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, our PVT-DMHFR demonstrates several advantages over these SOTA methods: First, the transformer-based encoder backbone we employed, PVTv2, enhances polyp localization accuracy. Second, PVT-DMHFR consistently produces segmentation results with high accuracy for polyps of various sizes and shapes. This stability and accuracy are largely attributed to the proposed MHFRs, which effectively capture and fuse multiple groups of multi-scale information. Additionally, the integration of FcaNet and the axial transformer, applied before and after the MHFRs, strengthens the model&#x2019;s ability to extract features from the encoder backbone and capture long-range dependencies within the feature map, significantly improving overall feature representation.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Visualization results of SOTA methods on five datasets</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-4.tif"/>
</fig>
</sec>
<sec id="s4_7">
<label>4.7</label>
<title>Analysis of Computational Efficiency</title>
<p>We compare PVT-DMHFR with SOTA methods in terms of floating-point operations (FLOPs) and the number of parameters (Params). As shown in <xref ref-type="table" rid="table-5">Table 5</xref>, the proposed PVT-DMHFR uses fewer parameters (30.05 M) than several strong competitors (e.g., PVT-Cascade, ColonFormer, SSFormer, FCBFormer, and ESFPNet). However, it is less efficient in FLOPs (55.84 G): among these competitors, it requires fewer FLOPs only than FCBFormer. Overall, PVT-DMHFR trades additional computation for a moderate parameter count and higher segmentation accuracy.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Comparison results of computational efficiency</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>FLOPs (G)</th>
<th>Params (M)</th>
</tr>
</thead>
<tbody>
<tr>
<td>U-Net [<xref ref-type="bibr" rid="ref-1">1</xref>]</td>
<td>103.49</td>
<td>31.04</td>
</tr>
<tr>
<td>U-Net&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>377.45</td>
<td>47.18</td>
</tr>
<tr>
<td>PraNet [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>13.15</td>
<td>30.50</td>
</tr>
<tr>
<td>MSNet [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>16.97</td>
<td>27.69</td>
</tr>
<tr>
<td>SANet [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>11.27</td>
<td>23.90</td>
</tr>
<tr>
<td>Transfuse [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>21.75</td>
<td><bold>8.65</bold></td>
</tr>
<tr>
<td>UACANet [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>59.65</td>
<td>67.11</td>
</tr>
<tr>
<td>TMF-Net [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>28.78</td>
<td>52.89</td>
</tr>
<tr>
<td>C2F-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>36.13</td>
<td>25.21</td>
</tr>
<tr>
<td>Polyp-PVT [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td><bold>10.02</bold></td>
<td>25.11</td>
</tr>
<tr>
<td>SSFormer [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>32.68</td>
<td>65.96</td>
</tr>
<tr>
<td>ColonFormer [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>22.98</td>
<td>52.95</td>
</tr>
<tr>
<td>ESFPNet [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>21.94</td>
<td>61.69</td>
</tr>
<tr>
<td>FCBFormer [<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>73.30</td>
<td>52.94</td>
</tr>
<tr>
<td>PVT-Cascade [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>15.40</td>
<td>35.27</td>
</tr>
<tr>
<td>PVT-DMHFR (Ours)</td>
<td>55.84</td>
<td>30.05</td>
</tr>
</tbody>
</table>
</table-wrap>
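<p>The position of PVT-DMHFR in the efficiency comparison can be checked directly from the values in Table 5. The following is a minimal sketch rather than part of the evaluation pipeline: the dictionary simply transcribes the table, and the helper names <monospace>rank_of</monospace> and <monospace>costlier_than</monospace> are hypothetical names introduced here for illustration.</p>
<preformat>
```python
# (FLOPs in G, Params in M) for each method, transcribed from Table 5.
TABLE5 = {
    "U-Net": (103.49, 31.04),
    "U-Net++": (377.45, 47.18),
    "PraNet": (13.15, 30.50),
    "MSNet": (16.97, 27.69),
    "SANet": (11.27, 23.90),
    "Transfuse": (21.75, 8.65),
    "UACANet": (59.65, 67.11),
    "TMF-Net": (28.78, 52.89),
    "C2F-Net": (36.13, 25.21),
    "Polyp-PVT": (10.02, 25.11),
    "SSFormer": (32.68, 65.96),
    "ColonFormer": (22.98, 52.95),
    "ESFPNet": (21.94, 61.69),
    "FCBFormer": (73.30, 52.94),
    "PVT-Cascade": (15.40, 35.27),
    "PVT-DMHFR": (55.84, 30.05),
}
FLOPS, PARAMS = 0, 1  # column indices into each tuple


def rank_of(method, column):
    """1-based rank of `method` when methods are sorted by ascending cost."""
    ordered = sorted(TABLE5, key=lambda name: TABLE5[name][column])
    return ordered.index(method) + 1


def costlier_than(method, column):
    """Alphabetical list of methods whose cost in `column` exceeds that of `method`."""
    reference = TABLE5[method][column]
    return sorted(name for name, cost in TABLE5.items() if cost[column] > reference)


if __name__ == "__main__":
    for label, column in (("FLOPs", FLOPS), ("Params", PARAMS)):
        print(label, "rank:", rank_of("PVT-DMHFR", column), "of", len(TABLE5))
        print("  costlier methods:", costlier_than("PVT-DMHFR", column))
```
</preformat>
<p>With the tabulated values, PVT-DMHFR ranks 6th of 16 in Params and 12th of 16 in FLOPs; the methods with higher FLOPs are FCBFormer, U-Net, U-Net&#x002B;&#x002B;, and UACANet.</p>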
</sec>
<sec id="s4_8">
<label>4.8</label>
<title>Ablation Studies</title>
<p>We performed ablation experiments to assess the contribution of each component of our method. The mDic and mIoU metrics were selected to represent network performance, and the results are summarized in <xref ref-type="table" rid="table-6">Table 6</xref>. When the FcaNet modules were removed, the mDic and mIoU scores on the CVC-ClinicDB dataset dropped by 0.4% and 0.5%, respectively. This indicates that passing features through the FcaNet module before processing them with the MHFRs helps the model learn and represent features more effectively. On the ETIS-LaribPolypDB dataset, the mDic and mIoU scores decreased by 0.3% and 1.1%, respectively, suggesting that FcaNet improves the model&#x2019;s generalization ability through superior feature capture. After removing the MHFRs, the mDic and mIoU scores fell by 1.5% and 1.3%, respectively, on the CVC-ClinicDB dataset, and by 1.6% each on the CVC-ColonDB dataset. These results demonstrate that MHFRs play a crucial role in fusing multiple feature sets at various resolutions, leading to a clear improvement in segmentation accuracy. When the axial transformers were excluded, the mDic and mIoU scores dropped by 0.8% and 1.0%, respectively, on the CVC-ClinicDB dataset, and by 0.8% and 0.7% on the CVC-ColonDB dataset. This underscores the importance of axial transformers in capturing long-range dependencies within feature maps after MHFR processing, further enhancing feature representation and improving both segmentation accuracy and generalization.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Quantitative results for ablation studies on five polyp datasets</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Metric</th>
<th>PVT-DMHFR</th>
<th>w/o FcaNet</th>
<th>w/o MHFRs</th>
<th align="center">w/o axial transformers</th>
<th>Baseline</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">CVC-ClinicDB</td>
<td>mDic</td>
<td><bold>0.938</bold></td>
<td>0.934</td>
<td>0.923</td>
<td>0.930</td>
<td>0.901</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold>0.892</bold></td>
<td>0.887</td>
<td>0.879</td>
<td>0.882</td>
<td>0.848</td>
</tr>
<tr>
<td rowspan="2">Kvasir-SEG</td>
<td>mDic</td>
<td><bold>0.919</bold></td>
<td>0.917</td>
<td>0.916</td>
<td>0.914</td>
<td>0.910</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold>0.866</bold></td>
<td>0.864</td>
<td>0.859</td>
<td>0.861</td>
<td>0.854</td>
</tr>
<tr>
<td rowspan="2">CVC-T</td>
<td>mDic</td>
<td><bold>0.897</bold></td>
<td>0.892</td>
<td>0.889</td>
<td>0.895</td>
<td>0.873</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold>0.833</bold></td>
<td>0.832</td>
<td>0.827</td>
<td>0.828</td>
<td>0.804</td>
</tr>
<tr>
<td rowspan="2">CVC-ColonDB</td>
<td>mDic</td>
<td><bold>0.811</bold></td>
<td>0.809</td>
<td>0.795</td>
<td>0.803</td>
<td>0.792</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold>0.731</bold></td>
<td>0.721</td>
<td>0.715</td>
<td>0.724</td>
<td>0.709</td>
</tr>
<tr>
<td rowspan="2">ETIS-LaribPolypDB</td>
<td>mDic</td>
<td><bold>0.791</bold></td>
<td>0.788</td>
<td>0.776</td>
<td>0.785</td>
<td>0.774</td>
</tr>
<tr>
<td>mIoU</td>
<td><bold>0.718</bold></td>
<td>0.707</td>
<td>0.710</td>
<td>0.712</td>
<td>0.670</td>
</tr>
</tbody>
</table>
</table-wrap>
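<p>The per-component drops discussed for the ablation study can be recomputed from Table 6. The sketch below is illustrative only: it transcribes the table into a dictionary and reports drops in percentage points relative to the full PVT-DMHFR model. The helper name <monospace>ablation_drop</monospace> is a hypothetical name introduced here.</p>
<preformat>
```python
# (mDic, mIoU) for each configuration, transcribed from Table 6.
TABLE6 = {
    "CVC-ClinicDB": {
        "PVT-DMHFR": (0.938, 0.892),
        "w/o FcaNet": (0.934, 0.887),
        "w/o MHFRs": (0.923, 0.879),
        "w/o axial transformers": (0.930, 0.882),
        "Baseline": (0.901, 0.848),
    },
    "Kvasir-SEG": {
        "PVT-DMHFR": (0.919, 0.866),
        "w/o FcaNet": (0.917, 0.864),
        "w/o MHFRs": (0.916, 0.859),
        "w/o axial transformers": (0.914, 0.861),
        "Baseline": (0.910, 0.854),
    },
    "CVC-T": {
        "PVT-DMHFR": (0.897, 0.833),
        "w/o FcaNet": (0.892, 0.832),
        "w/o MHFRs": (0.889, 0.827),
        "w/o axial transformers": (0.895, 0.828),
        "Baseline": (0.873, 0.804),
    },
    "CVC-ColonDB": {
        "PVT-DMHFR": (0.811, 0.731),
        "w/o FcaNet": (0.809, 0.721),
        "w/o MHFRs": (0.795, 0.715),
        "w/o axial transformers": (0.803, 0.724),
        "Baseline": (0.792, 0.709),
    },
    "ETIS-LaribPolypDB": {
        "PVT-DMHFR": (0.791, 0.718),
        "w/o FcaNet": (0.788, 0.707),
        "w/o MHFRs": (0.776, 0.710),
        "w/o axial transformers": (0.785, 0.712),
        "Baseline": (0.774, 0.670),
    },
}


def ablation_drop(dataset, variant):
    """Return (mDic drop, mIoU drop) in percentage points versus the full model."""
    full = TABLE6[dataset]["PVT-DMHFR"]
    ablated = TABLE6[dataset][variant]
    # Rounding to one decimal removes floating-point noise; table values have three decimals.
    return tuple(round((f - a) * 100, 1) for f, a in zip(full, ablated))


if __name__ == "__main__":
    for dataset, rows in TABLE6.items():
        for variant in rows:
            if variant != "PVT-DMHFR":
                print(dataset, "|", variant, "|", ablation_drop(dataset, variant))
```
</preformat>
<p>For instance, <monospace>ablation_drop("CVC-ClinicDB", "w/o FcaNet")</monospace> evaluates to <monospace>(0.4, 0.5)</monospace>, which matches the drops quoted for removing FcaNet on that dataset.</p>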
<p>We present visualization results to better demonstrate the impact of our proposed MHFRs and integration of FcaNet modules and axial transformers in PVT-DMHFR. As illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, the removal of any module from PVT-DMHFR results in a noticeable decline in segmentation accuracy. This performance degradation may stem from several factors: excluding FcaNet reduces the model&#x2019;s capacity to capture detailed features, removing the axial transformers diminishes its ability to account for long-range feature dependencies, and omitting MHFRs impairs the fusion of multi-level features from the encoder backbone, leading to the loss of crucial semantic information.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Segmentation results under different configurations of PVT-DMHFR</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-5a.tif"/>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_59733-fig-5b.tif"/>
</fig>
<p>Additionally, we explore the impact of replacing the Axial-Transformer with four widely used alternative attention mechanisms: SE-Attention [<xref ref-type="bibr" rid="ref-62">62</xref>], CoT-Attention [<xref ref-type="bibr" rid="ref-59">59</xref>], EMA [<xref ref-type="bibr" rid="ref-60">60</xref>], and PSA [<xref ref-type="bibr" rid="ref-61">61</xref>]. As shown in <xref ref-type="table" rid="table-7">Table 7</xref>, none of these attention mechanisms outperforms the Axial-Transformer overall. For instance, on CVC-ClinicDB, the Axial-Transformer achieves the highest mDic (0.938) and mIoU (0.892), while PSA and CoT-Attention fall slightly short with mDic values of 0.936 and 0.932, and mIoU values of 0.889 and 0.887, respectively. Similarly, on the CVC-T dataset, the Axial-Transformer maintains its lead with mDic and mIoU scores of 0.897 and 0.833, respectively. EMA achieves a marginally higher mIoU on Kvasir-SEG (0.869 vs. 0.866), and PSA achieves marginally higher mDic and mIoU on ETIS-LaribPolypDB (0.795 and 0.719 vs. 0.791 and 0.718), but these gains are isolated and do not match the overall performance of the Axial-Transformer.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Quantitative results of using different widely-used attention mechanisms after MHFRs</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Metric</th>
<th align="center">Axial-transformer<break/> [<xref ref-type="bibr" rid="ref-41">41</xref>]</th>
<th align="center">SE-attention<break/> [<xref ref-type="bibr" rid="ref-62">62</xref>]</th>
<th align="center">CoT-attention<break/> [<xref ref-type="bibr" rid="ref-59">59</xref>]</th>
<th align="center">EMA<break/> [<xref ref-type="bibr" rid="ref-60">60</xref>]</th>
<th align="center">PSA<break/> [<xref ref-type="bibr" rid="ref-61">61</xref>]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" rowspan="2">CVC-ClinicDB</td>
<td align="center">mDic</td>
<td align="center"><bold>0.938</bold></td>
<td align="center">0.913</td>
<td align="center">0.932</td>
<td align="center">0.925</td>
<td align="center">0.936</td>
</tr>
<tr>
<td align="center">mIoU</td>
<td align="center"><bold>0.892</bold></td>
<td align="center">0.867</td>
<td align="center">0.887</td>
<td align="center">0.882</td>
<td align="center">0.889</td>
</tr>
<tr>
<td align="center" rowspan="2">Kvasir-SEG</td>
<td align="center">mDic</td>
<td align="center"><bold>0.919</bold></td>
<td align="center">0.904</td>
<td align="center">0.901</td>
<td align="center">0.918</td>
<td align="center">0.916</td>
</tr>
<tr>
<td align="center">mIoU</td>
<td align="center">0.866</td>
<td align="center">0.842</td>
<td align="center">0.848</td>
<td align="center"><bold>0.869</bold></td>
<td align="center">0.854</td>
</tr>
<tr>
<td align="center" rowspan="2">CVC-T</td>
<td align="center">mDic</td>
<td align="center"><bold>0.897</bold></td>
<td align="center">0.889</td>
<td align="center">0.872</td>
<td align="center">0.893</td>
<td align="center">0.891</td>
</tr>
<tr>
<td align="center">mIoU</td>
<td align="center"><bold>0.833</bold></td>
<td align="center">0.816</td>
<td align="center">0.804</td>
<td align="center">0.825</td>
<td align="center">0.828</td>
</tr>
<tr>
<td align="center" rowspan="2">CVC-ColonDB</td>
<td align="center">mDic</td>
<td align="center"><bold>0.811</bold></td>
<td align="center">0.798</td>
<td align="center">0.805</td>
<td align="center">0.793</td>
<td align="center">0.796</td>
</tr>
<tr>
<td align="center">mIoU</td>
<td align="center"><bold>0.731</bold></td>
<td align="center">0.713</td>
<td align="center">0.730</td>
<td align="center">0.711</td>
<td align="center">0.722</td>
</tr>
<tr>
<td align="center" rowspan="2">ETIS-LaribPolypDB</td>
<td align="center">mDic</td>
<td align="center">0.791</td>
<td align="center">0.784</td>
<td align="center">0.762</td>
<td align="center">0.787</td>
<td align="center"><bold>0.795</bold></td>
</tr>
<tr>
<td align="center">mIoU</td>
<td align="center">0.718</td>
<td align="center">0.711</td>
<td align="center">0.684</td>
<td align="center">0.715</td>
<td align="center"><bold>0.719</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The consistent underperformance of SE-Attention and CoT-Attention likely reflects their limited ability to model global dependencies. EMA and PSA perform better, but their results are not consistent across datasets. The Axial-Transformer&#x2019;s stronger ability to model long-range dependencies and spatial features is a plausible explanation for its lead, making it the most effective attention mechanism in this study.</p>
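<p>One simple way to summarize Table 7 is to count, for each attention mechanism, the number of dataset and metric pairs on which it attains the best score. The sketch below is illustrative only; it transcribes the table, and the helper name <monospace>count_wins</monospace> is a hypothetical name introduced here.</p>
<preformat>
```python
# Column order follows Table 7.
MECHANISMS = ["Axial-transformer", "SE-attention", "CoT-attention", "EMA", "PSA"]

# Scores transcribed from Table 7, keyed by (dataset, metric).
TABLE7 = {
    ("CVC-ClinicDB", "mDic"): (0.938, 0.913, 0.932, 0.925, 0.936),
    ("CVC-ClinicDB", "mIoU"): (0.892, 0.867, 0.887, 0.882, 0.889),
    ("Kvasir-SEG", "mDic"): (0.919, 0.904, 0.901, 0.918, 0.916),
    ("Kvasir-SEG", "mIoU"): (0.866, 0.842, 0.848, 0.869, 0.854),
    ("CVC-T", "mDic"): (0.897, 0.889, 0.872, 0.893, 0.891),
    ("CVC-T", "mIoU"): (0.833, 0.816, 0.804, 0.825, 0.828),
    ("CVC-ColonDB", "mDic"): (0.811, 0.798, 0.805, 0.793, 0.796),
    ("CVC-ColonDB", "mIoU"): (0.731, 0.713, 0.730, 0.711, 0.722),
    ("ETIS-LaribPolypDB", "mDic"): (0.791, 0.784, 0.762, 0.787, 0.795),
    ("ETIS-LaribPolypDB", "mIoU"): (0.718, 0.711, 0.684, 0.715, 0.719),
}


def count_wins():
    """Count how many (dataset, metric) rows each mechanism wins outright."""
    wins = {name: 0 for name in MECHANISMS}
    for scores in TABLE7.values():
        best_index = max(range(len(scores)), key=scores.__getitem__)
        wins[MECHANISMS[best_index]] += 1
    return wins


if __name__ == "__main__":
    for name, total in count_wins().items():
        print(name, total, "of", len(TABLE7))
```
</preformat>
<p>On the tabulated values, the Axial-Transformer is best on 7 of the 10 rows, PSA on 2, and EMA on 1, while SE-Attention and CoT-Attention are never best.</p>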
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we propose DMHFR, a decoder for aggregating pyramid features. The MHFRs (THFR and FHFR) perceive and fuse multiple sets of pyramid features from fine to coarse granularity, as well as the full set. Before these features are processed by the MHFRs, they pass through FcaNet to achieve better feature modeling. After the features are processed by the MHFRs, they are fed into axial transformers to capture their global dependencies. Our experimental results demonstrate that the proposed PVT-DMHFR outperforms 15 SOTA methods across five public polyp datasets, highlighting its generalization and learning capabilities. Specifically, on the seen datasets (CVC-ClinicDB and Kvasir-SEG), which are used for both training and testing to assess learning ability, PVT-DMHFR achieves mDic scores of 0.938 and 0.919, respectively. On the unseen datasets (CVC-T, CVC-ColonDB, and ETIS-LaribPolypDB), which are used to evaluate generalization, PVT-DMHFR achieves mDic scores of 0.897, 0.811, and 0.791, respectively. Furthermore, our MHFRs are versatile and can be adapted to process pyramid features in other models by adjusting their channel settings, offering potential to improve deep learning performance in various medical image segmentation tasks. Beyond medical image segmentation, the DMHFR decoder could also be applied to enhance transformer features in broader medical applications and general computer vision.</p>
</sec>
</body>
<back>
<ack><title>Acknowledgement</title>
<p>The authors thank the Xiamen Medical and Health Guidance Project and the Guangxi Key Laboratory of Machine Vision and Intelligent Control for their grant support.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the Xiamen Medical and Health Guidance Project in 2021 (No. 3502Z20214ZD1070) and by a grant from the Guangxi Key Laboratory of Machine Vision and Intelligent Control, China (No. 2023B02).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Study conception and design: Jianuo Huang; data collection: Jianuo Huang, Bohan Lai, Weiye Qiu, Caixu Xu; analysis and interpretation of results: Jianuo Huang, Bohan Lai, Weiye Qiu, Caixu Xu, Jie He; draft manuscript preparation: Jianuo Huang, Jie He. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>All datasets used in this study are publicly available.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>This study utilizes publicly available datasets, all of which have received prior ethical approval. No additional ethical approval was required for this work.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ronneberger</surname> <given-names>O</given-names></string-name>, <string-name><surname>Fischer</surname> <given-names>P</given-names></string-name>, <string-name><surname>Brox</surname> <given-names>T</given-names></string-name></person-group>. <article-title>U-Net: convolutional networks for biomedical image segmentation</article-title>. In: <conf-name>Medical Image Computing and Computer-Assisted Intervention&#x2013;MICCAI 2015: 18th International Conference; 2015 Oct 5&#x2013;9</conf-name>; <publisher-loc>Munich, Germany</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>234</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>GP</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>PraNet: parallel reverse attention network for polyp segmentation</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2020</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>263</fpage>&#x2013;<lpage>73</lpage>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Rahman Siddiquee</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Tajbakhsh</surname> <given-names>N</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>UNet&#x002B;&#x002B;: a nested U-Net architecture for medical image segmentation</article-title>. In: <conf-name>Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018</conf-name>; <year>2018</year>; <publisher-loc>Granada, Spain</publisher-loc>. p. <fpage>3</fpage>&#x2013;<lpage>11</lpage>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deep residual learning for image recognition</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2016</year>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>; p. <fpage>770</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tong</surname> <given-names>R</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Iwamoto</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>UNet 3&#x002B;: a full-scale connected UNet for medical image segmentation</article-title>. In: <conf-name>ICASSP 2020&#x2013;2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>; <year>2020</year>; <publisher-loc>Barcelona, Spain</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>1055</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Oktay</surname> <given-names>O</given-names></string-name>, <string-name><surname>Schlemper</surname> <given-names>J</given-names></string-name>, <string-name><surname>Folgoc</surname> <given-names>LL</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>M</given-names></string-name>, <string-name><surname>Heinrich</surname> <given-names>M</given-names></string-name>, <string-name><surname>Misawa</surname> <given-names>K</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention U-Net: learning where to look for the pancreas</article-title>. <comment>arXiv preprint arXiv:1804.03999. 2018</comment>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Reverse attention for salient object detection</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>; <year>2018</year>; <publisher-loc>Munich, Germany</publisher-loc>. p. <fpage>234</fpage>&#x2013;<lpage>50</lpage>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>An image is worth 16x16 words: transformers for image recognition at scale</article-title>. <comment>arXiv preprint arXiv:2010.11929. 2020</comment>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Q</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Swin-Unet: Unet-like pure transformer for medical image segmentation</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name>. p. <fpage>205</fpage>&#x2013;<lpage>18</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>TransFuse: fusing transformers and CNNs for medical image segmentation</article-title>. In: <conf-name>Medical Image Computing and Computer Assisted Intervention&#x2013;MICCAI 2021: 24th International Conference; 2021 Sep 27&#x2013;Oct 1</conf-name>; <publisher-loc>Strasbourg, France</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>14</fpage>&#x2013;<lpage>24</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Song</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>PVT v2: improved baselines with pyramid vision transformer</article-title>. <source>Comput Vis Media</source>. <year>2022</year>;<volume>8</volume>(<issue>3</issue>):<fpage>415</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s41095-022-0274-8</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Agarwal</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>MedCLIP: contrastive learning from unpaired medical images and text</article-title>. <comment>arXiv preprint arXiv:2210.10163. 2022</comment>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Song</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Pyramid vision transformer: a versatile backbone for dense prediction without convolutions</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>; p. <fpage>568</fpage>&#x2013;<lpage>78</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Anandkumar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Alvarez</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>P</given-names></string-name></person-group>. <article-title>SegFormer: simple and efficient design for semantic segmentation with transformers</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2021</year>;<volume>34</volume>:<fpage>12077</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name></person-group>. <article-title>FcaNet: frequency channel attention networks</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>; p. <fpage>783</fpage>&#x2013;<lpage>92</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sasmal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Iwahori</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bhuyan</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Kasugai</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Active contour segmentation of polyps in capsule endoscopic images</article-title>. In: <conf-name>2018 International Conference on Signals and Systems (ICSigSys)</conf-name>; <year>2018</year>; <publisher-loc>Bali, Indonesia</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>201</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xia</surname> <given-names>S</given-names></string-name>, <string-name><surname>Krishnan</surname> <given-names>SM</given-names></string-name>, <string-name><surname>Tjoa</surname> <given-names>MP</given-names></string-name>, <string-name><surname>Goh</surname> <given-names>PM</given-names></string-name></person-group>. <article-title>A novel methodology for extracting colon&#x2019;s lumen from colonoscopic images</article-title>. <source>J Syst Cybern Inform</source>. <year>2003</year>;<volume>1</volume>(<issue>2</issue>):<fpage>7</fpage>&#x2013;<lpage>12</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gross</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kennel</surname> <given-names>M</given-names></string-name>, <string-name><surname>Stehle</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wulff</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tischendorf</surname> <given-names>J</given-names></string-name>, <string-name><surname>Trautwein</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Polyp segmentation in NBI colonoscopy</article-title>. In: <conf-name>Bildverarbeitung f&#x00FC;r die Medizin 2009: Algorithmen&#x2014;Systeme&#x2014;Anwendungen Proceedings des Workshops vom 22. bis 25. M&#x00E4;rz 2009 in Heidelberg</conf-name>; <publisher-loc>Heidelberg, Germany</publisher-loc>: <publisher-name>Springer Berlin Heidelberg</publisher-name>; <year>2009</year>. p. <fpage>252</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>JH</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>YH</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>ZQ</given-names></string-name></person-group>. <article-title>Polyp-mixer: an efficient context-aware MLP-based paradigm for polyp segmentation</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2022</year>;<volume>33</volume>(<issue>1</issue>):<fpage>30</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCSVT.2022.3197643</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cai</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lyu</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Using guided self-attention with local information for polyp segmentation</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name>. p. <fpage>629</fpage>&#x2013;<lpage>38</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>XJ</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Lesion-aware dynamic kernel for polyp segmentation</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name>. p. <fpage>99</fpage>&#x2013;<lpage>109</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tomar</surname> <given-names>NK</given-names></string-name>, <string-name><surname>Jha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Bagci</surname> <given-names>U</given-names></string-name>, <string-name><surname>Ali</surname> <given-names>S</given-names></string-name></person-group>. <article-title>TGANet: text-guided attention for improved polyp segmentation</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name>. p. <fpage>151</fpage>&#x2013;<lpage>60</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Bui</surname> <given-names>NT</given-names></string-name>, <string-name><surname>Hoang</surname> <given-names>DH</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>QT</given-names></string-name>, <string-name><surname>Tran</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Le</surname> <given-names>N</given-names></string-name></person-group>. <article-title>MEGANet: multi-scale edge-guided attention network for weak boundary polyp segmentation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</conf-name>; <year>2024</year>; <publisher-loc>Waikoloa, HI, USA</publisher-loc>; p. <fpage>7985</fpage>&#x2013;<lpage>94</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Akbari</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mohrekesh</surname> <given-names>M</given-names></string-name>, <string-name><surname>Nasr-Esfahani</surname> <given-names>E</given-names></string-name>, <string-name><surname>Soroushmehr</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Karimi</surname> <given-names>N</given-names></string-name>, <string-name><surname>Samavi</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Polyp segmentation in colonoscopy images using fully convolutional network</article-title>. In: <conf-name>2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)</conf-name>; <year>2018</year>; <publisher-loc>Honolulu, HI, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>69</fpage>&#x2013;<lpage>72</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Brandao</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zisimopoulos</surname> <given-names>O</given-names></string-name>, <string-name><surname>Mazomenos</surname> <given-names>E</given-names></string-name>, <string-name><surname>Ciuti</surname> <given-names>G</given-names></string-name>, <string-name><surname>Bernal</surname> <given-names>J</given-names></string-name>, <string-name><surname>Visentini-Scarzanella</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Towards a computed-aided diagnosis system in colonoscopy: automatic polyp segmentation using convolution neural networks</article-title>. <source>J Med Robot Res</source>. <year>2018</year>;<volume>3</volume>(<issue>2</issue>):<fpage>1840002</fpage>. doi:<pub-id pub-id-type="doi">10.1142/S2424905X18400020</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Gkioxari</surname> <given-names>G</given-names></string-name>, <string-name><surname>Doll&#x00E1;r</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Mask R-CNN</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>; <year>2017</year>; <publisher-loc>Venice, Italy</publisher-loc>; p. <fpage>2961</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qadir</surname> <given-names>HA</given-names></string-name>, <string-name><surname>Shin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Solhusvik</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bergsland</surname> <given-names>J</given-names></string-name>, <string-name><surname>Aabakken</surname> <given-names>L</given-names></string-name>, <string-name><surname>Balasingham</surname> <given-names>I</given-names></string-name></person-group>. <article-title>Polyp detection and segmentation using mask R-CNN: does a deeper feature extractor CNN always perform better?</article-title> In: <conf-name>2019 13th International Symposium on Medical Information and Communication Technology (ISMICT)</conf-name>; <year>2019</year>; <publisher-loc>Oslo, Norway</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ji</surname> <given-names>GP</given-names></string-name>, <string-name><surname>Chou</surname> <given-names>YC</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jha</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Progressively normalized self-attention network for video polyp segmentation</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2021</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>142</fpage>&#x2013;<lpage>52</lpage>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Multi-frame collaboration for effective endoscopic video polyp detection via spatial-temporal feature transformation</article-title>. In: <conf-name>Medical Image Computing and Computer Assisted Intervention&#x2013;MICCAI 2021: 24th International Conference; 2021 Sep 27&#x2013;Oct 1</conf-name>; <publisher-loc>Strasbourg, France</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>302</fpage>&#x2013;<lpage>12</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>U-KAN makes strong backbone for medical image segmentation and generation</article-title>. <comment>arXiv preprint arXiv:2406.02918. 2024</comment>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>G</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Few-shot medical image segmentation using a global correlation network with discriminative embedding</article-title>. <source>Comput Biol Med</source>. <year>2022</year>;<volume>140</volume>:<fpage>105067</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compbiomed.2021.105067</pub-id>; <pub-id pub-id-type="pmid">34920364</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Banik</surname> <given-names>D</given-names></string-name>, <string-name><surname>Roy</surname> <given-names>K</given-names></string-name>, <string-name><surname>Bhattacharjee</surname> <given-names>D</given-names></string-name>, <string-name><surname>Nasipuri</surname> <given-names>M</given-names></string-name>, <string-name><surname>Krejcar</surname> <given-names>O</given-names></string-name></person-group>. <article-title>Polyp-Net: a multimodel fusion network for polyp segmentation</article-title>. <source>IEEE Trans Instrum Meas</source>. <year>2020</year>;<volume>70</volume>:<fpage>1</fpage>&#x2013;<lpage>12</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIM.2020.3015607</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>T</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>D</given-names></string-name></person-group>. <article-title>UACANet: uncertainty augmented context attention for polyp segmentation</article-title>. In: <conf-name>Proceedings of the 29th ACM International Conference on Multimedia</conf-name>; <year>2021</year>; <publisher-loc>New York, NY, USA</publisher-loc>. p. <fpage>2167</fpage>&#x2013;<lpage>75</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>SK</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Shallow attention network for polyp segmentation</article-title>. In: <conf-name>Medical Image Computing and Computer Assisted Intervention&#x2013;MICCAI 2021: 24th International Conference; 2021 Sep 27&#x2013;Oct 1</conf-name>; <publisher-loc>Strasbourg, France</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>699</fpage>&#x2013;<lpage>708</lpage>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Automatic polyp segmentation via multi-scale subtraction network</article-title>. In: <conf-name>Medical Image Computing and Computer Assisted Intervention&#x2013;MICCAI 2021: 24th International Conference; 2021 Sep 27&#x2013;Oct 1</conf-name>; <publisher-loc>Strasbourg, France</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>120</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>DC-Net: dual context network for 2D medical image segmentation</article-title>. In: <conf-name>Medical Image Computing and Computer Assisted Intervention&#x2013;MICCAI 2021: 24th International Conference; 2021 Sep 27&#x2013;Oct 1</conf-name>; <publisher-loc>Strasbourg, France</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>503</fpage>&#x2013;<lpage>13</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Su</surname> <given-names>J</given-names></string-name>, <string-name><surname>Song</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Stepwise feature fusion: local guides global</article-title>. In: <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name>. p. <fpage>110</fpage>&#x2013;<lpage>20</lpage>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sham</surname> <given-names>CW</given-names></string-name></person-group>. <article-title>HSNet: a hybrid semantic network for polyp segmentation</article-title>. <source>Comput Biol Med</source>. <year>2022</year>;<volume>150</volume>:<fpage>106173</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compbiomed.2022.106173</pub-id>; <pub-id pub-id-type="pmid">36257278</pub-id></mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Polyp-PVT: polyp segmentation with pyramid vision transformers</article-title>. <comment>arXiv preprint arXiv:2108.06932. 2021</comment>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rahman</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Marculescu</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Medical image segmentation via cascaded attention decoding</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</conf-name>; <year>2023</year>; <publisher-loc>Waikoloa, HI, USA</publisher-loc>. p. <fpage>6222</fpage>&#x2013;<lpage>31</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ho</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kalchbrenner</surname> <given-names>N</given-names></string-name>, <string-name><surname>Weissenborn</surname> <given-names>D</given-names></string-name>, <string-name><surname>Salimans</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Axial attention in multidimensional transformers</article-title>. <comment>arXiv preprint arXiv:1912.12180. 2019</comment>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Bhojanapalli</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chakrabarti</surname> <given-names>A</given-names></string-name>, <string-name><surname>Glasner</surname> <given-names>D</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Unterthiner</surname> <given-names>T</given-names></string-name>, <string-name><surname>Veit</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Understanding robustness of transformers for image classification</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>; <year>2021</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>10231</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Identity mappings in deep residual networks</article-title>. In: <conf-name>Computer Vision&#x2013;ECCV 2016: 14th European Conference; 2016 Oct 11&#x2013;14</conf-name>; <publisher-loc>Amsterdam, The Netherlands</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>630</fpage>&#x2013;<lpage>45</lpage>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bernal</surname> <given-names>J</given-names></string-name>, <string-name><surname>S&#x00E1;nchez</surname> <given-names>FJ</given-names></string-name>, <string-name><surname>Fern&#x00E1;ndez-Esparrach</surname> <given-names>G</given-names></string-name>, <string-name><surname>Gil</surname> <given-names>D</given-names></string-name>, <string-name><surname>Rodr&#x00ED;guez</surname> <given-names>C</given-names></string-name>, <string-name><surname>Vilari&#x00F1;o</surname> <given-names>F</given-names></string-name></person-group>. <article-title>WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians</article-title>. <source>Comput Med Imaging Graph</source>. <year>2015</year>;<volume>43</volume>:<fpage>99</fpage>&#x2013;<lpage>111</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compmedimag.2015.02.007</pub-id>; <pub-id pub-id-type="pmid">25863519</pub-id></mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jha</surname> <given-names>D</given-names></string-name>, <string-name><surname>Smedsrud</surname> <given-names>PH</given-names></string-name>, <string-name><surname>Riegler</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Halvorsen</surname> <given-names>P</given-names></string-name>, <string-name><surname>De Lange</surname> <given-names>T</given-names></string-name>, <string-name><surname>Johansen</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Kvasir-SEG: a segmented polyp dataset</article-title>. In: <conf-name>MultiMedia Modeling: 26th International Conference, MMM 2020; 2020 Jan 5&#x2013;8</conf-name>; <publisher-loc>Daejeon, Republic of Korea</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>451</fpage>&#x2013;<lpage>62</lpage>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>V&#x00E1;zquez</surname> <given-names>D</given-names></string-name>, <string-name><surname>Bernal</surname> <given-names>J</given-names></string-name>, <string-name><surname>S&#x00E1;nchez</surname> <given-names>FJ</given-names></string-name>, <string-name><surname>Fern&#x00E1;ndez-Esparrach</surname> <given-names>G</given-names></string-name>, <string-name><surname>L&#x00F3;pez</surname> <given-names>AM</given-names></string-name>, <string-name><surname>Romero</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A benchmark for endoluminal scene segmentation of colonoscopy images</article-title>. <source>J Healthc Eng</source>. <year>2017</year>;<volume>2017</volume>(<issue>1</issue>):<fpage>4037190</fpage>. doi:<pub-id pub-id-type="doi">10.1155/2017/4037190</pub-id>; <pub-id pub-id-type="pmid">29065595</pub-id></mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tajbakhsh</surname> <given-names>N</given-names></string-name>, <string-name><surname>Gurudu</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Automated polyp detection in colonoscopy videos using shape and context information</article-title>. <source>IEEE Trans Med Imaging</source>. <year>2015</year>;<volume>35</volume>(<issue>2</issue>):<fpage>630</fpage>&#x2013;<lpage>44</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMI.2015.2487997</pub-id>; <pub-id pub-id-type="pmid">26462083</pub-id></mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Silva</surname> <given-names>J</given-names></string-name>, <string-name><surname>Histace</surname> <given-names>A</given-names></string-name>, <string-name><surname>Romain</surname> <given-names>O</given-names></string-name>, <string-name><surname>Dray</surname> <given-names>X</given-names></string-name>, <string-name><surname>Granado</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer</article-title>. <source>Int J Comput Assist Radiol Surg</source>. <year>2014</year>;<volume>9</volume>:<fpage>283</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11548-013-0926-3</pub-id>; <pub-id pub-id-type="pmid">24037504</pub-id></mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Milletari</surname> <given-names>F</given-names></string-name>, <string-name><surname>Navab</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ahmadi</surname> <given-names>SA</given-names></string-name></person-group>. <article-title>V-Net: fully convolutional neural networks for volumetric medical image segmentation</article-title>. In: <conf-name>2016 Fourth International Conference on 3D Vision (3DV)</conf-name>; <year>2016</year>; <publisher-loc>Stanford, CA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>565</fpage>&#x2013;<lpage>71</lpage>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Margolin</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zelnik-Manor</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tal</surname> <given-names>A</given-names></string-name></person-group>. <article-title>How to evaluate foreground maps?</article-title> In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2014</year>; <publisher-loc>Columbus, OH, USA</publisher-loc>. p. <fpage>248</fpage>&#x2013;<lpage>55</lpage>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>T</given-names></string-name>, <string-name><surname>Borji</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Structure-measure: a new way to evaluate foreground maps</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>; <year>2017</year>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>4548</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>GP</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>X</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>MM</given-names></string-name></person-group>. <article-title>Cognitive vision inspired object segmentation metric and loss function</article-title>. <source>Sci Sin Informationis</source>. <year>2021</year>;<volume>6</volume>(<issue>6</issue>):<fpage>5</fpage>. doi:<pub-id pub-id-type="doi">10.1360/SSI-2020-0370</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>B</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Borji</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Enhanced-alignment measure for binary foreground map evaluation</article-title>. <comment>arXiv preprint arXiv:1805.10421. 2018</comment>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Loshchilov</surname> <given-names>I</given-names></string-name>, <string-name><surname>Hutter</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Decoupled weight decay regularization</article-title>. <comment>arXiv preprint arXiv:1711.05101. 2017</comment>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bian</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>TMF-Net: a transformer-based multiscale fusion network for surgical instrument segmentation from endoscopic images</article-title>. <source>IEEE Trans Instrum Meas</source>. <year>2022</year>;<volume>72</volume>:<fpage>1</fpage>&#x2013;<lpage>15</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIM.2022.3225922</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Duc</surname> <given-names>NT</given-names></string-name>, <string-name><surname>Oanh</surname> <given-names>NT</given-names></string-name>, <string-name><surname>Thuy</surname> <given-names>NT</given-names></string-name>, <string-name><surname>Triet</surname> <given-names>TM</given-names></string-name>, <string-name><surname>Dinh</surname> <given-names>VS</given-names></string-name></person-group>. <article-title>ColonFormer: an efficient transformer based method for colon polyp segmentation</article-title>. <source>IEEE Access</source>. <year>2022</year>;<volume>10</volume>:<fpage>80575</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2022.3195241</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Ahmad</surname> <given-names>D</given-names></string-name>, <string-name><surname>Toth</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bascom</surname> <given-names>R</given-names></string-name>, <string-name><surname>Higgins</surname> <given-names>WE</given-names></string-name></person-group>. <article-title>ESFPNet: efficient deep learning architecture for real-time lesion segmentation in autofluorescence bronchoscopic video</article-title>. In: <conf-name>Medical Imaging 2023: Biomedical Applications in Molecular, Structural, and Functional Imaging</conf-name>; <year>2023</year>; <publisher-loc>San Diego, CA, USA</publisher-loc>: <publisher-name>SPIE</publisher-name>. Vol. <volume>12468</volume>.</mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sanderson</surname> <given-names>E</given-names></string-name>, <string-name><surname>Matuszewski</surname> <given-names>BJ</given-names></string-name></person-group>. <article-title>FCN-transformer feature fusion for polyp segmentation</article-title>. In: <conf-name>Annual Conference on Medical Image Understanding and Analysis</conf-name>; <year>2022</year>; <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. p. <fpage>892</fpage>&#x2013;<lpage>907</lpage>.</mixed-citation></ref>
<ref id="ref-59"><label>[59]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Mei</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Contextual transformer networks for visual recognition</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2022</year>;<volume>45</volume>(<issue>2</issue>):<fpage>1489</fpage>&#x2013;<lpage>500</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2107.12292</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>[60]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ouyang</surname> <given-names>D</given-names></string-name>, <string-name><surname>He</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>M</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhan</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Efficient multi-scale attention module with cross-spatial learning</article-title>. In: <conf-name>ICASSP 2023&#x2013;2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>; <year>2023</year>; <publisher-loc>Rhodes Island, Greece</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-61"><label>[61]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>D</given-names></string-name></person-group>. <article-title>EPSANet: an efficient pyramid squeeze attention block on convolutional neural network</article-title>. In: <conf-name>Proceedings of the Asian Conference on Computer Vision</conf-name>; <year>2022</year>; <publisher-loc>Macau SAR, China</publisher-loc>. p. <fpage>1161</fpage>&#x2013;<lpage>77</lpage>.</mixed-citation></ref>
<ref id="ref-62"><label>[62]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Squeeze-and-excitation networks</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; <year>2018</year>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>7132</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>