<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">72550</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.072550</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Dual-Stream Framework for Landslide Segmentation with Cross-Attention Enhancement and Gated Multimodal Fusion</article-title>
<alt-title alt-title-type="left-running-head">A Dual-Stream Framework for Landslide Segmentation with Cross-Attention Enhancement and Gated Multimodal Fusion</alt-title>
<alt-title alt-title-type="right-running-head">A Dual-Stream Framework for Landslide Segmentation with Cross-Attention Enhancement and Gated Multimodal Fusion</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Islam</surname><given-names>Md Minhazul</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Yin</surname><given-names>Yunfei</given-names></name><email>yinyunfei@cqu.edu.cn</email><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Islam</surname><given-names>Md Tanvir</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Yuan</surname><given-names>Zheng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Dey</surname><given-names>Argho</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>College of Computer Science, Chongqing University</institution>, <addr-line>Chongqing, 400044</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>SUGON Industrial Control and Security Center</institution>, <addr-line>Chengdu, 610225</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yunfei Yin. Email: <email>yinyunfei@cqu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>12</day><month>1</month><year>2026</year>
</pub-date>
<volume>86</volume>
<issue>3</issue>
<elocation-id>8</elocation-id>
<history>
<date date-type="received">
<day>29</day>
<month>08</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>11</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_72550.pdf"></self-uri>
<abstract>
<p>Automatic segmentation of landslides from remote sensing imagery is challenging because traditional machine learning and early CNN-based models often fail to generalize across heterogeneous landscapes, where segmentation maps contain sparse and fragmented landslide regions under diverse geographical conditions. To address these issues, we propose a lightweight dual-stream siamese deep learning framework that integrates optical and topographical data fusion with an adaptive decoder, guided multimodal fusion, and deep supervision. The framework is built upon the synergistic combination of cross-attention, gated fusion, and sub-pixel upsampling within a unified dual-stream architecture specifically optimized for landslide segmentation, enabling efficient context modeling and robust feature exchange between modalities. The decoder captures long-range context at deeper levels using lightweight cross-attention and refines spatial details at shallower levels through attention-gated skip fusion, enabling precise boundary delineation and fewer false positives. The gated fusion further enhances multimodal integration of optical and topographical cues, and the deep supervision stabilizes training and improves generalization. Moreover, to mitigate checkerboard artifacts, a learnable sub-pixel upsampling module is devised to replace the traditional transposed convolution. Despite its compact design with fewer parameters than comparable models, the proposed model consistently outperforms state-of-the-art baselines. Experiments on two benchmark datasets, Landslide4Sense and Bijie, confirm the effectiveness of the framework. On the Bijie dataset, it achieves an F1-score of 0.9110 and an intersection over union (IoU) of 0.8839. These results highlight its potential for accurate large-scale landslide inventory mapping and real-time disaster response.
The implementation is publicly available at <ext-link ext-link-type="uri" xlink:href="https://github.com/mishaown/DiGATe-UNet-LandSlide-Segmentation">https://github.com/mishaown/DiGATe-UNet-LandSlide-Segmentation</ext-link> (accessed on 3 November 2025).</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Landslide segmentation</kwd>
<kwd>remote sensing</kwd>
<kwd>dual-stream lightweight networks</kwd>
<kwd>digital elevation model (DEM)</kwd>
<kwd>gated fusion</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62262045</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Fundamental Research Funds for the Central Universities</funding-source>
<award-id>2023CDJYGRH-YB11</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Open Funding of SUGON Industrial Control and Security Center</funding-source>
<award-id>CUIT-SICSC-2025-03</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Landslides are among the most destructive geological hazards worldwide, posing a significant threat to human lives, infrastructure, and economic stability. According to the United Nations International Strategy for Disaster Reduction (UNISDR), more than 1000 landslide-related disaster events occur annually, causing thousands of fatalities and severe economic losses [<xref ref-type="bibr" rid="ref-1">1</xref>]. Beyond immediate destruction, landslides disrupt transportation networks, damage infrastructure, and contaminate water sources, generating long-term environmental and socio-economic impacts [<xref ref-type="bibr" rid="ref-2">2</xref>]. Historical events such as the 2014 Oso landslide in Washington, which claimed 43 lives, and the 2017 Bijie County landslide in China, which caused 27 fatalities, highlight the urgent need for reliable landslide monitoring and risk assessment systems [<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<p>Traditional field surveys and manual interpretation methods, though accurate, are time-consuming and labor-intensive, making them unsuitable for large-scale or rapid response applications [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-4">4</xref>]. The emergence of high-resolution remote sensing and increased computing power has enabled more scalable solutions. Machine learning (ML) methods, including Random Forests, Support Vector Machines, and Logistic Regression, have been applied to automate landslide classification, but they often struggle with feature selection, generalization, and scalability [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>]. Deep learning (DL), particularly convolutional neural networks (CNNs), has advanced the field by learning rich hierarchical representations from complex imagery [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>]. Multi-source remote sensing data, combining optical, synthetic aperture radar (SAR), and DEM, further supports comprehensive analysis [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>Despite these advances, challenges remain, as illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>: many deep learning studies focus primarily on detection and discrimination between landslide-affected and non-landslide areas, yet often underperform in segmentation, where precise boundary delineation across heterogeneous shapes and scales is required.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Key challenges in landslide segmentation: large variation in shape and scale, confusion with visually similar non-landslide regions (e.g., bare soil, roads, erosion), and the difficulty of detecting small or fragmented landslide areas, sourced from <italic>Bijie dataset</italic> (top) and <italic>LandSlide4Sense dataset</italic> (bottom)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-1.tif"/>
</fig>
<p>Models trained primarily on visually distinct and recent landslides often fail on older, weathered, or vegetation-covered cases [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-4">4</xref>]. Architectures such as Faster Region-based Convolutional Neural Networks (Faster R-CNN) that utilize convolutional backbones with pretrained weights also encounter optimization issues such as vanishing gradients, limiting their ability to model complex patterns [<xref ref-type="bibr" rid="ref-5">5</xref>]. Detection frameworks additionally struggle with small-scale and fragmented landslide regions, where limited pixel representation leads to missed features [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. In regions with similar spectral signatures (e.g., bare soil, terraced fields, erosion zones), segmentation models frequently produce false positives [<xref ref-type="bibr" rid="ref-12">12</xref>]. Several open-source landslide inventories with reliable annotation techniques have been released in recent years (e.g., [<xref ref-type="bibr" rid="ref-13">13</xref>&#x2013;<xref ref-type="bibr" rid="ref-15">15</xref>]), providing broader access to annotated data for research and benchmarking. However, many publicly available datasets remain limited in terms of spatial diversity, sensor modality, and annotation consistency, making it challenging to train models that generalize across heterogeneous geomorphological environments.</p>
<p>Landslides occur in various forms; in this study, we focus mainly on shallow slope failures, including rock slides, rock falls, and small debris slides typical of rainfall-triggered mountainous terrains. Our emphasis is on advancing landslide segmentation, while acknowledging its close relationship with landslide detection. The objective of this work is to design a deep learning framework that not only distinguishes landslide-affected areas but also achieves pixel-level segmentation with accurate boundary delineation, even for small and visually complex regions. The main contributions of this study are summarized as follows:
<list list-type="bullet">
<list-item>
<p>We propose an adaptive dual-decoder architecture that processes modality-specific information in parallel streams. A lightweight cross-attention mechanism captures global context in deeper layers, while attention gates refine local spatial details in shallower layers, enabling robust and complementary feature exchange between modalities.</p></list-item>
<list-item>
<p>We introduce a guided multimodal fusion strategy that explicitly combines optical and topographic features at both encoder and decoder stages. The gated fusion module learns adaptive weighting with confidence regularization, ensuring stable and balanced feature integration across modalities.</p></list-item>
<list-item>
<p>We replace conventional transposed convolutions with learnable sub-pixel upsampling, mitigating checkerboard artifacts and improving high-resolution reconstruction quality with reduced computational overhead.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Studies on Landslide Segmentation</title>
<p>Landslides are highly destructive hazards that require rapid and accurate mapping for effective disaster mitigation. With the advent of deep learning, convolutional neural networks (CNNs) and their extensions have become the dominant methodology for landslide segmentation analysis. Recent studies increasingly formulate the task as semantic segmentation, where each pixel in high-resolution optical imagery (typically 0.5&#x2013;10 m spatial resolution from satellites such as WorldView, PlanetScope, Sentinel-2, or GaoFen) is classified as either landslide or non-landslide.</p>
<p>This section synthesizes prior research on landslide segmentation by organizing representative works according to four key technical challenges identified in recent literature: (1) detecting visually blurred or relic landslides, (2) integrating heterogeneous multi-source data, (3) coping with limited annotated datasets, and (4) improving segmentation accuracy. This organization reflects the main difficulties addressed by our proposed framework and provides context for the design choices discussed in <xref ref-type="sec" rid="s3">Section 3</xref>.</p>
<p>Despite notable progress, landslide segmentation research continues to face these core challenges. The following subsections summarize representative studies corresponding to each aspect.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Visually Blurred (Relic) Landslides</title>
<p>Inactive or relic landslides are often difficult to identify in optical imagery because their visual characteristics depend on multiple factors such as the time elapsed since failure, vegetation regrowth, surface erosion, and local climatic conditions, which can collectively lead to low-contrast or blurred boundaries in visible-band imagery [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-4">4</xref>]. To address this, multi-stream and attention-based segmentation networks have been proposed. For example, feature-fusion segmentation network (FFS-Net) combines optical texture with DEM-derived topographic cues (i.e., elevation-based terrain attributes such as slope, aspect, and curvature), significantly improving the segmentation of blurred landslides over CNN-based baselines [<xref ref-type="bibr" rid="ref-3">3</xref>]. Similarly, hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) introduces hyper-pixel contrastive learning to enhance feature representation in small or ambiguous landslide patches [<xref ref-type="bibr" rid="ref-16">16</xref>]. Recently, Hu et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] proposed a cross-attention landslide detector (CALandDet) framework that captures global context for improved segmentation of relic landslides, outperforming CNN-based baselines.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Integrating Multi-Source Data</title>
<p>The fusion of multi-modal inputs such as RGB (Red-Green-Blue), DEM, and slope layers has consistently improved landslide delineation. DEMs contribute elevation gradients and slope breaks that can help clarify landslide boundaries obscured in spectral imagery, although the slope information accurately reflects landslide morphology only when the DEM is acquired after the failure event. Studies using dual-stream networks, such as the gated dual-stream convolutional neural network (GDSNet), demonstrate that gated feature fusion of RGB and DEM inputs achieves higher detection accuracy than single-modality models [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>]. The Landslide4Sense benchmark further validates this, showing that networks incorporating DEM and slope layers outperform those using only Sentinel-2 optical data across multiple global test sites [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>]. Other approaches combine mathematical morphology with DEM and orthophoto data to extract reliable landslide polygons [<xref ref-type="bibr" rid="ref-19">19</xref>]. More recently, Wu et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] proposed a hybrid CNN&#x2013;Transformer fusion network that integrates DEM with optical features, achieving superior accuracy on multi-regional benchmarks.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Handling Small Datasets</title>
<p>The scarcity of pixel-wise labeled landslide data presents a critical bottleneck for developing and validating deep learning models. Supervised segmentation networks rely on accurately annotated masks to optimize loss functions, learn discriminative spatial features, and evaluate model performance across heterogeneous terrain conditions. Insufficient labeled data often lead to overfitting and poor generalization, highlighting the need for large, well-annotated datasets to advance reliable and transferable landslide mapping. Manual delineation of landslides is costly and time-consuming, leading to limited training samples. To mitigate this, researchers employ data augmentation, transfer learning, and ensemble methods. For example, Faster R-CNN with aggressive augmentation improved detection recall under limited training data [<xref ref-type="bibr" rid="ref-5">5</xref>]. Transfer learning from large-scale image datasets (e.g., ImageNet) consistently boosts performance; Chandra et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] demonstrated that U-Net with ResNet backbones achieved near-perfect landslide detection accuracy, far surpassing models trained from scratch. Beyond supervised learning, unsupervised and weakly supervised strategies such as multiscale adaptation and contrastive learning are emerging to reduce reliance on extensive ground truth [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Improving Segmentation Accuracy</title>
<p>Recent architectures enhance accuracy through attention mechanisms, multi-scale feature fusion, and advanced decoder designs. FFS-Net integrates multiscale channel attention to balance fine textures with semantic context [<xref ref-type="bibr" rid="ref-3">3</xref>], while the method proposed by Lu et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] applies lightweight multi-scale fusion to boost pixel-level precision. Advanced U-Net variants incorporating multi-sensor inputs and refined loss functions consistently outperform vanilla U-Net baselines [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>]. Ensemble strategies and hybrid CNN-Transformer models are also being explored for boundary refinement and context modeling [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-21">21</xref>]. The Landslide4Sense benchmark confirms that specialized fusion-based architectures can yield IoU and F1-score improvements of 5&#x2013;10% compared to earlier CNNs [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>]. These developments underscore the importance of multi-branch, attention-augmented, and transformer-based designs for advancing landslide segmentation performance.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methods</title>
<p>This section describes the proposed dual-stream architecture for landslide segmentation from multimodal satellite imagery.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Overview</title>
<p>The framework adopts a dual-stream encoder&#x2013;decoder design to exploit complementary optical and topographic features. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, two geographically co-registered inputs, Stream A (optical RGB) and Stream B (DEM), are processed by a siamese EfficientNet-B4 backbone [<xref ref-type="bibr" rid="ref-22">22</xref>] with shared weights for consistent feature extraction. Each encoder produces five hierarchical feature maps <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> capturing fine-to-semantic representations.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Architecture of the proposed dual-stream model for the landslide segmentation</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-2.tif"/>
</fig>
<p>Mid-level features <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from Stream B are fused with optical counterparts through a GateFuse module, strengthening semantic alignment while preserving modality-specific shallow layer details <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Two adaptive decoders progressively reconstruct high-resolution representations via cascaded TransUp and UpFlex modules. Decoder A operates on fused features for cross-modal representations, while Decoder B reconstructs Stream B independently to retain topographic specificity. Intermediate outputs <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> undergo late-stage gated fusion for enhanced boundary precision.</p>
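The exact form of the GateFuse module is not given above; as a minimal sketch, assuming a per-pixel sigmoid gate computed from both streams (the function name <monospace>gate_fuse</monospace> and the scalar weights <monospace>wa</monospace>, <monospace>wb</monospace>, <monospace>bias</monospace> are hypothetical, not the authors' implementation), gated fusion of an optical feature map <monospace>a</monospace> and a topographic feature map <monospace>b</monospace> could look like:

```python
import math

def gate_fuse(a, b, wa, wb, bias):
    """Toy per-pixel gated fusion of two single-channel feature maps.

    g = sigmoid(wa*a + wb*b + bias); fused = g*a + (1-g)*b, so the learned
    gate adaptively weights the optical stream (a) against the topographic
    stream (b) at every spatial position. Weights here are hypothetical.
    """
    fused = []
    for row_a, row_b in zip(a, b):
        row = []
        for va, vb in zip(row_a, row_b):
            g = 1.0 / (1.0 + math.exp(-(wa * va + wb * vb + bias)))
            row.append(g * va + (1.0 - g) * vb)
        fused.append(row)
    return fused
```

With a strongly positive bias the gate saturates toward the optical stream, and with a strongly negative bias toward the topographic stream, which illustrates how the gate can shift confidence between modalities.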
<p>The fused decoder output is upsampled via SubPixelUp and projected through <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution to obtain the main prediction <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>main</mml:mtext></mml:mrow></mml:msub><mml:mspace width="negativethinmathspace" /><mml:mo>&#x2208;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Two auxiliary predictions <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>aux2</mml:mtext></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mtext>aux3</mml:mtext></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from intermediate stages provide deep supervision, improving gradient flow and training stability. The architecture integrates gated fusion at encoder and decoder levels, attention-guided reconstruction, and multi-scale supervision for accurate landslide segmentation.</p>
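The deep-supervision objective is described only qualitatively above. A minimal sketch, assuming per-pixel binary cross-entropy on the main and auxiliary heads and an auxiliary weight <monospace>lam</monospace> whose value is assumed rather than stated in the text:

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities/labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def deep_supervision_loss(main, aux2, aux3, target, lam=0.4):
    """Total loss = BCE(main) + lam * (BCE(aux2) + BCE(aux3)).

    The auxiliary predictions supervise intermediate decoder stages so that
    gradients reach deeper layers directly; lam is a hypothetical weight.
    """
    return bce(main, target) + lam * (bce(aux2, target) + bce(aux3, target))
```

Perfect predictions on all three heads drive the total loss toward zero, while errors at any head contribute gradient signal, which is the stabilizing effect deep supervision is meant to provide.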
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Features Extracted from the Pretrained Network</title>
<p>Both input streams are processed by the same shared backbone, pretrained on ImageNet. Each encoder extracts a five-level hierarchy of multi-scale feature maps, capturing fine textures at shallow layers and semantic context at deeper layers. Intermediate features from both streams are retained as skip connections to guide the decoders during reconstruction. Stream A features <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> feed the fused decoder branch, while Stream B features <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> support the modality-specific decoder. This shared pretrained backbone accelerates convergence and stabilizes training on limited landslide samples.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Adaptive Decoder</title>
<p>The adaptive decoder comprises three complementary modules, SubPixelUp, TransUp, and UpFlex, which collectively address global alignment, selective fusion, and efficient upsampling. Unlike prior architectures that apply upsampling and attention sequentially, this unified design integrates these mechanisms coherently. TransUp performs sub-pixel upsampling with lightweight cross-attention for global context alignment between decoder and skip features. The attention mechanism is applied directly across the skip-feature space rather than through flattened spatial tokens, enabling modality- and depth-aware fusion without excessive computational overhead. UpFlex extends classical U-Net skip connections by integrating sub-pixel upsampling with attention-gated fusion for spatial refinement. Its flexible design operates efficiently across multiple decoder scales without parameter redundancy, adaptively emphasizing salient encoder features while suppressing irrelevant noise. SubPixelUp provides the core upsampling operation, ensuring artifact-free resolution recovery with reduced computational overhead compared to transposed convolutions.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>SubPixelUp</title>
<p>The SubPixelUp module performs learnable upsampling by transforming channel information into spatial resolution through pixel-shuffle [<xref ref-type="bibr" rid="ref-23">23</xref>]. Given input feature <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, desired output channels <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, and upscaling factor <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, a pointwise convolution expands the channel capacity by <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msup><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>, producing <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>u</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, as illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. 
After Layer Normalization and Rectified Linear Unit (ReLU) activation, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mtext>ReLU</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mtext>LN</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, pixel-shuffle rearranges channels into spatial detail:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>PixelShuffle</mml:mtext></mml:mrow><mml:mi>r</mml:mi></mml:msub><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mi>H</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>This design enables stable resolution recovery while preserving spatial features, serving as the core upsampling unit in both <italic>TransUp</italic> and <italic>UpFlex</italic>.</p>
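The channel-expand, normalize, pixel-shuffle pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' code: module and variable names are ours, and the channel-wise normalization (implemented here via `GroupNorm(1, C)`) is an assumption about how the paper's Layer Normalization is applied to 2D feature maps.

```python
import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    """Sketch of SubPixelUp: a 1x1 conv expands channels by r^2, then
    pixel-shuffle rearranges channels into an r-times larger spatial grid."""
    def __init__(self, c_in: int, c_out: int, r: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(c_in, c_out * r * r, kernel_size=1)  # pointwise expansion
        self.norm = nn.GroupNorm(1, c_out * r * r)  # channel-wise LN stand-in (assumption)
        self.shuffle = nn.PixelShuffle(r)           # (C*r^2, H, W) -> (C, rH, rW)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        u = torch.relu(self.norm(self.expand(z)))   # u_tilde = ReLU(LN(u))
        return self.shuffle(u)                      # Eq. (1)

up = SubPixelUp(c_in=64, c_out=32, r=2)
y = up(torch.randn(1, 64, 16, 16))
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

Because the upsampling weights live in the pointwise convolution, the rearrangement itself is parameter-free and avoids the checkerboard artifacts of transposed convolutions.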
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>TransUp</title>
<p>The TransUp module integrates sub-pixel upsampling with lightweight cross-attention for efficient resolution restoration and context alignment. As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, decoder feature <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>d</mml:mi></mml:math></inline-formula> is upsampled by <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> (<xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>) to produce <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> aligned with encoder skip feature <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>s</mml:mi></mml:math></inline-formula>. Both features are projected through <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutions to form queries <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mtext>Conv</mml:mtext><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and key&#x2013;value pairs <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mtext>Conv</mml:mtext><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. 
Multi-head cross-attention followed by residual multilayer perceptron (MLP) refinement produces the updated query <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>ReLU</mml:mtext></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">(</mml:mo></mml:mrow></mml:mstyle><mml:mrow><mml:mtext>LN2d</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>LN</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>Q</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">)</mml:mo></mml:mrow></mml:mstyle></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The proposed TransUp module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-3.tif"/>
</fig>
<p>This compact design achieves artifact-free upsampling while enhancing semantic consistency between decoder and skip features without excessive computational overhead.</p>
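The query/key-value cross-attention step of TransUp can be sketched with standard PyTorch components. This is an illustrative reconstruction under stated assumptions: the head count, hidden width of the MLP, and the flatten-to-sequence layout are ours, and the final 2D layer normalization of Eq. (2) is omitted for brevity.

```python
import torch
import torch.nn as nn

class TransUpAttention(nn.Module):
    """Sketch of TransUp's cross-attention: the upsampled decoder feature
    queries the encoder skip feature, then a residual MLP refines the result."""
    def __init__(self, c: int, heads: int = 4):
        super().__init__()
        self.q_proj = nn.Conv2d(c, c, 1)   # Q = Conv1x1(d_up)
        self.kv_proj = nn.Conv2d(c, c, 1)  # K, V = Conv1x1(s)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.mlp = nn.Sequential(           # MLP(LN(Q_tilde)) with assumed 2x hidden width
            nn.LayerNorm(c), nn.Linear(c, c * 2), nn.GELU(), nn.Linear(c * 2, c))
        self.out = nn.Conv2d(c, c, 3, padding=1)  # Conv3x3 of Eq. (2)

    def forward(self, d_up: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        b, c, h, w = d_up.shape
        q = self.q_proj(d_up).flatten(2).transpose(1, 2)   # (B, HW, C)
        kv = self.kv_proj(s).flatten(2).transpose(1, 2)
        q_tilde, _ = self.attn(q, kv, kv)                  # multi-head cross-attention
        q_tilde = q_tilde + self.mlp(q_tilde)              # residual MLP refinement
        q_tilde = q_tilde.transpose(1, 2).reshape(b, c, h, w)
        return torch.relu(self.out(q_tilde))               # LN2d omitted in this sketch
```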
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>UpFlex</title>
<p>The UpFlex module integrates learnable upsampling, attention-gated skip fusion, and local convolutional refinement. As illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, decoder feature <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>d</mml:mi></mml:math></inline-formula> is upsampled with <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> following <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, producing <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext>SubPixelUp</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> aligned with skip feature <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>s</mml:mi></mml:math></inline-formula>. An attention gate modulates the skip connection: <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">(</mml:mo></mml:mrow></mml:mstyle><mml:mtext>LN</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mtext>LN</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula>, where <inline-formula id="ieqn-31"><mml:math 
id="mml-ieqn-31"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is sigmoid activation and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>W</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>W</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:math></inline-formula> are <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutions. The gated feature <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:mover><mml:mi>s</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi>s</mml:mi></mml:math></inline-formula> is concatenated with <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> and refined through successive <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolutions:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>DoubleConv</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">(</mml:mo></mml:mrow></mml:mstyle><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mover><mml:mi>s</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2191;</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">)</mml:mo></mml:mrow></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The proposed UpFlex module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-4.tif"/>
</fig>
<p>This selective fusion enhances spatial precision and boundary delineation, improving segmentation accuracy.</p>
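The attention gate that modulates the skip connection can be sketched as below. Assumptions are flagged in comments: we realize the gate as a single-channel spatial mask broadcast over channels, and again stand in for Layer Normalization with `GroupNorm(1, C)`; the paper does not pin down either detail.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Sketch of UpFlex's gate: alpha = sigmoid(LN(W_g * d_up) + LN(W_x * s)),
    then s_tilde = alpha (*) s. W_g, W_x are 1x1 convolutions."""
    def __init__(self, c: int):
        super().__init__()
        self.w_g = nn.Conv2d(c, 1, 1)   # gate branch from upsampled decoder feature
        self.w_x = nn.Conv2d(c, 1, 1)   # gate branch from skip feature
        self.ln_g = nn.GroupNorm(1, 1)  # channel-wise LN stand-in (assumption)
        self.ln_x = nn.GroupNorm(1, 1)

    def forward(self, d_up: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.ln_g(self.w_g(d_up)) + self.ln_x(self.w_x(s)))
        return alpha * s  # gated skip feature, later concatenated with d_up
```

The gated output would then be concatenated with the upsampled decoder feature and passed through the two 3&#x00D7;3 convolutions of Eq. (3).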
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Guided Feature Fusion</title>
<p>The proposed fusion design is inspired by the gated multimodal unit introduced in [<xref ref-type="bibr" rid="ref-24">24</xref>] and adapted here for pixel-level feature integration. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the GateFuse module adaptively merges complementary information from the two input streams or their decoder branches. Here, <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>a</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>b</mml:mi></mml:math></inline-formula> represent feature maps from different modalities, each of dimension <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. A spatial attention mask <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is generated as <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">(</mml:mo></mml:mrow></mml:mstyle><mml:msub><mml:mtext>Conv</mml:mtext><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>a</mml:mi><mml:mo>;</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula>, where <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mo 
stretchy="false">[</mml:mo><mml:mi>a</mml:mi><mml:mo>;</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> denotes channel-wise concatenation and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the sigmoid activation. The fused feature is then computed as
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>f</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi>a</mml:mi><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> indicates element-wise multiplication. To promote confident fusion decisions, a regularization term is added:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>which penalizes uncertain gating (<inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2248;</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>) and stabilizes multimodal learning.</p>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Loss Function</title>
<p>To address severe class imbalance and guide confident multimodal fusion, the network employs a compound loss combining Tversky segmentation loss [<xref ref-type="bibr" rid="ref-25">25</xref>] with the gating regularization term in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>. This formulation stabilizes training, improves recall of sparse landslide pixels, and enforces decisive gating across fusion stages. Following the deep supervision strategy in [<xref ref-type="bibr" rid="ref-26">26</xref>], the model generates a main prediction <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mtext>main</mml:mtext></mml:mrow></mml:msup></mml:math></inline-formula> and two auxiliary outputs <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mtext>aux2</mml:mtext></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mtext>aux3</mml:mtext></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from intermediate decoder stages. The segmentation loss is then formulated as
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>tversky</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>main</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>aux2</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>aux3</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>&#x2113;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the pixel-wise Tversky loss with hyperparameters <inline-formula 
id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.3</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.7</mml:mn></mml:math></inline-formula> to emphasize recall over precision, particularly beneficial for detecting sparse landslide regions. To further stabilize the gating mechanism and prevent unreliable fusion weights, a regularization term with weight <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is incorporated, yielding the total objective function
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>tversky</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:msub><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> denotes the gating regularization at fusion stage <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>j</mml:mi></mml:math></inline-formula>. This composite loss effectively balances segmentation accuracy with stable multimodal integration throughout the network hierarchy.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>In this section, we present the datasets, evaluation metrics, baseline models, and implementation details of our proposed method. We then evaluate performance on two real-world datasets and finally conduct ablation studies.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets and Evaluation Metrics</title>
<p>While large-scale datasets such as the Chinese Academy of Sciences (CAS) Landslide Dataset [<xref ref-type="bibr" rid="ref-27">27</xref>] provide multi-sensor coverage across diverse terrains, co-registered multi-modal datasets remain limited [<xref ref-type="bibr" rid="ref-28">28</xref>]. Optical data suffer from cloud cover and daylight constraints, whereas SAR can penetrate clouds [<xref ref-type="bibr" rid="ref-29">29</xref>], yet few datasets provide co-registered SAR or DEM for effective multi-modal fusion.</p>
<p>The <italic>Bijie dataset</italic> [<xref ref-type="bibr" rid="ref-30">30</xref>] is an optical benchmark from Bijie City, Guizhou Province, China, covering 26,853 <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msup><mml:mi>km</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula> in the Tibetan Plateau transitional zone (altitude 457&#x2013;2900 m). This landslide-prone region features steep terrain, unstable geology, and heavy rainfall (849&#x2013;1399 mm annually). The dataset comprises 770 positive landslide samples (rock falls, rock slides, debris slides) and 2003 non-landslide samples from TripleSat imagery (May&#x2013;August 2018, 0.8 m resolution). Each sample includes RGB imagery, manually delineated masks verified through field surveys, and DEM tiles (2 m elevation accuracy). Non-landslide samples represent diverse backgrounds (mountains, villages, rivers, forests, agricultural lands). Despite its value as a benchmark, coverage is limited to one region and acquisition period.</p>
<p>The <italic>LandSlide4Sense dataset</italic> [<xref ref-type="bibr" rid="ref-8">8</xref>] is a global benchmark comprising Sentinel-2 multispectral imagery (14 bands at 10 m resolution) and DEM-derived features from the Advanced Land Observing Satellite Phased Array type L-band Synthetic Aperture Radar (ALOS PALSAR). It provides 3799 training, 245 validation, and 800 test patches (<inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mn>128</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>128</mml:mn></mml:math></inline-formula> pixels) with pixel-wise masks for training. Spanning multiple global regions (2015&#x2013;2021), it captures shallow translational slides and debris flows caused by rainfall, erosion, or anthropogenic disturbances. While offering greater environmental diversity than site-specific datasets, its 10 m resolution may miss small landslides, and the patch-based design limits contextual information.</p>
<p>Landslide segmentation is formulated as binary classification with strong class imbalance. Performance is evaluated using four pixel-level metrics:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>F1</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>IoU</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where precision and recall measure accuracy and completeness, while F1 and IoU quantify overall segmentation quality. Two image-level metrics assess threshold-independent robustness: AUROC evaluates ranking consistency, and AUPRC summarizes precision&#x2013;recall trade-offs:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mtext>AUROC</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mtext>TPR</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mrow><mml:mtext>FPR</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>AUPRC</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>d</mml:mi><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Together, these metrics provide comprehensive evaluation of pixel-wise accuracy and model reliability under class imbalance.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Baseline Methods</title>
<p>To ensure a comprehensive and fair evaluation, we compared the proposed model with a diverse set of segmentation architectures representing the major evolutionary trends in semantic segmentation for landslides. The selected baselines include both classical convolutional networks and recent attention or transformer-based designs that have been widely adopted in remote sensing and landslide mapping. This diversity allows us to assess the relative contribution of multimodal fusion, attention mechanisms, and adaptive decoding within a unified benchmark.</p>
<p>Among convolutional encoder&#x2013;decoder networks, U-Net [<xref ref-type="bibr" rid="ref-2">2</xref>], DeepLabv3&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>], and LinkNet [<xref ref-type="bibr" rid="ref-31">31</xref>] serve as foundational CNN baselines known for their efficiency and strong localization performance. Dual-Stream U-Net [<xref ref-type="bibr" rid="ref-32">32</xref>,<xref ref-type="bibr" rid="ref-33">33</xref>] extends this paradigm to multimodal and multi-temporal settings, providing a reference for assessing the benefit of dual-stream fusion. More recent variants, such as the residual multi-head attention U-Net (RMAU-Net) [<xref ref-type="bibr" rid="ref-34">34</xref>] and the dilated convolution and EMA attention with pixel-attention-guided U-Net (DEP-UNet) [<xref ref-type="bibr" rid="ref-35">35</xref>], incorporate residual and dense connections, multi-head attention, and pyramid pooling to enhance multi-scale context and mitigate class imbalance. Together, these architectures establish a strong baseline for evaluating improvements introduced by our adaptive decoder and gated fusion mechanisms.</p>
<p>We further include state-of-the-art attention and transformer-based models to represent the current frontier of landslide segmentation. ShapeFormer [<xref ref-type="bibr" rid="ref-21">21</xref>] leverages vision transformers and shape priors for boundary-aware segmentation, while the enhanced multi-scale residual high-resolution network (EMR-HRNet) [<xref ref-type="bibr" rid="ref-11">11</xref>] fuses multi-resolution features through an efficient refinement module built upon HRNet. The global information extraction and multi-scale feature fusion segmentation network (GMNet) [<xref ref-type="bibr" rid="ref-36">36</xref>] adopts a multi-branch design combining global, local, and morphological features via a gated fusion mechanism. Although these models capture richer context through self-attention and multi-scale fusion, they typically operate on single-modality optical data and lack explicit cross-stream gating or deep supervision. Our framework complements these approaches by unifying attention, multimodal fusion, and adaptive decoding into a lightweight dual-stream architecture specifically tailored for landslide segmentation.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Dataset Preprocessing</title>
<p>For the Bijie dataset, the landslide and non-landslide samples were merged and split into 70%, 20%, and 10% subsets for training, validation, and testing, yielding 1941, 554, and 278 samples, respectively. For the Landslide4Sense dataset, we adopted the official split of 3799 training, 245 validation, and 800 test images as defined by the dataset authors. All images were resized to <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mn>256</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>256</mml:mn></mml:math></inline-formula> pixels for convolutional backbones and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mn>224</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>224</mml:mn></mml:math></inline-formula> for vision transformers, with standard augmentations (random flips, Gaussian noise, and Contrast Limited Adaptive Histogram Equalization (CLAHE)) applied during training.</p>
<p>In the Bijie dataset, each patch provides RGB and DEM layers. Stream A receives the RGB composite, while Stream B uses the DEM replicated three times, forming paired six-channel inputs that preserve architectural symmetry. For the Landslide4Sense dataset, Stream A uses RGB channels, and Stream B combines normalized difference vegetation index (NDVI), slope, and DEM features. This configuration efficiently combines optical, topographic, and vegetation information, balancing performance and computational cost. The NDVI is computed from the red (<italic>R</italic>) and near-infrared (<italic>NIR</italic>) bands as:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mtext>NDVI</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>N</mml:mi><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mo>+</mml:mo><mml:mi>R</mml:mi></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula></p>
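Eq. (10) is straightforward to compute per pixel; a small epsilon (our addition, not in the paper) guards against division by zero where both bands are zero.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """NDVI per Eq. (10): (NIR - R) / (NIR + R), computed element-wise."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)
```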
<p>This configuration allows both datasets to be processed in a consistent dual-stream manner, while leveraging complementary spectral and topographic information for landslide segmentation.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Experimental Settings</title>
<p>All experiments were conducted in Python 3.10 using PyTorch 2.6.10 on Ubuntu 22.04.4 LTS, with an Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60 GHz and an NVIDIA RTX A6000 Graphics Processing Unit (GPU) with 48 GB of memory. The hyperparameters used are detailed in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Model hyperparameter values</title>
</caption>
<table>
<colgroup>
<col align="center" width="50mm"/>
<col align="center" width="50mm"/> </colgroup>
<thead>
<tr>
<th>Hyperparameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Optimizer</td>
<td>Adam</td>
</tr>
<tr>
<td>Loss function</td>
<td>Tversky loss</td>
</tr>
<tr>
<td>Pretrained weights</td>
<td>True</td>
</tr>
<tr>
<td>Encoder freeze</td>
<td>True</td>
</tr>
<tr>
<td>Learning rate</td>
<td><inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td>Weight decay</td>
<td><inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>
<tr>
<td>Batch size</td>
<td>32</td>
</tr>
<tr>
<td>Epochs</td>
<td>100</td>
</tr>
<tr>
<td>Activation functions</td>
<td>ReLU, Softmax, Sigmoid</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The Adam optimizer was employed with an initial learning rate of <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and weight decay of <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. Models were trained for 100 epochs with a batch size of 32. Tversky loss was used to address class imbalance, with sigmoid activation in the final layer for probabilistic predictions, ReLU in convolutional modules, and softmax in the cross-attention mechanism.</p>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Experimental Results and Analysis</title>
<p>This section presents quantitative and qualitative results on the Bijie and Landslide4Sense datasets. Training and validation curves are provided in <xref ref-type="sec" rid="app-1">Appendix A</xref> (<xref ref-type="fig" rid="fig-8">Figs. A1</xref> and <xref ref-type="fig" rid="fig-9">A2</xref>).</p>
<sec id="s4_5_1">
<label>4.5.1</label>
<title>Results on the Bijie Dataset</title>
<p><xref ref-type="table" rid="table-2">Table 2</xref> compares the proposed model against baseline methods on the Bijie dataset. The proposed framework achieved superior performance across all metrics. Classical CNN models exhibited strong precision but limited spatial overlap (IoU &#x003C; 0.78), while transformer-based approaches improved recall but showed inconsistent boundary delineation. Recent fusion networks (DEP-UNet, EMR-HRNet, GMNet) enhanced overall quality yet remained inferior to the proposed model.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Performance comparison on the Bijie dataset</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>Acc</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
<th>IoU</th>
</tr>
</thead>
<tbody>
<tr>
<td>UNet [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>0.9798</td>
<td>0.9013</td>
<td>0.8642</td>
<td>0.8065</td>
<td>0.7776</td>
</tr>
<tr>
<td>Dual-stream UNet [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>0.9824</td>
<td>0.8768</td>
<td>0.8877</td>
<td>0.8084</td>
<td>0.7809</td>
</tr>
<tr>
<td>LinkNet [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>0.9750</td>
<td>0.8450</td>
<td>0.8707</td>
<td>0.8577</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>DeepLabv3&#x002B; [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>&#x2013;</td>
<td>0.8443</td>
<td>0.8837</td>
<td>0.8635</td>
<td>0.7599</td>
</tr>
<tr>
<td>TransUNet [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>&#x2013;</td>
<td>0.8724</td>
<td>0.8823</td>
<td>0.8773</td>
<td>0.7815</td>
</tr>
<tr>
<td>ShapeFormer [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>&#x2013;</td>
<td>0.8674</td>
<td>0.8952</td>
<td>0.8811</td>
<td>0.7872</td>
</tr>
<tr>
<td>RMAU-NET [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>&#x2013;</td>
<td>0.7454</td>
<td>0.7414</td>
<td>0.7434</td>
<td>0.5733</td>
</tr>
<tr>
<td>DEP-UNet [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>&#x2013;</td>
<td>0.8026</td>
<td>0.8806</td>
<td>0.8398</td>
<td>0.8386</td>
</tr>
<tr>
<td>EMR-HRNet [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>0.9468</td>
<td>0.8791</td>
<td>0.9115</td>
<td>0.8950</td>
<td>0.8170</td>
</tr>
<tr>
<td>GMNet [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>&#x2013;</td>
<td>0.8982</td>
<td>0.9179</td>
<td>0.9080</td>
<td>0.8304</td>
</tr>
<tr>
<td>Proposed model</td>
<td><bold>0.9896</bold></td>
<td><bold>0.9452</bold></td>
<td><bold>0.9249</bold></td>
<td><bold>0.9110</bold></td>
<td><bold>0.8839</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn1" fn-type="other">
<p>Note: Bold numbers highlight the top-performing results across corresponding metrics.</p>
</fn>
</table-wrap-foot>
</table-wrap>
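<p>For clarity, the pixel-level metrics reported in Table 2 (accuracy, precision, recall, F1/Dice, and IoU) can all be computed from the confusion counts of the binary masks, as in this minimal reference sketch:</p>
<preformat>
```python
import numpy as np

def seg_metrics(pred, target, eps=1e-9):
    """Pixel-level accuracy, precision, recall, F1 (Dice), and IoU
    for binary segmentation masks."""
    pred = np.asarray(pred).astype(bool).ravel()
    target = np.asarray(target).astype(bool).ravel()
    tp = np.sum(np.logical_and(pred, target))
    tn = np.sum(np.logical_and(np.logical_not(pred), np.logical_not(target)))
    fp = np.sum(np.logical_and(pred, np.logical_not(target)))
    fn = np.sum(np.logical_and(np.logical_not(pred), target))
    acc = (tp + tn) / pred.size
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2.0 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return acc, precision, recall, f1, iou

pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 1, 0])
acc, p, r, f1, iou = seg_metrics(pred, target)
print(round(iou, 3))  # 1 TP, 1 FP, 1 FN out of 4 pixels -> IoU = 0.333
```
</preformat>
<p>Note that IoU is strictly smaller than F1 whenever the prediction is imperfect, which explains why models with similar F1 scores in Table 2 can still differ noticeably in IoU.</p>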
</sec>
<sec id="s4_5_2">
<label>4.5.2</label>
<title>Results on the Landslide4Sense Dataset</title>
<p><xref ref-type="table" rid="table-3">Table 3</xref> summarizes the comparative results on the Landslide4Sense dataset. The proposed model, using six input bands (RGB, NDVI, Slope, and DEM), achieved an F1 score of 0.753 and IoU of 0.677, outperforming the ResUNet baseline [<xref ref-type="bibr" rid="ref-8">8</xref>] trained with all 14 spectral bands. Compared with entries from the Landslide4Sense Competition [<xref ref-type="bibr" rid="ref-18">18</xref>], including U-Net (F1 0.598, RGB &#x002B; SWIR), Deeplabv3 (F1 0.592, NGB), and Swin Transformer (F1 0.656, RGB), our approach achieved the highest segmentation accuracy while requiring less than half the input channels of the 14-band baseline.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Performance table on the Landslide4Sense dataset</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>Bands</th>
<th>F1-Score</th>
<th>IoU</th>
</tr>
</thead>
<tbody>
<tr>
<td>UNet [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>6 (RGB &#x002B; SWIR)</td>
<td>0.598</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Deeplabv3 [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>3 (NGB)</td>
<td>0.592</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Swin transformer [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>3 (RGB)</td>
<td>0.656</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>ResUNet [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>14 (Sentinel-2 &#x002B; DEM &#x002B; Slope)</td>
<td>0.716</td>
<td>0.653</td>
</tr>
<tr>
<td>Proposed model</td>
<td><bold>6 (RGB &#x002B; NDVI &#x002B; Slope &#x002B; DEM)</bold></td>
<td><bold>0.753</bold></td>
<td><bold>0.677</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn2" fn-type="other">
<p>Note: Models differ in the number and type of spectral bands used, so the reported results should be interpreted as indicative rather than directly comparable. Bold numbers highlight the top-performing results across corresponding metrics.</p>
</fn>
</table-wrap-foot>
</table-wrap>
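<p>Among the six input bands used above, NDVI is derived rather than sensed directly. A minimal sketch of its computation follows; the Sentinel-2 band mapping (B8 as near-infrared, B4 as red) is the conventional choice and is stated here as an assumption, not a detail reported in the dataset description.</p>
<preformat>
```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index from NIR and red
    reflectances; eps guards against division by zero."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Vegetated pixels reflect strongly in NIR, pushing NDVI toward 1;
# bare or failed slopes push NDVI toward 0 or below.
print(np.round(ndvi([0.6, 0.5], [0.2, 0.5]), 3))
```
</preformat>
<p>Because landslides strip vegetation, low NDVI is a strong complementary cue to the topographic Slope and DEM channels.</p>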
</sec>
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Ablation Study</title>
<p>We conducted ablation experiments to evaluate the contribution of individual components of the proposed framework, including the choice of backbone network, the effect of fusion strategies, the role of decoder components, and the impact of deep supervision. All experiments were performed on the Bijie dataset under identical settings to ensure a fair comparison.</p>
<sec id="s4_6_1">
<label>4.6.1</label>
<title>Effect of Pretrained Backbones</title>
<p>Several encoders pretrained on ImageNet were evaluated to assess the impact of backbone selection on segmentation performance. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> illustrates the trade-off between F1 score, IoU, and model parameters.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Influence of different backbones on segmentation performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-5.tif"/>
</fig>
<p>Among CNN-based backbones, ResNet50 and ResNet101 achieved comparable performance (F1 <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mo>&#x2248;</mml:mo></mml:math></inline-formula> 0.898, IoU <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mo>&#x2248;</mml:mo></mml:math></inline-formula> 0.868), while DenseNet121 yielded slightly lower scores (F1 0.8891, IoU 0.8586) with fewer parameters. MobileNetV3-Small exhibited the weakest performance (F1 0.8080, IoU 0.7766). EfficientNet-B4 achieved the highest performance (F1 0.9110, IoU 0.8839) with only 1.24M trainable parameters. The transformer-based ViT-Base (Patch16-224) backbone used 39.10M trainable parameters yet yielded lower accuracy (F1 0.8730, IoU 0.8413).</p>
<p>Detailed computational efficiency metrics, including GFLOPs, FPS, and memory usage, are provided in <xref ref-type="sec" rid="app-2">Appendix B</xref> (<xref ref-type="table" rid="table-6">Table A1</xref>).</p>
</sec>
<sec id="s4_6_2">
<label>4.6.2</label>
<title>Effect of Fusion Strategies</title>
<p>We analyzed the impact of different fusion strategies at both pixel-level (segmentation) and image-level (presence detection) on 278 test images (77 positive, 201 negative). <xref ref-type="table" rid="table-4">Table 4</xref> compares variants from a single-stream channel fusion baseline to progressively adding late fusion, early fusion at selected stages, and multi-stage fusion.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comparison of different fusion strategies at pixel level (segmentation) and image level (presence detection)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th rowspan="2">Fusion strategy</th>
<th colspan="4">Pixel-level segmentation</th>
<th colspan="4">Image-level presence detection</th>
</tr>
<tr>
<th>Acc</th>
<th>Recall</th>
<th>Precision</th>
<th>F1 (Dice)/IoU</th>
<th>AUROC</th>
<th>AUPRC</th>
<th>Best F1</th>
<th>Threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-stream channel fusion</td>
<td>0.9867</td>
<td>0.9154</td>
<td>0.9134</td>
<td>0.8837/0.8522</td>
<td>0.9485</td>
<td>0.9330</td>
<td>0.9072</td>
<td>0.905</td>
</tr>
<tr>
<td>Only late fusion</td>
<td>0.9859</td>
<td>0.9225</td>
<td>0.9391</td>
<td>0.9002/0.8712</td>
<td>0.9632</td>
<td>0.9628</td>
<td>0.9474</td>
<td>0.680</td>
</tr>
<tr>
<td>Early fusion (stage 3, 4) &#x002B; late fusion</td>
<td><bold>0.9896</bold></td>
<td>0.9249</td>
<td><bold>0.9452</bold></td>
<td><bold>0.9110</bold>/<bold>0.8839</bold></td>
<td>0.9636</td>
<td>0.9636</td>
<td><bold>0.9536</bold></td>
<td>0.925</td>
</tr>
<tr>
<td>Early fusion (stage 1&#x2013;5) &#x002B; late fusion</td>
<td>0.9860</td>
<td>0.9423</td>
<td>0.8958</td>
<td>0.8801/0.8508</td>
<td><bold>0.9790</bold></td>
<td>0.9672</td>
<td>0.9202</td>
<td>0.910</td>
</tr>
<tr>
<td>Fusion at all stages (early &#x002B; late)</td>
<td>0.9860</td>
<td><bold>0.9459</bold></td>
<td>0.8902</td>
<td>0.8774/0.8472</td>
<td>0.9698</td>
<td><bold>0.9684</bold></td>
<td><bold>0.9536</bold></td>
<td>0.955</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn3" fn-type="other">
<p>Note: Bold numbers highlight the top-performing results across corresponding metrics.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The single-stream baseline achieved moderate performance but exhibited weaker image-level detection. Late fusion alone improved both segmentation and detection metrics, though the low optimal threshold (0.68) indicates reduced model confidence near decision boundaries. The optimal configuration&#x2014;early fusion at stages 3 and 4 combined with late fusion&#x2014;achieved the highest segmentation performance (F1 &#x003D; 0.9110, IoU &#x003D; 0.8839) while maintaining robust image-level detection (AUROC &#x003D; 0.9636, best F1 &#x003D; 0.9536). Expanding fusion to all encoder stages degraded pixel-level precision despite competitive image-level metrics (AUROC up to 0.9790), suggesting that excessive fusion introduces redundancy and impairs spatial accuracy.</p>
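<p>The gating behavior underlying these fusion variants can be sketched in a few lines: a learned gate decides, per spatial position and channel, how much each stream contributes to the fused feature map. In the sketch below, the weights stand in for a learned 1&#x00D7;1 convolution over the concatenated features and are random placeholders, not the trained parameters of the proposed module.</p>
<preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(opt_feat, topo_feat, w, b):
    """Gated fusion of two (C, H, W) feature maps. The gate is a
    sigmoid over a linear map of the concatenated streams, so the
    output is a per-element convex combination of the two inputs."""
    stacked = np.concatenate([opt_feat, topo_feat], axis=0)   # (2C, H, W)
    gate = sigmoid(np.einsum("kc,chw->khw", w, stacked) + b)  # (C, H, W)
    return gate * opt_feat + (1.0 - gate) * topo_feat

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
opt = rng.standard_normal((C, H, W))    # optical-stream features
topo = rng.standard_normal((C, H, W))   # topographic-stream features
w = rng.standard_normal((C, 2 * C)) * 0.1  # stands in for a 1x1 conv
b = np.zeros((C, 1, 1))
fused = gated_fuse(opt, topo, w, b)
print(fused.shape)  # (4, 8, 8)
```
</preformat>
<p>Because the gate lies in (0, 1), every fused value stays between the two stream values, which is why gated fusion blends modalities without letting either stream dominate unconditionally.</p>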
</sec>
<sec id="s4_6_3">
<label>4.6.3</label>
<title>Effect of Decoder Components</title>
<p>We conducted ablation experiments to assess the contribution of three decoder modules: TransUp, UpFlex, and GateFuse.</p>
<p><xref ref-type="table" rid="table-5">Table 5</xref> shows that UpFlex provides the largest individual improvement (F1 &#x003D; 0.90 vs. baseline 0.78), demonstrating its effectiveness for spatial refinement. TransUp with GateFuse achieves F1 &#x003D; 0.85, while combining TransUp and UpFlex yields F1 &#x003D; 0.88. The full configuration with all three modules achieves optimal performance (F1 &#x003D; 0.91), confirming their complementary contributions to segmentation accuracy.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Impact of decoder components on segmentation performance</title>
</caption>
<table>
<colgroup>
<col align="center" width="25mm"/>
<col align="center" width="25mm"/>
<col align="center" width="25mm"/>
<col align="center" width="25mm"/> </colgroup>
<thead>
<tr>
<th>TransUp</th>
<th>UpFlex</th>
<th>GateFuse</th>
<th>F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td>0.78</td>
</tr>
<tr>
<td>&#x2713;</td>
<td><inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td>&#x2713;</td>
<td>0.85</td>
</tr>
<tr>
<td><inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>0.90</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td><inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula></td>
<td>0.88</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td><bold>0.91</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
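<p>The attention-gated skip fusion that drives UpFlex's improvement can be illustrated with an additive attention gate in the style of Attention U-Net: the decoder's gating signal suppresses irrelevant encoder activations before they are merged. All weights below are random placeholders, not the trained parameters of the UpFlex module.</p>
<preformat>
```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(skip, gating, w_x, w_g, psi):
    """Additive attention gate over a skip connection: project both
    the encoder (skip) and decoder (gating) features, combine them,
    and produce a per-pixel attention map in (0, 1) that rescales
    the skip features."""
    q = relu(np.einsum("kc,chw->khw", w_x, skip) +
             np.einsum("kc,chw->khw", w_g, gating))
    alpha = sigmoid(np.einsum("c,chw->hw", psi, q))  # (H, W)
    return skip * alpha  # attention broadcast over channels

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
skip = rng.standard_normal((C, H, W))
gating = rng.standard_normal((C, H, W))
w_x = rng.standard_normal((C, C)) * 0.1
w_g = rng.standard_normal((C, C)) * 0.1
psi = rng.standard_normal(C) * 0.1
out = attention_gate(skip, gating, w_x, w_g, psi)
print(out.shape)  # (4, 8, 8)
```
</preformat>
<p>Since the attention map only attenuates (never amplifies) the skip features, it acts as a learned spatial filter on what the encoder passes to the decoder.</p>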
</sec>
<sec id="s4_6_4">
<label>4.6.4</label>
<title>Effect of Deep Supervision</title>
<p>Deep supervision [<xref ref-type="bibr" rid="ref-26">26</xref>] is a widely adopted technique in segmentation networks, proven to enhance performance by injecting auxiliary losses at intermediate decoder stages to improve gradient flow and regularization. In our study, all experiments incorporated deep supervision (<xref ref-type="sec" rid="s3_5">Section 3.5</xref>) and regularization (<xref ref-type="sec" rid="s3_4">Section 3.4</xref>) wherever gated fusion was applied.</p>
<p>Empirically, this design consistently resulted in a performance boost of 1%&#x2013;2% across metrics compared to training without deep supervision. The improvement was most notable in challenging cases with small or fragmented landslide regions, where auxiliary supervision guided the decoder to produce sharper and more consistent segmentation masks. These findings align with prior evidence that deep supervision strengthens training stability and improves generalization in encoder&#x2013;decoder networks.</p>
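<p>The auxiliary-loss scheme described above reduces to combining the final-output loss with a down-weighted average of losses computed at intermediate decoder stages. The auxiliary weight in the sketch below is an illustrative assumption, not the value used in our experiments.</p>
<preformat>
```python
def deep_supervision_loss(main_loss, aux_losses, aux_weight=0.4):
    """Total loss = final-output loss plus a down-weighted average of
    auxiliary losses from intermediate decoder stages. The weight 0.4
    is illustrative only."""
    if not aux_losses:
        return main_loss
    return main_loss + aux_weight * sum(aux_losses) / len(aux_losses)

print(round(deep_supervision_loss(0.5, [0.8, 0.6]), 2))  # 0.78
```
</preformat>
<p>Keeping the auxiliary weight below 1 ensures the intermediate heads regularize training without overriding the final prediction head.</p>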
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>The experimental evaluation highlights key aspects of the proposed framework corresponding to design choices in <xref ref-type="sec" rid="s3">Sections 3</xref> and <xref ref-type="sec" rid="s4">4</xref>.</p>
<p>The backbone analysis (<xref ref-type="sec" rid="s4_6_1">Section 4.6.1</xref>) confirms that lightweight architectures like EfficientNet-B4 achieve superior segmentation performance with fewer parameters than heavier alternatives (ResNet101, ViT), enabling efficient real-time landslide monitoring without sacrificing accuracy.</p>
<p>Fusion strategy comparisons (<xref ref-type="sec" rid="s4_6_2">Section 4.6.2</xref>) show that combining early fusion at mid-level encoder stages with gated late fusion at decoder stages maximizes complementary information while avoiding redundancy. Single-stream fusion achieved baseline performance, but staged gated fusion substantially improved both pixel-level segmentation and image-level detection. Indiscriminate fusion across all levels introduced noise and reduced spatial precision.</p>
<p>Decoder ablations (<xref ref-type="sec" rid="s4_6_3">Section 4.6.3</xref>) reveal that UpFlex provides the largest individual improvement through attention-gated skip fusion. The combination of TransUp, UpFlex, and GateFuse achieved optimal performance, validating their complementary roles in semantic alignment, spatial refinement, and adaptive fusion.</p>
<p>Deep supervision (<xref ref-type="sec" rid="s4_6_4">Section 4.6.4</xref>) consistently improved training stability and performance by 1%&#x2013;2%, particularly for small, fragmented landslides where auxiliary losses recovered finer details, aligning with established regularization benefits in encoder&#x2013;decoder networks.</p>
<p>Baseline comparisons (<xref ref-type="sec" rid="s4_5">Section 4.5</xref>) demonstrate that integrating staged fusion, adaptive decoders, and deep supervision enables our approach to outperform U-Net variants, ShapeFormer, and GMNet on the Bijie dataset. Results on Landslide4Sense confirm that optimized feature selection (RGB, NDVI, Slope, DEM) outperforms full-band or single-modality inputs, emphasizing strategic data fusion in remote sensing applications.</p>
<p><bold><italic>Limitations and Future Work</italic></bold></p>
<p>Despite strong results, illustrated in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, challenges remain for future work. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the model still struggles in scenes with complex backgrounds and fine boundaries, occasionally leading to missed or imprecise detections. Performance differences between the Bijie and Landslide4Sense datasets further indicate sensitivity to input data quality, particularly resolution. The Bijie dataset benefits from high-resolution optical and DEMs that capture detailed topographic gradients, whereas coarse DEMs may reduce boundary precision and limit model transferability, a known constraint in large-scale landslide mapping.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Qualitative results of the proposed segmentation framework across two benchmark datasets: (<bold>a</bold>) Bijie Dataset and (<bold>b</bold>) Landslide4Sense Dataset. For each dataset, the sequence displays the RGB image, annotated ground truth, predicted segmentation mask, and heatmap overlaid on the RGB image. The heatmap highlights landslide regions in red (positive) and non-landslide areas in blue (negative)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-6.tif"/>
</fig><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Examples of challenging cases from the Bijie and Landslide4Sense datasets. Each column shows the RGB image, the annotated ground truth, and the predicted segmentation mask. These samples illustrate failure cases where the model struggles in complex environments, leading to missed detections or inaccurate delineation of landslide boundaries</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-7.tif"/>
</fig>
<p>The current framework is limited to mono-temporal optical and topographic inputs. Future work will focus on integrating synthetic aperture radar (SAR) data for all-weather robustness and employing temporal fusion to capture pre- and post-event changes. Incorporating edge-aware refinement, boundary-sensitive loss functions, and domain adaptation will further enhance segmentation accuracy and generalization across varied terrains, making the framework more adaptable for large-scale landslide monitoring.</p>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>This study presented a lightweight dual-stream segmentation framework for mapping landslides from remote sensing imagery. The proposed architecture employs a synergistic combination of a dual encoder-decoder design that preserves modality-specific fidelity while enabling cross-stream information exchange through guided gated fusion of optical and topographical features. An adaptive decoder incorporating cross-attention and attention-guided refinement further enhances boundary delineation and reduces false positives. Comprehensive experiments on two benchmark datasets demonstrated the strong performance of the proposed model, achieving an F1-score of <bold>0.9110</bold> and an IoU of <bold>0.8839</bold> on the Bijie dataset, surpassing state-of-the-art baselines.</p>
<p>The proposed model effectively segmented various types of landslides, including shallow slides, rock falls, and debris slides, demonstrating adaptability across datasets with different geomorphological and climatic triggers. Due to its compact design and computational efficiency, the model is well-suited for rapid regional mapping and near real-time disaster response, with potential for integration into operational geohazard monitoring systems. Future work will explore edge-aware refinement modules, multi-temporal and SAR data fusion, and domain adaptation strategies to improve generalization across diverse geographic environments. Through these extensions, the proposed framework is expected to further enhance the reliability, scalability, and practicality of automated landslide inventory mapping in real-world applications.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to thank the School of Computer Science, Chongqing University, for providing the laboratory facilities and experimental resources that supported this research. During the preparation of this manuscript, the authors utilized ChatGPT (version GPT-5, OpenAI) for assistance in text refinement and LaTeX formatting. The authors have carefully reviewed and revised all AI-assisted content and accept full responsibility for the accuracy and integrity of the manuscript.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This research was funded by the National Natural Science Foundation of China, grant number 62262045, the Fundamental Research Funds for the Central Universities, grant number 2023CDJYGRH-YB11, and the Open Funding of SUGON Industrial Control and Security Center, grant number CUIT-SICSC-2025-03.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization and methodology, Md Minhazul Islam and Yunfei Yin; experimental setup, Zheng Yuan; programming, Md Minhazul Islam; validation, Md Minhazul Islam, Md Tanvir Islam and Zheng Yuan; formal analysis, Md Minhazul Islam; investigation and data curation, Md Minhazul Islam and Argho Dey; writing&#x2014;original draft preparation, Md Minhazul Islam; writing&#x2014;review and editing, Md Tanvir Islam and Yunfei Yin; supervision, Yunfei Yin. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The code supporting this study is openly available at <ext-link ext-link-type="uri" xlink:href="https://github.com/mishaown/DiGATe-UNet-LandSlide-Segmentation">https://github.com/mishaown/DiGATe-UNet-LandSlide-Segmentation</ext-link> (accessed on 3 November 2025).</p>
<p>The datasets used&#x2014;Bijie landslide dataset and Landslide4Sense&#x2014;are both publicly accessible:
<list list-type="simple">
<list-item><label>&#x2022;</label>
<p><bold>Bijie landslide dataset:</bold> Provided by Ji et al. of Wuhan University, this dataset includes 770 landslide images and 2003 non-landslide samples. It can be downloaded from the official site: <ext-link ext-link-type="uri" xlink:href="http://gpcv.whu.edu.cn/data/Bijie_pages.html">http://gpcv.whu.edu.cn/data/Bijie_pages.html</ext-link> [<xref ref-type="bibr" rid="ref-30">30</xref>] (accessed on 3 November 2025).</p></list-item>
<list-item><label>&#x2022;</label>
<p><bold>Landslide4Sense dataset:</bold> A multi-sensor landslide benchmark compiled from various global regions (spanning 2015&#x2013;2021), containing 3799 training images, 245 validation, and 800 test samples, with 14 data bands including Sentinel-2, slope and DEM layers. This is available via the Institute of Advanced Research in Artificial Intelligence (IARAI) Landslide4Sense repository/website: <ext-link ext-link-type="uri" xlink:href="https://www.iarai.ac.at/landslide4sense">https://www.iarai.ac.at/landslide4sense</ext-link> [<xref ref-type="bibr" rid="ref-8">8</xref>] (accessed on 3 November 2025).</p>
<p>Both datasets are openly available under terms of their original publication and hosting platforms.</p></list-item>
</list></p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>The data used in this study were publicly available datasets on the Internet. No human participants or animals were involved in this research.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare that there are no financial or personal relationships with other people or organizations that could inappropriately influence the work. There is no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in, or the review of, this manuscript.</p>
</sec>
<app-group id="appg-1">
<app id="app-1">
<title>Appendix A Training and Validation Curves</title>
<p>This appendix complements the experimental results presented in <xref ref-type="sec" rid="s4_5">Section 4.5</xref>. <xref ref-type="fig" rid="fig-8">Figs. A1</xref> and <xref ref-type="fig" rid="fig-9">A2</xref> show the training and validation curves of loss, F1 score, and IoU for the Bijie and Landslide4Sense datasets, respectively. These plots illustrate the convergence behavior and stability of the proposed model during optimization.</p>
<fig id="fig-8">
<label>Figure A1</label>
<caption>
<title>Training and validation curves (loss, F1 score, and IoU) for the Bijie dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-8.tif"/>
</fig>
<fig id="fig-9">
<label>Figure A2</label>
<caption>
<title>Training and validation curves (loss, F1 score, and IoU) for the Landslide4Sense dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72550-fig-9.tif"/>
</fig>
</app>
<app id="app-2">
<title>Appendix B Computational Efficiency of Pretrained Backbones</title>
<p>This appendix complements the ablation study in <xref ref-type="sec" rid="s4_6">Section 4.6</xref> by comparing computational efficiency across different pretrained backbones. Metrics include floating-point operations (GFLOPs), trainable parameters (M), inference speed (frames per second, FPS), and peak GPU memory usage (MB).</p>
<table-wrap id="table-6">
<label>Table A1</label>
<caption>
<title>Computational efficiency of pretrained backbones. Bold values indicate the best performance</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Backbone</th>
<th>GFLOPs <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>Trainable params (M)</th>
<th>FPS <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>Peak memory (MB) <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>ResNet101</td>
<td>36.46</td>
<td>36.377</td>
<td>37.79</td>
<td>1474.67</td>
</tr>
<tr>
<td>ResNet50</td>
<td>26.71</td>
<td>36.377</td>
<td>66.51</td>
<td>982.22</td>
</tr>
<tr>
<td>MobileNetV3-Small</td>
<td><bold>0.34</bold></td>
<td>1.393</td>
<td><bold>84.54</bold></td>
<td>975.81</td>
</tr>
<tr>
<td>EfficientNet-B4</td>
<td>4.19</td>
<td><bold>1.238</bold></td>
<td>40.34</td>
<td><bold>303.62</bold></td>
</tr>
<tr>
<td>ViT-Base (Patch16-224)</td>
<td>195.48</td>
<td>39.101</td>
<td>22.27</td>
<td>1909.09</td>
</tr>
<tr>
<td>DenseNet121</td>
<td>22.38</td>
<td>23.261</td>
<td>35.56</td>
<td>352.85</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-6fn1" fn-type="other"><p>Note: <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula> indicates higher values are better (FPS), and <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula> indicates lower values are better (GFLOPs and Peak Memory). Bold numbers highlight the most efficient results across corresponding metrics.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</app>
</app-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jiang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Mei</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Review article: deep learning for potential landslide identification: data, models, applications, challenges, and opportunities</article-title>. <source>Landsls Debris Flows Hazard</source>. <year>2025 May</year>. doi:<pub-id pub-id-type="doi">10.5194/egusphere-2025-2158</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Oak</surname> <given-names>O</given-names></string-name>, <string-name><surname>Nazre</surname> <given-names>R</given-names></string-name>, <string-name><surname>Naigaonkar</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sawant</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vaidya</surname> <given-names>H</given-names></string-name></person-group>. <article-title>A comparative analysis of CNN-based deep learning models for landslide detection</article-title>. In: <conf-name>2024 Asian Conference on Intelligent Technologies (ACOIT); 2024 Sep 6&#x2013;7</conf-name>; <publisher-loc>Kolar, India</publisher-loc>: <publisher-name>IEEE</publisher-name>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACOIT62457.2024.10939989</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ge</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Feature-fusion segmentation network for landslide detection using high-resolution remote sensing images and digital elevation model data</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. <year>2023</year>;<volume>61</volume>:<fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TGRS.2022.3233637</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chandra</surname> <given-names>N</given-names></string-name>, <string-name><surname>Sawant</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vaidya</surname> <given-names>H</given-names></string-name></person-group>. <article-title>An efficient U-Net model for improved landslide detection from satellite images</article-title>. <source>PFG J Photogramm Remote Sens Geoinf Sci</source>. <year>2023 Mar</year>;<volume>91</volume>(<issue>1</issue>):<fpage>13</fpage>&#x2013;<lpage>28</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s41064-023-00232-4</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Qin</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>W</given-names></string-name></person-group>. <article-title>An improved faster R-CNN method for landslide detection in remote sensing images</article-title>. <source>J Geovis Spat Anal</source>. <year>2024 Jun</year>;<volume>8</volume>(<issue>1</issue>):<fpage>2</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s41651-023-00163-z</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sreelakshmi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vinod Chandra</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Shaji</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Landslide identification using machine learning techniques: review, motivation, and future prospects</article-title>. <source>Earth Sci Inform</source>. <year>2022 Dec</year>;<volume>15</volume>(<issue>4</issue>):<fpage>2063</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s12145-022-00889-2</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hac&#x0131;efendio&#x011F;lu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Varol</surname> <given-names>N</given-names></string-name>, <string-name><surname>To&#x011F;an</surname> <given-names>V</given-names></string-name>, <string-name><surname>Bahad&#x0131;r</surname> <given-names>&#x00DC;</given-names></string-name>, <string-name><surname>Kartal</surname> <given-names>ME</given-names></string-name></person-group>. <article-title>Automatic landslide detection and visualization by using deep ensemble learning method</article-title>. <source>Neural Comput Appl</source>. <year>2024</year>;<volume>36</volume>(<issue>18</issue>):<fpage>10761</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00521-024-09638-6</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ghorbanzadeh</surname> <given-names>O</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ghamisi</surname> <given-names>P</given-names></string-name>, <string-name><surname>Kopp</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kreil</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Landslide4Sense: reference benchmark data and deep learning models for landslide detection</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. <year>2022</year>;<volume>60</volume>:<fpage>1</fpage>&#x2013;<lpage>17</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TGRS.2022.3215209</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ju</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Gou</surname> <given-names>J</given-names></string-name>, <string-name><surname>He</surname> <given-names>G</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Landslide mapping based on a hybrid CNN-transformer network and deep transfer learning using remote sensing images with topographic and spectral features</article-title>. <source>Int J Appl Earth Obs Geoinf</source>. <year>2024</year>;<volume>126</volume>:<fpage>103612</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jag.2023.103612</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>G</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A novel landslide identification method for multi-scale and complex background region based on multi-model fusion: YOLO &#x002B; U-Net</article-title>. <source>Landslides</source>. <year>2024</year>;<volume>21</volume>(<issue>4</issue>):<fpage>901</fpage>&#x2013;<lpage>17</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10346-023-02184-7</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>EMR-HRNet: a multi-scale feature fusion network for landslide segmentation from remote sensing images</article-title>. <source>Sensors</source>. <year>2024</year>;<volume>24</volume>(<issue>11</issue>):<fpage>3677</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s24113677</pub-id>; <pub-id pub-id-type="pmid">38894469</pub-id></mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tong</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Hierarchical cross attention achieves pixel precise landslide segmentation in submeter optical imagery</article-title>. <source>Sci Rep</source>. <year>2025</year>;<volume>15</volume>(<issue>1</issue>):<fpage>21933</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-08695-8</pub-id>; <pub-id pub-id-type="pmid">40595351</pub-id></mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Licata</surname> <given-names>M</given-names></string-name>, <string-name><surname>Buleo Tebar</surname> <given-names>V</given-names></string-name>, <string-name><surname>Seitone</surname> <given-names>F</given-names></string-name>, <string-name><surname>Fubelli</surname> <given-names>G</given-names></string-name></person-group>. <article-title>The Open landslide project (OLP), a new inventory of shallow landslides for susceptibility models: the autumn 2019 extreme rainfall event in the langhe-monferrato region (Northwestern Italy)</article-title>. <source>Geosciences</source>. <year>2023</year>;<volume>13</volume>(<issue>10</issue>):<fpage>289</fpage>. doi:<pub-id pub-id-type="doi">10.3390/geosciences13100289</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ferrario</surname> <given-names>MF</given-names></string-name></person-group>. <article-title>Inventory of landslides triggered by heavy rainfall in the Emilia-Romagna region (Italy) in May 2023</article-title> [Dataset]. Zenodo; <year>2024 Aug</year>. doi:<pub-id pub-id-type="doi">10.5281/zenodo.13234762</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rana</surname> <given-names>K</given-names></string-name>, <string-name><surname>Malik</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ozturk</surname> <given-names>U</given-names></string-name></person-group>. <article-title>Landsifier v1.0: a python library to estimate likely triggers of mapped landslides</article-title>. <source>Nat Hazards Earth Syst Sci</source>. <year>2022</year>;<volume>22</volume>(<issue>11</issue>):<fpage>3751</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.5194/nhess-22-3751-2022</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ge</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xiang</surname> <given-names>W</given-names></string-name></person-group>. <article-title>A multi-source data fusion-based semantic segmentation model for relic landslide detection</article-title>. <comment>arXiv:2308.01251. 2025</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2308.01251</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>W</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>GDSNet: a gated dual-stream convolutional neural network for automatic recognition of coseismic landslides</article-title>. <source>Int J Appl Earth Obs Geoinf</source>. <year>2024</year>;<volume>127</volume>:<fpage>103677</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jag.2024.103677</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ghorbanzadeh</surname> <given-names>O</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>The outcome of the 2022 Landslide4Sense competition: advanced landslide detection from multi-source satellite imagery</article-title>. <comment>arXiv:2209.02556. 2022</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2209.02556</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>&#x015E;ener</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ergen</surname> <given-names>B</given-names></string-name></person-group>. <article-title>LandslideSegNet: an effective deep learning network for landslide segmentation using remote sensing imagery</article-title>. <source>Earth Sci Inform</source>. <year>2024</year>;<volume>17</volume>(<issue>5</issue>):<fpage>3963</fpage>&#x2013;<lpage>77</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s12145-024-01434-z</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>M</given-names></string-name></person-group>. <article-title>A multiscale feature fusion enhanced CNN with the multiscale channel attention mechanism for efficient landslide detection (MS2LandsNet) using medium-resolution remote sensing data</article-title>. <source>Int J Digit Earth</source>. <year>2024</year>;<volume>17</volume>(<issue>1</issue>):<fpage>2300731</fpage>. doi:<pub-id pub-id-type="doi">10.1080/17538947.2023.2300731</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lv</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Du</surname> <given-names>F</given-names></string-name></person-group>. <article-title>ShapeFormer: a shape-enhanced vision transformer model for optical remote sensing image landslide detection</article-title>. <source>IEEE J Sel Top Appl Earth Obs Remote Sens</source>. <year>2023</year>;<volume>16</volume>:<fpage>2681</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JSTARS.2023.3253769</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Tan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Le</surname> <given-names>QV</given-names></string-name></person-group>. <article-title>EfficientNet: rethinking model scaling for convolutional neural networks</article-title>. <comment>arXiv:1905.11946. 2020</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1905.11946</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>W</given-names></string-name>, <string-name><surname>Caballero</surname> <given-names>J</given-names></string-name>, <string-name><surname>Husz&#x00E1;r</surname> <given-names>F</given-names></string-name>, <string-name><surname>Totz</surname> <given-names>J</given-names></string-name>, <string-name><surname>Aitken</surname> <given-names>AP</given-names></string-name>, <string-name><surname>Bishop</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network</article-title>. <comment>arXiv:1609.05158. 2016</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1609.05158</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Arevalo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Solorio</surname> <given-names>T</given-names></string-name>, <string-name><surname>Montes-y-G&#x00F3;mez</surname> <given-names>M</given-names></string-name>, <string-name><surname>Gonz&#x00E1;lez</surname> <given-names>FA</given-names></string-name></person-group>. <article-title>Gated multimodal networks</article-title>. <source>Neural Comput Appl</source>. <year>2020</year>;<volume>32</volume>(<issue>14</issue>):<fpage>10209</fpage>&#x2013;<lpage>28</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00521-019-04559-1</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Salehi</surname> <given-names>SSM</given-names></string-name>, <string-name><surname>Erdogmus</surname> <given-names>D</given-names></string-name>, <string-name><surname>Gholipour</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Tversky loss function for image segmentation using 3D fully convolutional deep networks</article-title>. <comment>arXiv:1706.05721. 2017</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1706.05721</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A comprehensive review on deep supervision: theories and applications</article-title>. <comment>arXiv:2207.02376. 2022 Jul</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2207.02376</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ouyang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>B</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>CAS landslide dataset: a large-scale and multisensor dataset for deep learning-based landslide detection</article-title>. <source>Sci Data</source>. <year>2024</year>;<volume>11</volume>(<issue>1</issue>):<fpage>12</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41597-023-02847-z</pub-id>; <pub-id pub-id-type="pmid">38168493</pub-id></mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Nava</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A globally distributed dataset of coseismic landslide mapping via multi-source high-resolution remote sensing images</article-title>. <source>Earth Syst Sci Data</source>. <year>2024</year>;<volume>16</volume>(<issue>10</issue>):<fpage>4817</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.5194/essd-16-4817-2024</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Handwerger</surname> <given-names>AL</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>MH</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>SY</given-names></string-name>, <string-name><surname>Amatya</surname> <given-names>P</given-names></string-name>, <string-name><surname>Kerner</surname> <given-names>HR</given-names></string-name>, <string-name><surname>Kirschbaum</surname> <given-names>DB</given-names></string-name></person-group>. <article-title>Generating landslide density heatmaps for rapid detection using open-access satellite radar data in Google Earth Engine</article-title>. <source>Nat Hazards Earth Syst Sci</source>. <year>2022</year>;<volume>22</volume>:<fpage>753</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.5194/nhess-22-753-2022</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ji</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Landslide detection from an open satellite imagery and digital elevation model dataset using attention boosted convolutional neural networks</article-title>. <source>Landslides</source>. <year>2020</year>;<volume>17</volume>(<issue>6</issue>):<fpage>1337</fpage>&#x2013;<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10346-020-01353-2</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Chaurasia</surname> <given-names>A</given-names></string-name>, <string-name><surname>Culurciello</surname> <given-names>E</given-names></string-name></person-group>. <chapter-title>LinkNet: exploiting encoder representations for efficient semantic segmentation</chapter-title>. In: <source>2017 IEEE Visual Communications and Image Processing (VCIP); 2017 Dec 10&#x2013;13; St. Petersburg, FL, USA: IEEE</source>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>. doi:<pub-id pub-id-type="doi">10.1109/VCIP.2017.8305148</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Daudt</surname> <given-names>RC</given-names></string-name>, <string-name><surname>Saux</surname> <given-names>BL</given-names></string-name>, <string-name><surname>Boulch</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Fully convolutional siamese networks for change detection</article-title>. <comment>arXiv:1810.08462. 2018</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1810.08462</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>GS</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Du</surname> <given-names>B</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Pelillo</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Asymmetric siamese networks for semantic change detection in aerial images</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. <year>2022</year>;<volume>60</volume>:<fpage>1</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TGRS.2021.3113912</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Pham</surname> <given-names>L</given-names></string-name>, <string-name><surname>Le</surname> <given-names>C</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Truong</surname> <given-names>K</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Lampert</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>RMAU-NET: a residual-multihead-attention U-Net architecture for landslide segmentation and detection from remote sensing images</article-title>. <comment>arXiv:2507.11143. 2025</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2507.11143</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>G</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A landslide area segmentation method based on an improved UNet</article-title>. <source>Sci Rep</source>. <year>2025 Apr</year>;<volume>15</volume>(<issue>1</issue>):<fpage>11852</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-94039-5</pub-id>; <pub-id pub-id-type="pmid">40195381</pub-id></mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>He</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Yi</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>A novel network for semantic segmentation of landslide areas in remote sensing images with multi-branch and multi-scale fusion</article-title>. <source>Appl Soft Comput</source>. <year>2024 Jun</year>;<volume>158</volume>:<fpage>111542</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.asoc.2024.111542</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>