<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">SDHM</journal-id>
<journal-id journal-id-type="nlm-ta">SDHM</journal-id>
<journal-id journal-id-type="publisher-id">SDHM</journal-id>
<journal-title-group>
<journal-title>Structural Durability &#x0026; Health Monitoring</journal-title>
</journal-title-group>
<issn pub-type="epub">1930-2991</issn>
<issn pub-type="ppub">1930-2983</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">68987</article-id>
<article-id pub-id-type="doi">10.32604/sdhm.2025.068987</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Automatic Recognition Algorithm of Pavement Defects Based on S<sup><bold>3</bold></sup>M and SDI Modules Using UAV-Collected Road Images</article-title>
<alt-title alt-title-type="left-running-head">Automatic Recognition Algorithm of Pavement Defects Based on S<sup>3</sup>M and SDI Modules Using UAV-Collected Road Images</alt-title>
<alt-title alt-title-type="right-running-head">Automatic Recognition Algorithm of Pavement Defects Based on S<sup>3</sup>M and SDI Modules Using UAV-Collected Road Images</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Hongcheng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Yang</surname><given-names>Tong</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Hu</surname><given-names>Yihui</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Guo</surname><given-names>Fengxiang</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>guofengxiang@kust.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>Yunnan Transportation Science Research Institute Co., Ltd.</institution>, <addr-line>Kunming, 650200</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Faculty of Transportation Engineering, Kunming University of Science and Technology</institution>, <addr-line>Kunming, 650500</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Fengxiang Guo. Email: <email>guofengxiang@kust.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>6</day><month>1</month><year>2026</year>
</pub-date>
<volume>20</volume>
<issue>1</issue>
<elocation-id>6</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>6</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>7</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="_SDHM_68987.pdf"></self-uri>
<abstract>
<p>With the rapid development of transportation infrastructure, ensuring road safety through timely and accurate highway inspection has become increasingly critical. Traditional manual inspection methods are not only time-consuming and labor-intensive, but they also struggle to provide consistent, high-precision detection and real-time monitoring of pavement surface defects. To overcome these limitations, we propose an Automatic Recognition of Pavement Defect (ARPD) algorithm, which leverages unmanned aerial vehicle (UAV)-based aerial imagery to automate the inspection process. The ARPD framework incorporates a backbone network based on the Selective State Space Model (S<sup>3</sup>M), which is designed to capture long-range temporal dependencies. This enables effective modeling of dynamic correlations among redundant and often repetitive structures commonly found in road imagery. Furthermore, a neck structure based on Semantics and Detail Infusion (SDI) is introduced to guide cross-scale feature fusion. The SDI module enhances the integration of low-level spatial details with high-level semantic cues, thereby improving feature expressiveness and defect localization accuracy. Experimental evaluations demonstrate that the ARPD algorithm achieves a mean average precision (mAP) of 86.1% on a custom-labeled pavement defect dataset, outperforming the state-of-the-art YOLOv11 segmentation model. The algorithm also maintains strong generalization ability on public datasets. These results confirm that ARPD is well-suited for diverse real-world applications in intelligent, large-scale highway defect monitoring and maintenance planning.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Pavement defects</kwd>
<kwd>state space model</kwd>
<kwd>UAV</kwd>
<kwd>detection algorithm</kwd>
<kwd>image processing</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Technical Service for the Development and Application of an Intelligent Visual Management Platform for Expressway Construction Progress Based on BIM Technology</funding-source>
<award-id>JKYZLX-2023-09</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Technical Service for the Development of an Early Warning Model in the Research and Application of Key Technologies for Tunnel Operation Safety Monitoring and Early Warning Based on Digital Twin</funding-source>
<award-id>JK-S02-ZNGS-202412-JISHU-FA-0035</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>As critical components of national infrastructure, roads play a vital role in daily transportation and freight logistics. Their safety, comfort, and operational efficiency have a direct impact on social stability and economic development. With the continuous growth in travel demand, pavement surfaces are increasingly subject to various forms of defects, such as longitudinal cracks (LC), transverse cracks (TC), oblique cracks (OC), alligator cracks (AC), potholes (PH), and asphalt repairs (RP), as illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. Under repeated loading, these defects can evolve into more severe issues such as through-cracks, rutting, spalling, and structural failure, posing significant safety risks [<xref ref-type="bibr" rid="ref-1">1</xref>]. If not addressed in a timely manner, pavement deterioration may lead to serious traffic accidents and endanger public safety. Therefore, developing efficient and accurate pavement inspection methods is of great significance for enhancing transportation safety [<xref ref-type="bibr" rid="ref-2">2</xref>].</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>UAV-based pavement surface defect inspection. (<bold>a</bold>) UAV aerial view. (<bold>b</bold>) Main defects</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-1.tif"/>
</fig>
<p>Traditional road inspection methods still rely heavily on manual visual assessment, which is not only inefficient but also poses safety risks and introduces significant human error, falling short of the demands of modern transportation systems [<xref ref-type="bibr" rid="ref-3">3</xref>]. Although road inspection vehicles equipped with high-resolution cameras and sensors have been developed, their limited field of view results in blind spots, preventing comprehensive coverage. Moreover, the data collected by such vehicles still require manual filtering and analysis, which remains time-consuming and labor-intensive [<xref ref-type="bibr" rid="ref-4">4</xref>]. Fortunately, unmanned aerial vehicle (UAV) technology offers a promising alternative for road maintenance due to its wide field of view, broad coverage, and low cost. UAVs equipped with high-resolution cameras and infrared sensors enable fast and efficient detection of pavement defects. Compared to traditional manual inspection, UAV-based approaches significantly improve detection efficiency and accuracy while reducing personnel risks, making them a research hotspot in pavement defect detection [<xref ref-type="bibr" rid="ref-5">5</xref>]. However, manually reviewing large volumes of UAV imagery remains laborious, highlighting the urgent need for an automatic defect recognition algorithm based on UAV road surface images.</p>
<p>Over the past decades, road inspection methods have mainly evolved through two technological stages: (1) Image Processing (IP) Techniques: Early researchers developed defect detection methods using traditional IP algorithms such as thresholding [<xref ref-type="bibr" rid="ref-6">6</xref>], wavelet transforms [<xref ref-type="bibr" rid="ref-7">7</xref>], and edge detection [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>]. While these techniques provided quick results, they required manual parameter tuning and lacked generalization capabilities. (2) Convolutional Neural Networks (CNNs): CNN-based methods leverage the deep representation capabilities of neural networks to automatically learn multi-level image features, enabling semantic understanding of image content [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-12">12</xref>]. Compared to IP techniques, CNNs exhibit superior performance in identifying complex pavement structures and subtle defects. These methods fall into two categories: two-stage approaches based on region proposal networks [<xref ref-type="bibr" rid="ref-13">13</xref>&#x2013;<xref ref-type="bibr" rid="ref-15">15</xref>], which deliver high accuracy but suffer from low inference speed, and single-stage approaches based on direct bounding box regression [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>], which offer faster inference at the cost of slight accuracy degradation and are widely adopted in current object detection tasks. For instance, Shan et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] designed an asymmetric loss function tailored for road crack recognition and implemented it within a U-Net framework, achieving precise crack pattern extraction on UAV datasets. Similarly, Tse et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] employed a mean Intersection over Union (mIoU)-based loss function within U-Net to control gradient descent, attaining state-of-the-art performance in UAV-based crack detection. However, these methods primarily focus on crack features while neglecting other critical defects such as alligator cracking or potholes. To address the multi-defect detection challenge, Feng et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] utilized a Context Encoder Network (CE-Net)-based semantic segmentation model for simultaneous detection and segmentation of various pavement defects, enabling comprehensive health assessments. Dugalam and Prakash [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed a UAV LiDAR and Random Forest-based algorithm that achieved promising results for subsidence and pothole detection. Nonetheless, the precision and efficiency of these models still have room for improvement in complex multi-defect pavement scenarios.</p>
<p>In recent years, visual methodologies based on emerging deep learning paradigms such as Transformers and Mamba have been successfully applied to transportation infrastructure inspection and broader structural health monitoring tasks [<xref ref-type="bibr" rid="ref-23">23</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>]. Transformer-based models (e.g., Vision Transformer [<xref ref-type="bibr" rid="ref-25">25</xref>], MobileNet [<xref ref-type="bibr" rid="ref-26">26</xref>], and U-Net [<xref ref-type="bibr" rid="ref-27">27</xref>]) leverage self-attention mechanisms to capture global context and flexibly model long-range dependencies between features. However, these models are inherently limited by their high computational complexity. Furthermore, their dependency on large-scale training datasets and resource-intensive hardware significantly impairs their real-time applicability. To overcome these limitations, the Mamba architecture, built upon the Selective State Space Model (S<sup>3</sup>M), introduces explicit state variables to adaptively model input sequences [<xref ref-type="bibr" rid="ref-28">28</xref>]. This approach not only effectively captures long-term temporal dependencies but also reduces redundant information, thereby achieving outstanding performance in continuous-time sequence modeling tasks. For example, Han et al. [<xref ref-type="bibr" rid="ref-29">29</xref>] developed MambaCrackNet, which integrates residual vision Mamba blocks for pixel-level road crack segmentation, achieving strong results on public datasets. Similarly, Zhu et al. [<xref ref-type="bibr" rid="ref-30">30</xref>] proposed MSCrackMamba, a two-stage crack detection paradigm with Vision Mamba as its backbone, reporting a 3.55% improvement in mIoU over baseline models.</p>
<p>Despite these promising advances, substantial challenges persist in transitioning these methods to real-world deployment scenarios. Specifically, existing studies continue to face difficulties in addressing complex environmental variations, meeting real-time processing constraints, and detecting fine-grained or small-scale defects. In the context of UAV-based road surface defect detection, current CNN- and Transformer-based methods encounter several notable challenges: (1) Extremely limited semantic information: Although UAV imagery typically offers ultra-high resolution, pavement defects occupy only a minimal portion of the pixel space, resulting in sparse semantic cues for effective feature extraction. (2) Significant variation in object scale: Pavement defects encompass a wide range of categories, each with differing physical dimensions. (3) Irregular and sparsely distributed targets: As the most prevalent defect type, cracks tend to be narrow, elongated, and irregularly distributed. From a top-down UAV perspective, their spatial arrangement lacks predictable patterns, complicating detection and modeling. These challenges highlight the need for more robust and efficient detection frameworks capable of operating under real-world constraints while maintaining high precision and generalizability.</p>
<p>To address the aforementioned challenges and limitations, inspired by pioneering research, this study develops an Automatic Recognition of Pavement Defects (ARPD) algorithm based on UAV-acquired imagery. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the ARPD framework consists of three major stages. First, the UAV-PDD2023 dataset [<xref ref-type="bibr" rid="ref-1">1</xref>] is utilized and randomly divided into training, validation, and testing subsets. The training and validation sets include paired images and label files, while the testing set contains only images with no duplication across datasets. Second, a backbone network based on the Selective State Space Model (S<sup>3</sup>M) [<xref ref-type="bibr" rid="ref-28">28</xref>] is integrated into ARPD for fine-grained feature extraction. This is followed by a Semantics and Detail Infusion (SDI)-based neck module [<xref ref-type="bibr" rid="ref-27">27</xref>], which performs multi-level feature fusion using the training and validation sets. Finally, the algorithm&#x2019;s generalization and real-time performance are evaluated through inference on both the public RDD2022 dataset [<xref ref-type="bibr" rid="ref-31">31</xref>] and the designated testing set.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The overview of ARPD algorithm</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-2.tif"/>
</fig>
<p>The main contributions and innovations of this study are as follows:
<list list-type="simple">
<list-item><label>(1)</label><p>An automatic multi-class pavement surface defect recognition model based on a Selective State Space Model (S<sup>3</sup>M) and Semantics and Detail Infusion (SDI) is developed, achieving promising results on UAV-perspective datasets.</p></list-item>
<list-item><label>(2)</label><p>An S<sup>3</sup>M-based backbone is integrated into the model to extract fine-grained features through temporal state updates and Zero-Order Hold (ZOH), enabling efficient long-range dependency modeling among spatially discrete surface defects.</p></list-item>
<list-item><label>(3)</label><p>Additionally, the architecture incorporates a lightweight neck module with skip connections, which applies both spatial and channel-wise attention mechanisms to effectively integrate semantic cues at multiple scales for improved defect recognition accuracy.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>ARPD Algorithm</title>
<sec id="s2_1">
<label>2.1</label>
<title>Algorithm Overview</title>
<p>As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the ARPD algorithm consists of three steps:
<list list-type="bullet">
<list-item>
<p>UAV-Based Pavement Dataset Construction: As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the UAV-PDD2023 dataset is constructed using aerial images captured by unmanned aerial vehicles (UAVs). The dataset is randomly divided into training, validation, and testing subsets to ensure diversity and independence across sets.</p></list-item>
<list-item>
<p>Pavement Defect Recognition: After undergoing preprocessing procedures (including resizing, random cropping, and Mosaic augmentation [<xref ref-type="bibr" rid="ref-17">17</xref>]), the dataset is fed into the ARPD algorithm. Specifically, the ARPD integrates a Selective State Space Model (S<sup>3</sup>M)-based backbone for adaptive long-range dependency modeling, which effectively captures edge semantic information of slender and small-scale defects such as cracks. Additionally, a Semantics and Detail Infusion (SDI)-enhanced neck module is employed for multi-scale feature fusion. A hybrid loss function is adopted to guide gradient propagation during training, enhancing the algorithm&#x2019;s ability to detect diverse defect types.</p>
<list-item>
<p>System Validation on Test and Public Datasets: The testing set and the public dataset RDD2022, both containing only unlabeled images and excluded from the training process, are used to validate the ARPD algorithm. This evaluation demonstrates the algorithm&#x2019;s generalization ability and robustness under real-world conditions.</p>
</list></p>
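The Mosaic augmentation mentioned in the preprocessing step stitches four source images into a single training sample around a random center point, exposing the detector to varied defect scales and contexts in one image. The sketch below illustrates the core stitching step with NumPy; the output size, random-center range, and nearest-neighbour resizing are illustrative assumptions, and the label remapping performed in a real training pipeline is omitted.

```python
import numpy as np

def mosaic_augment(images, out_size=640, rng=None):
    """Stitch four images into one mosaic around a random center point.

    Simplified sketch of Mosaic augmentation (bounding-box remapping
    omitted); `out_size` and the center range are illustrative choices.
    """
    rng = rng or np.random.default_rng()
    assert len(images) == 4, "Mosaic combines exactly four source images"
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # Random mosaic center, kept away from the borders
    cx = int(rng.uniform(0.3, 0.7) * out_size)
    cy = int(rng.uniform(0.3, 0.7) * out_size)
    regions = [  # (rows, cols) of the four quadrants
        (slice(0, cy), slice(0, cx)),                # top-left
        (slice(0, cy), slice(cx, out_size)),         # top-right
        (slice(cy, out_size), slice(0, cx)),         # bottom-left
        (slice(cy, out_size), slice(cx, out_size)),  # bottom-right
    ]
    for img, (ys, xs) in zip(images, regions):
        h, w = ys.stop - ys.start, xs.stop - xs.start
        # Naive nearest-neighbour resize via index sampling
        ri = (np.arange(h) * img.shape[0] // h)[:, None]
        ci = (np.arange(w) * img.shape[1] // w)[None, :]
        canvas[ys, xs] = img[ri, ci]
    return canvas
```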
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>S<sup>3</sup>M-Based Backbone</title>
<p>As illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the backbone of the ARPD algorithm integrates four key modules: VssMamba, VCM, SPPF, and C2PSA. The detailed configuration parameters of the backbone are presented in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>

<p><list list-type="simple">
<list-item><label>(1)</label>
<p>VssMamba (Vision State Space Mamba) Module: As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the VssMamba module incorporates Depthwise Convolution (DWConv) [<xref ref-type="bibr" rid="ref-32">32</xref>] for enhanced feature extraction at the input stage. This design enables the network to capture deeper and more expressive feature representations. To maintain efficiency and stability during training and inference, Batch Normalization (BN) and Layer Normalization (LN) are employed. The computation is governed by the following equations:</p></list-item>
</list>
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>B</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>D</mml:mi><mml:mi>W</mml:mi><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>S</italic> denotes the nonlinear Sigmoid Linear Unit (SiLU) activation function. The integration of DWConv enables the VssMamba module to facilitate effective feature propagation and maintain stable training, particularly under deep stacking conditions. The corresponding formulation is expressed as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>B</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula>where <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denote the input and output features, respectively, and <italic>RB</italic> represents the Residual Block. The VssMamba module consists of scanning and expansion steps, performing directional feature extraction along top-down, bottom-up, left-to-right, and right-to-left pathways. These operations are further validated with label supervision. This bidirectional scanning strategy not only ensures full spatial coverage of the input image but also constructs a rich multi-dimensional feature pool through systematic directional transformations, thereby enhancing the efficiency and comprehensiveness of multi-scale feature extraction.</p>
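The residual skeleton of Eqs. (1)–(3) can be sketched as follows. This is a minimal NumPy illustration of the data flow, not the trained module: the 1 × 1 depthwise convolution reduces to a per-channel scaling, layer normalization stands in for both BN and LN, and the S<sup>3</sup> scan and residual block are passed in as placeholder callables.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    """Normalize over the channel (last) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def vss_mamba_block(z, w_dw, s3_fn, rb_fn):
    """Residual skeleton of Eqs. (1)-(3).

    z     : (H, W, C) feature map
    w_dw  : length-C weight vector standing in for the 1x1 DWConv
    s3_fn : placeholder for the S^3 selective-scan operator
    rb_fn : placeholder for the residual block RB
    """
    # Eq. (1): depthwise 1x1 conv -> norm -> SiLU
    z2 = silu(layer_norm(z * w_dw))   # LN used in place of BN here
    # Eq. (2): S^3 scan on normalized features, plus skip connection
    z1 = s3_fn(layer_norm(z2)) + z2
    # Eq. (3): residual block on normalized features, plus skip connection
    return rb_fn(layer_norm(z1)) + z1
```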
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>VssMamba module. (<bold>a</bold>) VssMamba block. (<bold>b</bold>) S<sup>3</sup> block. Note that MLP denotes multilayer perceptron, and &#x2018;&#x002B;&#x2019; denotes Concat (concatenation)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-3.tif"/>
</fig><table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>S<sup>3</sup>M-based backbone coefficients</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Layer</th>
<th>Kernel size</th>
<th>Stride</th>
<th>Repetitions</th>
<th>Output channel</th>
</tr>
</thead>
<tbody>
<tr>
<td>Conv</td>
<td>4 &#x00D7; 4</td>
<td>4</td>
<td>1</td>
<td>128</td>
</tr>
<tr>
<td>VssMamba</td>
<td>1 &#x00D7; 1</td>
<td>1</td>
<td>2</td>
<td>128</td>
</tr>
<tr>
<td>VCM</td>
<td>3 &#x00D7; 3</td>
<td>2</td>
<td>1</td>
<td>256</td>
</tr>
<tr>
<td>VssMamba</td>
<td>1 &#x00D7; 1</td>
<td>1</td>
<td>2</td>
<td>256</td>
</tr>
<tr>
<td>VCM</td>
<td>3 &#x00D7; 3</td>
<td>2</td>
<td>1</td>
<td>512</td>
</tr>
<tr>
<td>VssMamba</td>
<td>1 &#x00D7; 1</td>
<td>1</td>
<td>2</td>
<td>512</td>
</tr>
<tr>
<td>VCM</td>
<td>3 &#x00D7; 3</td>
<td>2</td>
<td>1</td>
<td>1024</td>
</tr>
<tr>
<td>VssMamba</td>
<td>1 &#x00D7; 1</td>
<td>1</td>
<td>2</td>
<td>1024</td>
</tr>
<tr>
<td>SPPF</td>
<td>5 &#x00D7; 5</td>
<td>1</td>
<td>1</td>
<td>1024</td>
</tr>
<tr>
<td>C2PSA</td>
<td>3 &#x00D7; 3</td>
<td>2</td>
<td>2</td>
<td>1024</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The S<sup>3</sup> (Selective State Space) module maps a univariate input sequence <italic>x</italic>(<italic>t</italic>) &#x2208; <italic>R</italic> to an output sequence <italic>y</italic>(<italic>t</italic>) via an implicit intermediate hidden state <italic>h</italic>(<italic>t</italic>) &#x2208; <italic>R</italic><sup><italic>N</italic></sup>, as defined by the following first-order differential equation:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>h</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>B</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>h</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>A</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>B</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> are the state transition matrix, input projection matrix, and output mapping matrix, respectively. To enhance subsequent feature representation, Zero-Order Hold (ZOH) is employed in the S<sup>3</sup>M module for discretizing continuous feature signals [<xref ref-type="bibr" rid="ref-28">28</xref>]. For a continuous-time segment [<italic>t</italic><sub><italic>a</italic></sub>, <italic>t</italic><sub><italic>b</italic></sub>], the latent state representation <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>l</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is given by:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>l</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mi>l</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mi>B</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>&#x03C4;</mml:mi></mml:math></disp-formula>where <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. In this expression, the first term represents the free evolution of the latent state, while the second term reflects the cumulative influence of the input sequence <italic>x</italic>(<italic>t</italic>) over the interval. 
In practice, approximate numerical techniques, such as matrix exponential approximation, are employed to compute <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, enabling the transition to a discrete-time formulation of state evolution.
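For the diagonal state matrices commonly used in Mamba-style parameterizations, this ZOH discretization has a closed form, and the resulting recurrence can be run as a simple scan. The sketch below assumes a diagonal <italic>A</italic> and scalar inputs; it illustrates the discretization of Eqs. (4)–(6), not the hardware-efficient selective scan of the actual S<sup>3</sup>M implementation.

```python
import numpy as np

def zoh_discretize(a_diag, b, dt):
    """ZOH discretization of h' = A h + B x for diagonal A (our assumption).

    Returns (A_bar, B_bar) such that h_k = A_bar * h_{k-1} + B_bar * x_k.
    """
    a_bar = np.exp(a_diag * dt)               # e^{A dt}, elementwise for diagonal A
    b_bar = (a_bar - 1.0) / a_diag * b        # exact ZOH input term: (e^{A dt}-1)/A * B
    return a_bar, b_bar

def ssm_scan(a_diag, b, c, x, dt=1.0):
    """Run the discretized recurrence over a 1-D input sequence x."""
    a_bar, b_bar = zoh_discretize(a_diag, b, dt)
    h = np.zeros_like(a_diag)
    y = np.empty_like(np.asarray(x, dtype=float))
    for k, xk in enumerate(x):
        h = a_bar * h + b_bar * xk            # latent state update, Eq. (6)
        y[k] = c @ h                          # readout y = C h, Eq. (5)
    return y
```

With a stable state matrix (negative diagonal entries) and a constant input, the scanned output converges to the steady-state gain of the system, which is a quick sanity check on the discretization.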
<list list-type="simple">
<list-item><label>(2)</label><p>Vision Clue Merge (VCM) Module: Although CNN and Transformer-based architectures typically utilize convolution operations for downsampling, directional feature scanning may introduce interference during multi-path extraction. To address this issue, VMamba [<xref ref-type="bibr" rid="ref-33">33</xref>] employs 1 &#x00D7; 1 convolutions for dimensionality reduction, while MambaYOLO [<xref ref-type="bibr" rid="ref-34">34</xref>] utilizes 4&#x00D7; compressed pointwise convolutions for downsampling. Inspired by these methods, the proposed VCM module adopts a 3 &#x00D7; 3 convolution with stride 2 for spatial downsampling and complements it with pointwise convolution to preserve informative clues during resolution reduction.</p></list-item>
<list-item><label>(3)</label><p>Spatial Pyramid Pooling Fast (SPPF) Module: A lightweight adaptation of spatial pyramid pooling, the SPPF module [<xref ref-type="bibr" rid="ref-17">17</xref>] is designed to capture multi-scale features at the end of the backbone. By applying multiple max-pooling operations (typically with a 5 &#x00D7; 5 kernel) at different receptive field scales, it enables hierarchical feature aggregation, which is particularly advantageous for scenarios with high object scale variance. For a given input feature map</p></list-item>
</list>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>M</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mn>5</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>5</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the input and output features of SPPF, respectively, and <italic>MP</italic> represents maximum pooling.
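The hierarchical aggregation of Eq. (7) can be sketched for a single channel in NumPy: three chained 5 &#x00D7; 5 max pools are concatenated with the input, so the three pooled maps carry effective receptive fields of 5, 9, and 13 (names are illustrative; the actual module operates on multi-channel tensors with 1 &#x00D7; 1 convolutions around the pooling):

```python
import numpy as np

def max_pool2d(x, k=5):
    """Stride-1, same-size k x k max pooling with -inf padding."""
    pad = k // 2
    H, W = x.shape
    xp = np.full((H + 2 * pad, W + 2 * pad), -np.inf)
    xp[pad:pad + H, pad:pad + W] = x
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].max()
    return out

def sppf(x, k=5):
    """SPPF sketch: three chained k x k max pools, concatenated with
    the input along the channel axis, as in Eq. (7)."""
    p1 = max_pool2d(x, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return np.stack([x, p1, p2, p3])  # (4, H, W) output
```

Chaining is what makes the module "fast": two stride-1 5 &#x00D7; 5 max pools are equivalent to a single 9 &#x00D7; 9 pool, so the larger windows of classic SPP come almost for free.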
<list list-type="simple">
<list-item><label>(4)</label><p>Compressed Channel-Wise Partial Self-Attention (C2PSA) Module: The C2PSA module combines channel and spatial attention mechanisms in a parallel structure to improve the representational power of convolutional blocks. Originally introduced in YOLOv11 [<xref ref-type="bibr" rid="ref-17">17</xref>] as an enhancement to YOLOv8, C2PSA selectively emphasizes informative feature channels and spatial locations through attention weighting. In this study, we incorporate C2PSA into the final stage of the ARPD backbone, aiming to further improve its performance in complex pavement imagery. Given an input feature map <italic>X</italic>, the output <italic>X</italic><sub><italic>o</italic></sub> is computed as:</p></list-item>
</list></p>
<p><disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mi>G</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>X</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mi>G</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>X</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>&#x03C7;</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the channel and spatial attention weights, respectively, <italic>GAP</italic> denotes Global Average Pooling, <italic>w</italic> and <italic>b</italic> are learnable parameters, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the sigmoid activation function, and &#x2299; denotes element-wise multiplication.</p>
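A minimal sketch of Eqs. (8)&#x2013;(10), taking the attention weights literally as sigmoid-gated functions of global average pooling; the parameter shapes here are simplifying assumptions, and the full C2PSA block is considerably richer than this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def c2psa_attention(X, wc, bc, ws, bs):
    """Sketch of Eqs. (8)-(10): channel and spatial attention weights
    derived from GAP(X), applied multiplicatively to the input.

    X: feature map of shape (C, H, W); wc, ws: per-channel weight
    vectors of shape (C,); bc, bs: scalar biases (assumed shapes).
    """
    gap = X.mean(axis=(1, 2))            # GAP(X): one value per channel
    alpha_c = sigmoid(wc * gap + bc)     # Eq. (8), shape (C,)
    alpha_s = sigmoid(ws * gap + bs)     # Eq. (9), per-channel gate (simplified)
    # Eq. (10): X_o = alpha_c .* (alpha_s .* X), broadcast over H, W
    return alpha_c[:, None, None] * (alpha_s[:, None, None] * X)
```

With zero weights and biases both gates equal 0.5, so the block degenerates to a uniform 0.25&#x00D7; scaling, which makes the multiplicative structure of Eq. (10) easy to verify.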
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>SDI-Based Neck Structure</title>
<p>Existing neck structures predominantly rely on single-task feature pyramid networks (FPNs) to further process and enhance the features extracted by the backbone. In multi-scale object detection, classical backbone-neck-head architectures typically adopt FPN or Path Aggregation Network (PAN) for feature fusion, which have demonstrated promising results in transportation infrastructure maintenance scenarios. However, such neck designs often restrict inter-layer information transmission to intermediate layers only. To address this limitation, this study employs a lightweight skip connection-based neck structure for Semantics and Detail Infusion (SDI) [<xref ref-type="bibr" rid="ref-27">27</xref>], enabling more effective fusion of multi-scale pavement defect features. As illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, a Transformer encoder is first applied to extract multi-level feature maps and align their output channels. For the <italic>i</italic>-th feature map, higher-level features (containing richer semantic information) and lower-level features (capturing finer details) are explicitly injected via simple Hadamard product operations, thereby enhancing both semantic and detailed representations of the <italic>i</italic>-th feature layer. The refined features are subsequently passed into a decoder for resolution reconstruction and segmentation.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The structure of SDI module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-4.tif"/>
</fig>
<p>Given the input feature map <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>I</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> (where <italic>H</italic>, <italic>W</italic>, and <italic>C</italic> denote the height, width, and channel number, respectively), the Transformer encoder generates M feature maps <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>M</mml:mi></mml:math></inline-formula>, each integrated with both channel and spatial attention, enabling local and global information representation across layers.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">(</mml:mo></mml:mrow></mml:mstyle><mml:msubsup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">)</mml:mo></mml:mrow></mml:mstyle></mml:math></disp-formula>where <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the feature map of the <italic>i</italic>-th layer after <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is fused with attention, <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msubsup><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represent the channel and spatial attention parameters, respectively. 
Subsequently, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> will be adjusted to have <italic>c</italic> channels through 1 &#x00D7; 1 convolution to obtain <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Furthermore, the feature map size is adjusted at each <italic>j</italic>-th layer to be sent to the decoder, calculated as follows:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>i</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mspace width="2em" 
/><mml:mn>1</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>M</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x1D4B0;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mi>i</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:mi>&#x02110;</mml:mi></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:mi>&#x1D4B0;</mml:mi></mml:mrow></mml:math></inline-formula> represent adaptive average pooling, identity mapping, and bilinear interpolation, respectively. Next, a 3 &#x00D7; 3 convolution is applied to each resized feature map <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> to facilitate smoother integration of curved and multi-scale pavement defect features.
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the 3 &#x00D7; 3 smoothing convolution applied to the resized map. The element-wise Hadamard product is then applied to all resized feature maps to enrich the <italic>i</italic>-th-level features with additional semantic information and finer details.
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>H</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the Hadamard product. Finally, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is assigned to the <italic>i</italic>-th decoder for further resolution reconstruction and segmentation, producing output results at three scales: large, medium, and small.</p>
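The resize-and-fuse logic of Eqs. (12) and (14) can be sketched for single-channel maps as follows. Nearest-neighbour upsampling stands in for the bilinear interpolation of the paper, the smoothing convolution of Eq. (13) is omitted, and all names are illustrative:

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """D in Eq. (12): adaptive average pooling to (out_h, out_w)."""
    H, W = x.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h0, h1 = i * H // out_h, (i + 1) * H // out_h
            w0, w1 = j * W // out_w, (j + 1) * W // out_w
            out[i, j] = x[h0:h1, w0:w1].mean()
    return out

def resize_to_level(f_j, H_i, W_i):
    """Eq. (12): pool when the source level is finer (j < i), identity
    when sizes match (j = i), upsample when it is coarser (j > i).
    Assumes uniform scaling across both spatial dimensions."""
    H, W = f_j.shape
    if (H, W) == (H_i, W_i):
        return f_j                                   # I: identity
    if H > H_i:
        return adaptive_avg_pool(f_j, H_i, W_i)      # D: downsample
    ri = np.arange(H_i) * H // H_i                   # U: upsample
    rj = np.arange(W_i) * W // W_i                   # (nearest-neighbour
    return f_j[np.ix_(ri, rj)]                       #  stands in for bilinear)

def sdi_fuse(resized):
    """Eq. (14): element-wise (Hadamard) product over all resized maps."""
    out = resized[0].copy()
    for f in resized[1:]:
        out = out * f
    return out
```

The Hadamard product keeps the fusion parameter-free: each level simply gates the others element-wise, so semantic and detail cues reinforce each other at matching spatial positions.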
<p>In addition, inspired by the state-of-the-art object detection advancements in YOLOv11, the C3k2 module, which integrates deformable convolutions and bottleneck enhancements, is incorporated into the neck of the ARPD algorithm to better address multi-scale pavement defect detection. As shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, C3k2 employs CBS blocks with deformable convolutions of various kernel sizes (e.g., 3 &#x00D7; 3, 5 &#x00D7; 5), allowing the model to extract features across multiple scales and better capture complex spatial characteristics.</p>
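As a rough structural sketch of the multi-kernel idea in C3k2 (deformable offsets and learned weights omitted; plain averaging kernels stand in for the CBS convolutions, so this illustrates only the parallel-branch layout, not the actual module):

```python
import numpy as np

def conv2d_same(x, k):
    """Same-padding 2-D convolution with a k x k averaging kernel,
    a stand-in for a learned CBS convolution."""
    pad = k // 2
    H, W = x.shape
    xp = np.pad(x, pad)
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def c3k2_sketch(x):
    """Multi-kernel branch structure: a skip path plus 3x3 and 5x5
    branches, concatenated channel-wise."""
    return np.stack([x, conv2d_same(x, 3), conv2d_same(x, 5)])
```

The differing kernel sizes give the concatenated output mixed receptive fields, which is the property the neck exploits for multi-scale defect features.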
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>The C3k2 module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-5.tif"/>
</fig>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Loss Function</title>
<p>The ARPD algorithm employs a hybrid loss function to regulate gradient updates, consisting of a classification loss <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and a regression loss <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, aiming to improve detection accuracy by simultaneously optimizing object classification and bounding box regression [<xref ref-type="bibr" rid="ref-17">17</xref>]. The loss is calculated as follows:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denote the weights for classification and regression loss, respectively.</p>
<p>The classification loss typically adopts Binary Cross-Entropy (BCE), which measures the difference between the predicted class probability <italic>p</italic><sub><italic>i</italic></sub> and the ground truth label <italic>y</italic><sub><italic>i</italic></sub> for each predicted box [<xref ref-type="bibr" rid="ref-16">16</xref>]. The formula for classification loss is expressed as:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
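Eq. (16) translates directly into code; the small epsilon guarding the logarithms is an implementation detail, not part of the formula:

```python
import math

def bce_loss(y, p, eps=1e-12):
    """Eq. (16): binary cross-entropy summed over predicted boxes.

    y: ground-truth labels y_i in {0, 1}; p: predicted class
    probabilities p_i in [0, 1].
    """
    return -sum(yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
                for yi, pi in zip(y, p))
```

A confident correct prediction contributes essentially zero loss, while an uninformative p_i = 0.5 contributes log 2 per box.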
<p>The regression loss is based on Complete Intersection over Union (CIoU), which evaluates the discrepancy between the predicted and ground truth bounding boxes in terms of center point coordinates, width, and height. For each predicted box, the IoU is calculated to derive the loss. The CIoU loss can be formulated as:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x2229;</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x222A;</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>&#x03C1;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:msup><mml:mi>c</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mfrac><mml:mo>+</mml:mo><mml:mi>&#x03D1;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>&#x03B2;</mml:mi></mml:math></disp-formula>where <italic>&#x03C1;</italic>(<inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) is the Euclidean distance between the centers of the predicted and ground truth boxes, <italic>c</italic> is the diagonal length of the minimum enclosing box covering both predicted and ground truth boxes, and <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>&#x03D1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> represent the aspect ratio consistency term and the trade-off parameter, respectively.</p>
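A self-contained sketch of Eqs. (17) and (18) for axis-aligned boxes, using the standard CIoU aspect-ratio term in place of the paper's &#x03D1;&#x22C5;&#x03B2; product (the (x1, y1, x2, y2) box format and helper names are assumptions):

```python
import math

def ciou_loss(box, gt):
    """CIoU regression loss, Eqs. (17)-(18), for boxes (x1, y1, x2, y2)."""
    # Eq. (17): intersection over union
    x1, y1 = max(box[0], gt[0]), max(box[1], gt[1])
    x2, y2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter)

    # squared centre distance rho^2 and enclosing-box diagonal c^2
    cb = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
    cg = ((gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2)
    rho2 = (cb[0] - cg[0]) ** 2 + (cb[1] - cg[1]) ** 2
    ex1, ey1 = min(box[0], gt[0]), min(box[1], gt[1])
    ex2, ey2 = max(box[2], gt[2]), max(box[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

    # aspect-ratio consistency v and trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
                              - math.atan((box[2] - box[0]) / (box[3] - box[1]))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v  # Eq. (18)
```

Identical boxes yield zero loss; disjoint boxes are still penalized through the centre-distance term, which is the key advantage of CIoU over plain IoU for regression.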
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Experimental Results</title>
<p>To validate the effectiveness of the proposed model, ablation studies are first conducted on the backbone and neck structure. Subsequently, the model is compared with current state-of-the-art (SOTA) object detection methods on the same dataset. Finally, visualized results on the test set are presented to further verify the model&#x2019;s performance.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Dataset and Training Details</title>
<p>As described above, the experimental data are sourced from the publicly available UAV-PDD2023 dataset [<xref ref-type="bibr" rid="ref-1">1</xref>], captured by a downward-facing camera mounted on a UAV flying steadily above road surfaces. A total of 2000 images are randomly split into training, validation, and test sets in a 7:2:1 ratio, ensuring no image overlap across subsets. The dataset is categorized into six defect types based on visual characteristics: longitudinal cracks (LC), transverse cracks (TC), oblique cracks (OC), alligator cracks (AC), potholes (PH), and repairs (RP). The causes and potential hazards of each type are detailed in <xref ref-type="table" rid="table-2">Table 2</xref>. To further assess generalization and real-time capabilities, inference is also performed on the RDD2022 dataset [<xref ref-type="bibr" rid="ref-31">31</xref>] and the test set.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Causes and hazards of various pavement surface defects</title>
</caption>
<table>
<colgroup>
<col/>
<col width="60mm"/>
<col width="80mm"/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>Cause</th>
<th>Hazard</th>
</tr>
</thead>
<tbody>
<tr>
<td>TC</td>
<td>Traffic load</td>
<td>Further crack extension</td>
</tr>
<tr>
<td>LC</td>
<td>Traffic load</td>
<td>Further crack extension</td>
</tr>
<tr>
<td>OC</td>
<td>Uneven pavement settlement</td>
<td>Accelerated pavement fatigue</td>
</tr>
<tr>
<td>AC</td>
<td>Pavement aging and water seepage</td>
<td>Reduced pavement strength</td>
</tr>
<tr>
<td>PH</td>
<td>Traffic load</td>
<td>Traffic safety hazards, water accumulation</td>
</tr>
<tr>
<td>RP</td>
<td>Pavement aging and damage</td>
<td>Formation of new cracks</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The ARPD model is trained, validated, and tested on an Ubuntu 22.04 desktop equipped with an Intel Core i7-12700 CPU and an NVIDIA GeForce RTX 3060 GPU. Key training parameters include a learning rate of 0.01 for balancing convergence speed and stability, and a weight decay of 0.0005 to prevent overfitting. A fixed momentum value of 0.937 is used to enhance gradient descent efficiency. The model is trained for 300 epochs with a batch size of 16 to ensure stability and thorough convergence.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Evaluation Metrics</title>
<p>In the comparative experiments, precision (<italic>P</italic>), recall (<italic>R</italic>), and mean Average Precision (<italic>mAP</italic>) are used as evaluation metrics. Precision measures the proportion of correctly identified instances among all predicted positive instances, while recall evaluates the model&#x2019;s ability to correctly classify relevant instances. The definitions are as follows:
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mn>100</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></disp-formula>
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mn>100</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></disp-formula>
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>R</mml:mi></mml:math></disp-formula><disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mi>m</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mi>A</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mfrac></mml:math></disp-formula>where <italic>TP</italic>, <italic>FP</italic>, and <italic>FN</italic> represent true positives (positive samples correctly predicted as positive), false positives (negative samples incorrectly predicted as positive), and false negatives (positive samples incorrectly predicted as negative), respectively. The <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the total number of categories.</p>
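Eqs. (19)&#x2013;(22) can be checked with a few lines of Python; the right-rectangle integration of the P(R) curve below is one common numerical approximation of Eq. (21):

```python
def precision_recall(tp, fp, fn):
    """Eqs. (19)-(20): precision and recall from TP, FP, FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r

def average_precision(recalls, precisions):
    """Eq. (21): area under the P(R) curve, right-rectangle rule.
    Assumes recalls are sorted in ascending order."""
    ap = 0.0
    for k in range(1, len(recalls)):
        ap += (recalls[k] - recalls[k - 1]) * precisions[k]
    return ap

def mean_ap(ap_per_class):
    """Eq. (22): mean AP over the n_c defect categories."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 8 true positives with 2 false positives and 2 false negatives give P = R = 80%, and averaging per-class APs of 0.80 and 0.90 yields an mAP of 0.85.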
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Ablation Studies</title>
<p>To evaluate the effectiveness of the proposed ARPD model&#x2019;s backbone and neck configurations, precision-recall (PR) curves are used to illustrate the balance between P and R. <xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <xref ref-type="fig" rid="fig-7">7</xref> present the ablation study results for the backbone and neck modules, respectively.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Backbone ablation experiment of ARPD algorithm equipped with the same SDI-based neck. (<bold>a</bold>&#x2013;<bold>g</bold>) respectively represent PR curves for AC, LC, OC, PH, RP, TC, and the all-classes mAP</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-6.tif"/>
</fig><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Neck ablation experiment of ARPD algorithm equipped with the same S<sup>3</sup>M-based backbone. (<bold>a</bold>&#x2013;<bold>g</bold>) respectively represent PR curves for AC, LC, OC, PH, RP, TC, and the all-classes mAP</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-7.tif"/>
</fig>
<p><bold>Backbone Ablation Study:</bold> Using a consistent C3k2-based neck structure across all configurations, different backbones are integrated into ARPD (including MobileNetV4 [<xref ref-type="bibr" rid="ref-26">26</xref>], U-NetV2 [<xref ref-type="bibr" rid="ref-27">27</xref>], DarkNet53 [<xref ref-type="bibr" rid="ref-17">17</xref>], and the proposed S<sup>3</sup>M) for comparison. As shown by the pink curve in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, the S<sup>3</sup>M backbone achieves the best performance in extracting pavement defects, reaching the highest mAP of 80.1%. Notably, for AC and RP classes, the S<sup>3</sup>M-based model achieves over 90% precision, demonstrating the strong adaptability of S<sup>3</sup>M to diverse road surface damage types.</p>

<p><bold>Neck Ablation Study:</bold> Building on the confirmed superiority of the S<sup>3</sup>M backbone, additional experiments are conducted by integrating various neck structures into ARPD while keeping the S<sup>3</sup>M backbone fixed. The tested neck modules include C3 [<xref ref-type="bibr" rid="ref-16">16</xref>], C2f [<xref ref-type="bibr" rid="ref-17">17</xref>], C3k2 [<xref ref-type="bibr" rid="ref-17">17</xref>], and the proposed SDI. As illustrated by the pink curve in <xref ref-type="fig" rid="fig-7">Fig. 7g</xref>, the ARPD model equipped with both the S<sup>3</sup>M backbone and SDI neck structure achieves the best overall performance, with a mAP of 86.1%. This also reflects a significant improvement compared to the best result from the backbone ablation study (80.1%), confirming the effectiveness of SDI in multi-scale feature integration for pavement defect detection.</p>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Comparative Experiments</title>
<p>Based on the preceding ablation studies, the ARPD algorithm has demonstrated superior performance on UAV-based pavement defect datasets. To further validate its effectiveness, ARPD is compared against several state-of-the-art object detection models, including the YOLO series [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>], MobileNet series [<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>], DETR series [<xref ref-type="bibr" rid="ref-36">36</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>] and Mamba-based models [<xref ref-type="bibr" rid="ref-33">33</xref>,<xref ref-type="bibr" rid="ref-34">34</xref>], under the same dataset conditions.</p>
<p>As presented in <xref ref-type="table" rid="table-3">Table 3</xref>, ARPD ranks first among all models with a mAP of 86.1%, followed by YOLO11 (83.1%) in second place and YOLOv8 (78.4%) in third. The advantage of YOLO11 over YOLOv8 stems primarily from its innovative architectural components, such as C2PSA and C3k2. Although the latest YOLO12 introduces A2C2f for enhanced attention and hierarchical training, it still struggles with sparse and multi-scale pavement defects and is even outperformed by YOLOv8 in our experiments. Mamba-based architectures, although strong at image classification and long-range modeling, typically require additional modules for fine-grained feature extraction; the complex characteristics of pavement defects significantly limit the effectiveness of state-space models in this context, indicating room for improvement. Moreover, while MobileNet variants are commonly adopted for lightweight deployment, they perform poorly (&#x003C;65% mAP) in pavement scenarios characterized by minimal semantic information and complex visual backgrounds. The DETR series, while offering real-time end-to-end detection capabilities, continues to face significant challenges in balancing inference speed and detection accuracy.</p>
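The mAP values compared above follow the standard construction: a per-class average precision (area under the precision-recall curve), averaged over all defect classes. A minimal sketch of that computation, assuming a single IoU threshold and the usual monotone precision envelope (not the authors' exact evaluation code):

```python
def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, using the
    monotone (interpolated) precision envelope."""
    # Pad the curve so integration covers recall in [0, 1].
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (the envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(per_class_curves):
    """mAP = mean of per-class APs; curves map class -> (recalls, precisions)."""
    aps = [average_precision(r, p) for r, p in per_class_curves.values()]
    return sum(aps) / len(aps)
```

A class detected with flat precision 0.5 up to full recall contributes an AP of 0.5; averaging such per-class APs over the six defect classes yields the mAP column in the table.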
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison experimental results between ARPD and advanced models</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="2">LC <bold>(%)</bold></th>
<th colspan="2">TC <bold>(%)</bold></th>
<th colspan="2">AC <bold>(%)</bold></th>
<th colspan="2">OC <bold>(%)</bold></th>
<th colspan="2">RP <bold>(%)</bold></th>
<th colspan="2">PH <bold>(%)</bold></th>
<th rowspan="2">mAP</th>
</tr>
<tr>
<th>P</th>
<th>R</th>
<th>P</th>
<th>R</th>
<th>P</th>
<th>R</th>
<th>P</th>
<th>R</th>
<th>P</th>
<th>R</th>
<th>P</th>
<th>R</th>
</tr>
</thead>
<tbody>
<tr>
<td>MobileNetV3</td>
<td>68.2</td>
<td>40.4</td>
<td>67.5</td>
<td>44.9</td>
<td>48.9</td>
<td>39.7</td>
<td>64.7</td>
<td>50</td>
<td>59.7</td>
<td>41.1</td>
<td>69.9</td>
<td>45.8</td>
<td>63.1</td>
</tr>
<tr>
<td>MobileNetV4</td>
<td>63.8</td>
<td>58.3</td>
<td>51.4</td>
<td>37.5</td>
<td>65.3</td>
<td>52</td>
<td>57.8</td>
<td>43.3</td>
<td>53.8</td>
<td>41.5</td>
<td>61.2</td>
<td>37.5</td>
<td>58.9</td>
</tr>
<tr>
<td>VMamba</td>
<td>69.5</td>
<td>56.8</td>
<td>54.7</td>
<td>50.6</td>
<td>62.8</td>
<td>52</td>
<td>63.9</td>
<td>53.1</td>
<td>78.6</td>
<td>68.6</td>
<td>76.9</td>
<td>64.5</td>
<td>67.7</td>
</tr>
<tr>
<td>MambaYOLO</td>
<td>76.5</td>
<td>65.5</td>
<td>78.1</td>
<td>61.9</td>
<td>67.5</td>
<td>51.7</td>
<td>69.3</td>
<td>59.7</td>
<td>74.1</td>
<td>65</td>
<td>80.1</td>
<td>71.2</td>
<td>74.3</td>
</tr>
<tr>
<td>YOLOv5</td>
<td>72.5</td>
<td>61</td>
<td>84.1</td>
<td>76.2</td>
<td>72.9</td>
<td>64.3</td>
<td>73.7</td>
<td>65.2</td>
<td>77.9</td>
<td>70.2</td>
<td>83.6</td>
<td>77</td>
<td>77.4</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>77.3</td>
<td>76.9</td>
<td>80.2</td>
<td>72</td>
<td>75.4</td>
<td>75.8</td>
<td>85.1</td>
<td>72</td>
<td>80.5</td>
<td>83.8</td>
<td>72.2</td>
<td>63.8</td>
<td>78.4</td>
</tr>
<tr>
<td>YOLO11</td>
<td>85.9</td>
<td>75.9</td>
<td>85</td>
<td>75.1</td>
<td>83.1</td>
<td>78.4</td>
<td>85.8</td>
<td>71.9</td>
<td>80.9</td>
<td>81.7</td>
<td>78.4</td>
<td>70.9</td>
<td>83.1</td>
</tr>
<tr>
<td>YOLO12</td>
<td>80.0</td>
<td>73.5</td>
<td>79.7</td>
<td>69.1</td>
<td>83.3</td>
<td>76.5</td>
<td>70.1</td>
<td>64.5</td>
<td>84.4</td>
<td>83.8</td>
<td>71.3</td>
<td>65.1</td>
<td>78.1</td>
</tr>
<tr>
<td>RT-DETR</td>
<td>59.3</td>
<td>53.2</td>
<td>62.6</td>
<td>54.9</td>
<td>57.5</td>
<td>62.7</td>
<td>40.4</td>
<td>37.4</td>
<td>68.3</td>
<td>66.2</td>
<td>41.1</td>
<td>41.9</td>
<td>54.8</td>
</tr>
<tr>
<td>CO-DETR</td>
<td>71.1</td>
<td>63.0</td>
<td>71.9</td>
<td>61.0</td>
<td>77.1</td>
<td>69.7</td>
<td>57.8</td>
<td>50.5</td>
<td>77.7</td>
<td>70.6</td>
<td>60.3</td>
<td>47.1</td>
<td>69.3</td>
</tr>
<tr>
<td><bold>ARPD</bold></td>
<td><bold>87.3</bold></td>
<td><bold>79</bold></td>
<td><bold>88.9</bold></td>
<td><bold>77.3</bold></td>
<td><bold>91.1</bold></td>
<td><bold>85.6</bold></td>
<td><bold>81.5</bold></td>
<td><bold>72.5</bold></td>
<td><bold>89.1</bold></td>
<td><bold>78.2</bold></td>
<td><bold>78.2</bold></td>
<td><bold>68.9</bold></td>
<td><bold>86.1</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn1" fn-type="other">
<p>Note: <italic>P</italic> and <italic>R</italic> represent Precision and Recall, respectively.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>In summary, the proposed combination of S<sup>3</sup>M and SDI achieves the best detection accuracy among all evaluated models in UAV-based pavement defect recognition, making it a reliable component of the ARPD algorithm for automatic road damage inspection.</p>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Visualization-Based Validation</title>
<p>Visualization results based on the test set are shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, where the first and second rows show the original UAV images and the corresponding ARPD predictions, respectively. The proposed algorithm demonstrates accurate localization of elongated pavement defects, including fragmented and irregularly distributed oblique cracks. Furthermore, even when multiple types of defects with similar visual features appear simultaneously, ARPD maintains precise detection performance. These results confirm that ARPD exhibits strong adaptability and robustness even in scenarios with minimal pixel-level defect presence.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Visual verification of the ARPD algorithm on the test set</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-8.tif"/>
</fig>
<p>Additionally, a separate visualization experiment is conducted on 200 randomly selected images from the public RDD-2022 dataset. As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, images captured from an in-vehicle perspective often lead to longitudinal cracks (LC) being misidentified as oblique cracks (OC), as indicated by the red arrows. Despite this visual ambiguity, ARPD successfully detects all defect instances in each image, further demonstrating its generalization capability and robustness, and highlighting its suitability for real-world road inspection tasks involving diverse defect types and imaging conditions.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Visual verification of the ARPD algorithm on RDD-2022</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="SDHM_68987-fig-9.tif"/>
</fig>
<p>To further evaluate the reproducibility and generalization capability of the proposed algorithm, additional comparative experiments are conducted on publicly available UAV-captured pavement defect datasets, specifically Drone-based IRP [<xref ref-type="bibr" rid="ref-38">38</xref>] and HighRPD [<xref ref-type="bibr" rid="ref-39">39</xref>]. As shown in <xref ref-type="table" rid="table-4">Table 4</xref>, all benchmark models are evaluated under identical configurations, and our method consistently achieves the highest accuracy. Notably, this experiment also assesses training FPS. Although RT-DETR has been reported to outperform the YOLO series in accuracy when computational resources are ample, it lacks mobile-device compatibility and is surpassed by MambaYOLO in processing speed. Despite being a recent state-of-the-art detector, YOLO11 shows performance limitations in complex pavement conditions. Overall, these results highlight the strong generalization and practical deployment potential of the proposed algorithm in real-world UAV-based pavement defect inspection.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Training experimental results of ARPD and advanced models on different public datasets</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Methods</th>
<th colspan="3">Drone-based IRP</th>
<th colspan="3">HighRPD</th>
</tr>
<tr>
<th>mAP (B)%</th>
<th>mAP (M)%</th>
<th>FPS</th>
<th>mAP (B)%</th>
<th>mAP (M)%</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT-DETR</td>
<td>32.9</td>
<td>25.7</td>
<td>69</td>
<td>25.6</td>
<td>22.7</td>
<td>60</td>
</tr>
<tr>
<td>YOLO11</td>
<td>66.3</td>
<td>61.5</td>
<td>112</td>
<td>51.1</td>
<td>48.5</td>
<td>101</td>
</tr>
<tr>
<td>MambaYOLO</td>
<td>41.1</td>
<td>35.8</td>
<td>84</td>
<td>33.6</td>
<td>29.4</td>
<td>72</td>
</tr>
<tr>
<td>ARPD</td>
<td>75.8</td>
<td>72.4</td>
<td>130</td>
<td>68.4</td>
<td>66.0</td>
<td>112</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-4fn1" fn-type="other">
<p>Note: <italic>B</italic> and <italic>M</italic> represent the <italic>mAP</italic> corresponding to the bounding box and mask, respectively.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Conclusion</title>
<p>This paper presents an Automatic Recognition of Pavement Defect (ARPD) algorithm, integrating a Selective State Space Model (S<sup>3</sup>M) and Semantic Detail Infusion (SDI), to address the challenges of recognizing multi-type road surface defects under limited semantic cues and large variations in object scale. A UAV-based dataset, UAV-PDD2023, was collected to provide full-coverage overhead images of road surfaces. The S<sup>3</sup>M-based backbone is embedded into ARPD to selectively model the most relevant temporal dependencies in long-sequence feature extraction. Considering that conventional neck modules rely heavily on pyramid structures and often fail to handle multi-scale features effectively, the SDI module is incorporated to enhance semantic and fine-detail fusion between shallow and deep features. This allows the algorithm to emphasize relevant damage features while suppressing noise, thereby improving detection accuracy.</p>
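The selective modeling described above can be illustrated with a minimal 1-D state-space recurrence in which the discretisation step depends on the input itself, so the scan retains history for informative tokens and forgets quickly elsewhere. This is a hedged sketch with hypothetical scalar parameters (w_delta, w_b, w_c, a), not the authors' trained S<sup>3</sup>M:

```python
import math

def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0, a=-1.0):
    """Minimal 1-D selective state-space recurrence: the step size
    (and hence how much history is kept) is a function of the input,
    which is what lets the scan emphasise relevant features and
    suppress noise along a long sequence."""
    h, ys = 0.0, []
    for x in xs:
        # Input-dependent discretisation step (softplus keeps it > 0).
        delta = math.log1p(math.exp(w_delta * x))
        a_bar = math.exp(delta * a)           # decay of the hidden state
        b_bar = (a_bar - 1.0) / a * w_b       # zero-order-hold input gain
        h = a_bar * h + b_bar * x             # selective state update
        ys.append(w_c * h)                    # readout
    return ys
```

Large inputs enlarge the step and thus refresh the state aggressively, while near-zero inputs barely perturb it; in the full model the same mechanism operates per channel over flattened image-feature sequences.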
<p>The model&#x2019;s performance is further validated using both a held-out test set and an external benchmark dataset (RDD-2022). Experimental results indicate that ARPD surpasses state-of-the-art models such as YOLO11 in both accuracy and generalization. However, the algorithm has not yet been deployed or evaluated on embedded hardware platforms, and its real-time inference speed remains untested. Future work will focus on:
<list list-type="simple">
<list-item><label>(1)</label><p>Building a UAV-based pavement surface defect dataset that includes both asphalt and concrete pavements;</p></list-item>
<list-item><label>(2)</label><p>Developing a lightweight automatic road defect recognition algorithm for embedded deployment, with high precision, low energy consumption, and mobile-friendly characteristics.</p></list-item>
</list></p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported in part by the Technical Service for the Development and Application of an Intelligent Visual Management Platform for Expressway Construction Progress Based on BIM Technology (grant NO. JKYZLX-2023-09), in part by the Technical Service for the Development of an Early Warning Model in the Research and Application of Key Technologies for Tunnel Operation Safety Monitoring and Early Warning Based on Digital Twin (grant NO. JK-S02-ZNGS-202412-JISHU-FA-0035), sponsored by Yunnan Transportation Science Research Institute Co., Ltd.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization, Hongcheng Zhao and Tong Yang; methodology, Hongcheng Zhao, Tong Yang, Yihui Hu and Fengxiang Guo; software, Hongcheng Zhao and Tong Yang; formal analysis, Hongcheng Zhao, Tong Yang and Yihui Hu; investigation, Hongcheng Zhao, Tong Yang, Yihui Hu and Fengxiang Guo; resources, Fengxiang Guo; data curation, Yihui Hu and Fengxiang Guo; writing&#x2014;original draft preparation, Hongcheng Zhao and Tong Yang; writing&#x2014;review and editing, Hongcheng Zhao and Tong Yang; visualization, Hongcheng Zhao; supervision, Hongcheng Zhao and Fengxiang Guo; project administration, Fengxiang Guo; funding acquisition, Hongcheng Zhao and Fengxiang Guo. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Data available on request from the author [Tong Yang, yangt@stu.kust.edu.cn].</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>UAV-PDD2023: a benchmark dataset for pavement distress detection based on UAV images</article-title>. <source>Data Brief</source>. <year>2023</year>;<volume>51</volume>(<issue>12</issue>):<fpage>109692</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.dib.2023.109692</pub-id>; <pub-id pub-id-type="pmid">38020429</pub-id></mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>F</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Pavement crack detection based on transformer network</article-title>. <source>Autom Constr</source>. <year>2023</year>;<volume>145</volume>(<issue>2</issue>):<fpage>104646</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2022.104646</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alkhedher</surname> <given-names>M</given-names></string-name>, <string-name><surname>Alsit</surname> <given-names>A</given-names></string-name>, <string-name><surname>Alhalabi</surname> <given-names>M</given-names></string-name>, <string-name><surname>AlKheder</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gad</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ghazal</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Novel pavement crack detection sensor using coordinated mobile robots</article-title>. <source>Transp Res Part C Emerg Technol</source>. <year>2025</year>;<volume>172</volume>(<issue>1386</issue>):<fpage>105021</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.trc.2025.105021</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guerrieri</surname> <given-names>M</given-names></string-name>, <string-name><surname>Parla</surname> <given-names>G</given-names></string-name>, <string-name><surname>Khanmohamadi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Neduzha</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Asphalt pavement damage detection through deep learning technique and cost-effective equipment: a case study in urban roads crossed by tramway lines</article-title>. <source>Infrastructures</source>. <year>2024</year>;<volume>9</volume>(<issue>2</issue>):<fpage>34</fpage>. doi:<pub-id pub-id-type="doi">10.3390/infrastructures9020034</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Askarzadeh</surname> <given-names>T</given-names></string-name>, <string-name><surname>Bridgelall</surname> <given-names>R</given-names></string-name>, <string-name><surname>Tolliver</surname> <given-names>DD</given-names></string-name></person-group>. <article-title>Drones for road condition monitoring: applications and benefits</article-title>. <source>J Transp Eng Part B Pavements</source>. <year>2025</year>;<volume>151</volume>(<issue>1</issue>):<fpage>04024055</fpage>. doi:<pub-id pub-id-type="doi">10.1061/jpeodx.pveng-1559</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Matarneh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Elghaish</surname> <given-names>F</given-names></string-name>, <string-name><surname>Al-Ghraibah</surname> <given-names>A</given-names></string-name>, <string-name><surname>Abdellatef</surname> <given-names>E</given-names></string-name>, <string-name><surname>Edwards</surname> <given-names>DJ</given-names></string-name></person-group>. <article-title>An automatic image processing based on Hough transform algorithm for pavement crack detection and classification</article-title>. <source>Smart Sustain Built Environ</source>. <year>2025</year>;<volume>14</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1108/sasbe-01-2023-0004</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tello-Cifuentes</surname> <given-names>L</given-names></string-name>, <string-name><surname>Marulanda</surname> <given-names>J</given-names></string-name>, <string-name><surname>Thomson</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Detection and classification of pavement damages using wavelet scattering transform, fractal dimension by box-counting method and machine learning algorithms</article-title>. <source>Road Mater Pavement Des</source>. <year>2024</year>;<volume>25</volume>(<issue>3</issue>):<fpage>566</fpage>&#x2013;<lpage>84</lpage>. doi:<pub-id pub-id-type="doi">10.1080/14680629.2023.2219338</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chou</surname> <given-names>JS</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>CY</given-names></string-name></person-group>. <article-title>Optimized lightweight edge computing platform for UAV-assisted detection of concrete deterioration beneath bridge decks</article-title>. <source>J Comput Civ Eng</source>. <year>2025</year>;<volume>39</volume>(<issue>1</issue>):<fpage>04024045</fpage>. doi:<pub-id pub-id-type="doi">10.1061/jccee5.cpeng-5905</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Si</surname> <given-names>J</given-names></string-name>, <string-name><surname>Si</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Integrative approach for high-speed road surface monitoring: a convergence of robotics, edge computing, and advanced object detection</article-title>. <source>Appl Sci</source>. <year>2024</year>;<volume>14</volume>(<issue>5</issue>):<fpage>1868</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app14051868</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>CNN-based network with multi-scale context feature and attention mechanism for automatic pavement crack segmentation</article-title>. <source>Autom Constr</source>. <year>2024</year>;<volume>164</volume>(<issue>4</issue>):<fpage>105482</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2024.105482</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>CNN-based pavement defects detection using grey and depth images</article-title>. <source>Autom Constr</source>. <year>2024</year>;<volume>158</volume>(<issue>2</issue>):<fpage>105192</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2023.105192</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alshawabkeh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>D</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>L</given-names></string-name></person-group>. <article-title>A hybrid approach for pavement crack detection using mask R-CNN and vision transformer model</article-title>. <source>Comput Mater Contin</source>. <year>2025</year>;<volume>82</volume>(<issue>1</issue>):<fpage>561</fpage>&#x2013;<lpage>77</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.057213</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2016</year>;<volume>39</volume>(<issue>6</issue>):<fpage>1137</fpage>&#x2013;<lpage>49</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2016.2577031</pub-id>; <pub-id pub-id-type="pmid">27295650</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Gkioxari</surname> <given-names>G</given-names></string-name>, <string-name><surname>Doll&#x00E1;r</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Mask R-CNN</article-title>. In: <conf-name>Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV)</conf-name>; <year>2017 Oct 22&#x2013;29</year>;<publisher-loc>Venice, Italy</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cheng</surname> <given-names>B</given-names></string-name>, <string-name><surname>Misra</surname> <given-names>I</given-names></string-name>, <string-name><surname>Schwing</surname> <given-names>AG</given-names></string-name>, <string-name><surname>Kirillov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Girdhar</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Masked-attention mask transformer for universal image segmentation</article-title>. In: <conf-name>2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2022 Jun 18&#x2013;24</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00135</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jocher</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chaurasia</surname> <given-names>A</given-names></string-name>, <string-name><surname>Stoken</surname> <given-names>A</given-names></string-name>, <string-name><surname>Borovec</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kwon</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Michael</surname> <given-names>K</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>YOLOv5 by Ultralytics (Version 7.0) [Internet]. [cited 2025 Jul 17]</article-title>. Available from: <pub-id pub-id-type="doi">10.5281/zenodo.3908559</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jocher</surname> <given-names>G</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chaurasia</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Ultralytics YOLO (Version 8.0.0) [Internet]. [cited 2025 Jul 17]</article-title>. Available from: <ext-link ext-link-type="uri" xlink:href="https://github.com/ultralytics/ultralytics">https://github.com/ultralytics/ultralytics</ext-link>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ye</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Doermann</surname> <given-names>D</given-names></string-name></person-group>. <article-title>YOLOv12: attention-centric real-time object detectors</article-title>. <comment>arXiv:2502.12524. 2025</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2502.12524</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Unmanned aerial vehicle (UAV)-based pavement image stitching without occlusion, crack semantic segmentation, and quantification</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2024</year>;<volume>25</volume>(<issue>11</issue>):<fpage>17038</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2024.3424525</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tse</surname> <given-names>KW</given-names></string-name>, <string-name><surname>Pi</surname> <given-names>R</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>CY</given-names></string-name></person-group>. <article-title>Advancing UAV-based inspection system: the USSA-net segmentation approach to crack quantification</article-title>. <source>IEEE Trans Instrum Meas</source>. <year>2024</year>;<volume>73</volume>:<fpage>2522914</fpage>. doi:<pub-id pub-id-type="doi">10.1109/TIM.2024.3418073</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Fine-grained damage detection of cement concrete pavement based on UAV remote sensing image segmentation and stitching</article-title>. <source>Measurement</source>. <year>2024</year>;<volume>226</volume>(<issue>3</issue>):<fpage>113844</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.measurement.2023.113844</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dugalam</surname> <given-names>R</given-names></string-name>, <string-name><surname>Prakash</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Development of a random forest based algorithm for road health monitoring</article-title>. <source>Expert Syst Appl</source>. <year>2024</year>;<volume>251</volume>(<issue>1</issue>):<fpage>123940</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.123940</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>CNN-transformer hybrid network for concrete dam crack patrol inspection</article-title>. <source>Autom Constr</source>. <year>2024</year>;<volume>163</volume>(<issue>1</issue>):<fpage>105440</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2024.105440</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>C</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>F</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>S</given-names></string-name></person-group>. <article-title>SCSegamba: lightweight structure-aware vision mamba for crack segmentation in structures</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2025)</conf-name>; <year>2025 Jun 10&#x2013;17</year>; <publisher-loc>Nashville, TN, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Beyer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kolesnikov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Weissenborn</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Unterthiner</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>An image is worth 16 &#x00D7; 16 words: transformers for image recognition at scale</article-title>. <comment>arXiv:2010.11929. 2020</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qin</surname> <given-names>D</given-names></string-name>, <string-name><surname>Leichner</surname> <given-names>C</given-names></string-name>, <string-name><surname>Delakis</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fornoni</surname> <given-names>M</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>F</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>MobileNetV4: universal models for the mobile ecosystem</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2024</conf-name>. <year>2024 Sep 29&#x2013;Oct 4</year>; <publisher-loc>Milan, Italy</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-73661-2_5</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Peng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sonka</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>DZ</given-names></string-name></person-group>. <article-title>U-net v2: rethinking the skip connections of U-net for medical image segmentation</article-title>. <comment>arXiv:2311.17791. 2023.</comment> doi:<pub-id pub-id-type="doi">10.48550/arXiv.2311.17791</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Gu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Dao</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Mamba: linear-time sequence modeling with selective state spaces</article-title>. <comment>arXiv:2312.00752. 2023</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2312.00752</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Enhancing pixel-level crack segmentation with visual mamba and convolutional networks</article-title>. <source>Autom Constr</source>. <year>2024</year>;<volume>168</volume>(<issue>1</issue>):<fpage>105770</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2024.105770</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>L</given-names></string-name></person-group>. <article-title>MSCrackMamba: leveraging vision mamba for crack detection in fused multispectral imagery</article-title>. <comment>arXiv:2412.06211. 2024.</comment> doi:<pub-id pub-id-type="doi">10.48550/arXiv.2412.06211</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Arya</surname> <given-names>D</given-names></string-name>, <string-name><surname>Maeda</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ghosh</surname> <given-names>SK</given-names></string-name>, <string-name><surname>Toshniwal</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sekimoto</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>RDD2022: a multi-national image dataset for automatic road damage detection</article-title>. <source>Geosci Data J</source>. <year>2024</year>;<volume>11</volume>(<issue>4</issue>):<fpage>846</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.1002/gdj3.260</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chollet</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Xception: deep learning with depthwise separable convolutions</article-title>. In: <conf-name>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>; <year>2017 Jul 21&#x2013;26</year>;<publisher-loc>Honolulu, HI, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Vmamba: visual state space model</article-title>. In: <conf-name>Proceedings of the Neural Information Processing Systems 37 (NeurIPS 2024)</conf-name>. <year>2024 Dec 10&#x2013;15</year>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Mamba YOLO: SSMs-based YOLO for object detection</article-title>. <comment>arXiv:2406.05835. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2406.05835</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Koonce</surname> <given-names>B</given-names></string-name></person-group>. <chapter-title>MobileNetV3</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Koonce</surname> <given-names>B</given-names></string-name></person-group>, editor. <source>Convolutional neural networks with swift for tensorflow: image recognition and dataset categorization</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2021</year>. p. <fpage>125</fpage>&#x2013;<lpage>44</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-1-4842-6168-2_11</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Dang</surname> <given-names>Q</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>DETRs beat YOLOs on real-time object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2024)</conf-name>; <year>2024 Jun 17&#x2013;21</year>; <publisher-loc>Seattle, WA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zong</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Song</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>DETRs with collaborative hybrid assignments training</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2023)</conf-name>; <year>2023 Jun 18&#x2013;22</year>;<publisher-loc>Vancouver, BC, Canada</publisher-loc>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Nooralishahi</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ramos</surname> <given-names>G</given-names></string-name>, <string-name><surname>Maldague</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Dataset for drone-based inspection of road pavement structures for cracks, Mendeley Data, V1 [Internet]. [cited 2025 Jul 17]</article-title>. Available from: <ext-link ext-link-type="uri" xlink:href="https://data.mendeley.com/datasets/csd32bm8zx/1">https://data.mendeley.com/datasets/csd32bm8zx/1</ext-link>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>O</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>HighRPD: a high-altitude drone dataset of road pavement distress</article-title>. <source>Data Brief</source>. <year>2025</year>;<volume>59</volume>:<fpage>111377</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.dib.2025.111377</pub-id>; <pub-id pub-id-type="pmid">40034725</pub-id></mixed-citation></ref>
</ref-list>
</back></article>