<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">70563</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.070563</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>VitSeg-Det &#x0026; TransTra-Count: Networks for Robust Crack Detection and Measurement in Dynamic Video Scenes</article-title>
<alt-title alt-title-type="left-running-head">VitSeg-Det &#x0026; TransTra-Count: Networks for Robust Crack Detection and Measurement in Dynamic Video Scenes</alt-title>
<alt-title alt-title-type="right-running-head">VitSeg-Det &#x0026; TransTra-Count: Networks for Robust Crack Detection and Measurement in Dynamic Video Scenes</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Langyue</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Yuan</surname><given-names>Yubin</given-names></name><xref ref-type="aff" rid="aff-3">3</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>harley_yuan@nuaa.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wu</surname><given-names>Yiquan</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>nuaatracking@163.com</email></contrib>
<aff id="aff-1"><label>1</label><institution>College of Computer Science, Weinan Normal University</institution>, <addr-line>Weinan, 714000</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics</institution>, <addr-line>Nanjing, 210000</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>College of Information Engineering, Yangzhou University</institution>, <addr-line>Yangzhou, 225127</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Authors: Yubin Yuan. Email: <email>harley_yuan@nuaa.edu.cn</email>; Yiquan Wu. Email: <email>nuaatracking@163.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>82</elocation-id>
<history>
<date date-type="received">
<day>18</day>
<month>07</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>10</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_70563.pdf"></self-uri>
<abstract>
<p>Regular detection of pavement cracks is essential for infrastructure maintenance. However, existing methods often overlook challenges such as the continuous evolution of crack features across video frames and the difficulty of quantifying defects. To this end, this paper proposes an integrated Transformer-based framework for pavement crack detection, segmentation, tracking, and counting. First, we design VitSeg-Det, an integrated detection and segmentation network that accurately locates and segments tiny cracks in complex scenes. Second, we develop the TransTra-Count system, which automatically counts defects by combining defect tracking with width estimation. Finally, we conduct experimental verification on three datasets. The results show that the proposed method surpasses existing deep learning methods in detection accuracy. In addition, tests on real-world scene videos show that the framework can accurately label defect locations and output defect counts in real time.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Crack detection</kwd>
<kwd>multi-object tracking</kwd>
<kwd>semantic segmentation</kwd>
<kwd>counting</kwd>
<kwd>transformer</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Natural Science Foundation of Shaanxi Province of China</funding-source>
<award-id>2024JC-YBQN-0695</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Cracks are a critical form of damage in civil engineering and are commonly found in various infrastructures, including bridges, pavements, buildings, dams, and tunnels. These cracks not only impair the functionality of structural components but also pose significant safety hazards, potentially leading to catastrophic consequences. Thus, early crack detection is vital for pavement safety. However, identifying cracks, particularly microcracks, is challenging due to complex backgrounds, uneven illumination, and obstructions. While conventional manual inspection is tedious and error-prone, computer vision advancements now enable automated crack detection.</p>
<p>Current mainstream detection algorithms, such as Faster R-CNN [<xref ref-type="bibr" rid="ref-1">1</xref>], the YOLO series [<xref ref-type="bibr" rid="ref-2">2</xref>], and DETR [<xref ref-type="bibr" rid="ref-3">3</xref>], achieve efficient localization and classification on single-frame images. However, they still face several limitations: (1) Lack of dynamic continuity modeling: these models process individual frames independently, ignoring the continuous evolution of cracks across video sequences; as a result, the same crack may be detected repeatedly in real-world scenarios, leading to duplicate counting or trajectory drift. (2) Inability to track crack identity: most existing detectors lack an object tracking mechanism, making it difficult to determine whether cracks detected in consecutive frames belong to the same entity, which hinders consistent identity maintenance over time. (3) Unpredictable object counts: while detection models can localize objects with bounding boxes, they cannot directly output the total number of defects; in video sequences, defect counting often relies on additional post-processing logic, increasing system complexity. (4) Limited performance on small objects: traditional convolution-based detection networks struggle with small or texture-blurred cracks because of restricted receptive fields and resolution reduction. (5) Absence of robust counting mechanisms: most industrial inspection systems either lack integrated counting functionality or rely on heuristic threshold-matching rules, missing end-to-end learnable and robust counting solutions.</p>
<p>In practical engineering applications, such as road condition assessment [<xref ref-type="bibr" rid="ref-4">4</xref>], pipeline crack monitoring [<xref ref-type="bibr" rid="ref-5">5</xref>], weld defect inspection [<xref ref-type="bibr" rid="ref-6">6</xref>] and other scenarios, defect detection systems must process continuous video streams rather than performing single frame image analysis. Under such dynamic conditions, defect manifestations across temporal frames are susceptible to multiple interference factors, such as camera perspectives, illumination changes, sensor vibration, or occlusion. Defect morphology may dynamically evolve (e.g., crack propagation/lateral widening), while appearance features may suffer from blurring or distortion. Consequently, static detectors alone are incapable of achieving accurate defect identification and quantification, let alone fulfilling the demands for automated diagnosis and early warning systems in engineering practice.</p>
<p>In view of the above problems, we propose a Transformer-based integrated measurement framework for dynamic crack defects from the perspective of visual measurement system design. The framework structurally integrates detection, segmentation, tracking, and counting, balancing recognition accuracy with quantitative analysis capability. By introducing a fine feature sampling mechanism and cross-frame identity matching, the system achieves continuous identification and quantitative statistics of cracks in the video stream, with high automation, strong robustness, and low latency. The main contributions of this paper are as follows:
<list list-type="simple">
<list-item><label>(1)</label><p>A multi task integrated visual inspection system for dynamic scenes is proposed, integrating the tasks of detection, segmentation, tracking and counting in the same architecture to achieve continuous recognition, status tracking and quantity statistics of defect objects in the video stream. In view of the significant characteristics of crack like defects in dynamic scenes, mechanisms such as multi scale dilated convolution, channel spatial attention fusion, and dynamic feature sampling are adopted, enabling the system to maintain stable recognition capabilities in complex environments such as dynamics, occlusion, and non-rigid changes.</p></list-item>
<list-item><label>(2)</label><p>The integrated VitSeg-Det detection and segmentation network is designed, with a multi-scale feature representation built on an EfficientNet-b5 backbone. A micro-scale feature scoring module and a macro-scale perception module are integrated to achieve high-precision localization and mask generation for microcracks. The scoring module combines channel and spatial attention mechanisms, selects high-information-entropy regions through dynamic sampling, and feeds them into the Transformer encoder to improve the model&#x2019;s response to fine-grained defects; the macro-scale branch uses dilated convolution to model the global topology of cracks, enhancing the system&#x2019;s perception of long-range irregular cracks.</p></list-item>
<list-item><label>(3)</label><p>The TransTra-Count object tracking and counting module is proposed. Based on the self attention mechanism of Transformer, Spatial Feature Dual Modal Data Association and Long Term Memory Update Trade off Model (CrackDSF-LMe) is constructed to maintain the stable identity of crack objects under occlusion, blurring and illumination disturbance. The system integrates an unsupervised mask skeleton width estimation algorithm, combines a width smoothing mechanism and a change rate constraint, effectively solves the problem of &#x201C;sudden change of the same crack width&#x201D;, and realizes dynamic quantitative assessment and robust statistics of crack defect.</p></list-item>
<list-item><label>(4)</label><p>A high resolution, sequential and self-built video dataset of cracks, RoadDefect-MT, was constructed, covering a variety of complex scenes such as pavement disease types, occlusion and illumination changes, to comprehensively verify the performance of the system in terms of measurement accuracy, tracking consistency and statistical stability. This dataset fills the gap in the existing public data set that lacks dynamic features and measurement annotation, and provides basic resources and experimental platform support for subsequent crack identification and visual measurement research.</p></list-item>
</list></p>
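Contribution (3) above mentions a width smoothing mechanism with a change-rate constraint to suppress sudden jumps in a tracked crack&#x2019;s estimated width. The paper does not state the exact update rule; the sketch below shows one common realization under our own assumptions (an exponential moving average whose per-frame relative change is capped), with the function name and all constants purely illustrative.

```python
def smooth_width(prev_width, measured_width, alpha=0.3, max_rate=0.2):
    """Blend the new width measurement into the running estimate (EMA),
    then clamp the result so it changes by at most max_rate per frame.
    alpha and max_rate are illustrative values, not the paper's."""
    ema = (1.0 - alpha) * prev_width + alpha * measured_width
    lo = prev_width * (1.0 - max_rate)
    hi = prev_width * (1.0 + max_rate)
    return min(max(ema, lo), hi)
```

A spurious single-frame measurement (e.g., a segmentation glitch tripling the width) is thus bounded, while genuine gradual widening still passes through over several frames.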
<p>In addition, the system is designed with engineering deployment in mind: it runs on edge devices and suits industrial measurement scenarios such as road inspection, weld quality assessment, and pipeline crack monitoring, giving it good practical value. This work not only improves crack detection accuracy but also elevates the task from &#x201C;defect recognition&#x201D; to &#x201C;quantifiable measurement&#x201D;, providing a design paradigm for future video-based intelligent measurement systems.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Recent advances in computer vision have revolutionized infrastructure monitoring, enabling automated, precise measurement for structural health assessment. Key applications include pavement condition evaluation [<xref ref-type="bibr" rid="ref-7">7</xref>], bridge structural analysis [<xref ref-type="bibr" rid="ref-8">8</xref>], and concrete crack detection [<xref ref-type="bibr" rid="ref-9">9</xref>], where vision-based systems now play a pivotal role in operational decision-making. Particularly in metrology, the integration of computer vision with quantitative measurement techniques has emerged as a critical research focus, addressing the fundamental challenge of achieving automated defect identification with parametric precision and repeatability.</p>
<sec id="s2_1">
<label>2.1</label>
<title>CNN-Based Crack Detection Method</title>
<p>The rise of deep learning has provided new technological pathways for automatic crack detection and measurement [<xref ref-type="bibr" rid="ref-10">10</xref>], particularly within the CNN framework. Through end-to-end training methods, deep learning enables automatic learning of multi-level semantic features, significantly improving detection accuracy and scene adaptability. Research in this field can be broadly categorized into two main types: image segmentation and object detection.</p>
<p>Segmentation Methods. The goal of image segmentation tasks is to achieve pixel-level localization of crack areas. Common approaches include architectures such as FCN [<xref ref-type="bibr" rid="ref-11">11</xref>], DeepLab [<xref ref-type="bibr" rid="ref-12">12</xref>], and U-Net [<xref ref-type="bibr" rid="ref-13">13</xref>]. In the field of crack segmentation, Sun et al. developed a multi-scale attention module based on Deeplabv3&#x002B; to guide the decoder in extracting more fine-grained crack edge information [<xref ref-type="bibr" rid="ref-14">14</xref>]. Kang et al. integrated Faster-RCNN with tubular flow field and distance transformation modules to achieve segmentation and parametric measurement under complex backgrounds [<xref ref-type="bibr" rid="ref-15">15</xref>]. Ali and Cha introduced adversarial training mechanisms to alleviate the issue of scarce annotations while enhancing segmentation performance on concrete structures [<xref ref-type="bibr" rid="ref-16">16</xref>]. Kang and Cha designed a semantic transformation network combining multi-head attention mechanisms and compression modules, significantly improving segmentation accuracy and computational efficiency [<xref ref-type="bibr" rid="ref-17">17</xref>].</p>
<p>Detection Methods. Object detection methods focus on rapidly localizing crack targets through bounding box regression and classification, and can be divided into two-stage and single-stage approaches. Two-stage detectors such as the R-CNN series [<xref ref-type="bibr" rid="ref-18">18</xref>] excel in accuracy, with innovations such as illumination-robust Gaussian mixture integration [<xref ref-type="bibr" rid="ref-19">19</xref>] and morphology-enhanced bounding box optimization [<xref ref-type="bibr" rid="ref-20">20</xref>]. Meanwhile, single-stage models (YOLO, SSD [<xref ref-type="bibr" rid="ref-21">21</xref>]) prioritize speed, with recent improvements including deformable SSD for complex cracks [<xref ref-type="bibr" rid="ref-22">22</xref>], lightweight MobileNet variants [<xref ref-type="bibr" rid="ref-23">23</xref>], and attention-augmented YOLOv3 [<xref ref-type="bibr" rid="ref-24">24</xref>]. Hybrid approaches like YOLO-MF [<xref ref-type="bibr" rid="ref-25">25</xref>] further bridge speed and functionality by incorporating flow-based defect counting.</p>
<p>Although CNN-based methods have demonstrated strong capabilities in static image analysis, they still face three major limitations: (1) the restricted receptive field hinders the modeling of long-range structural dependencies, limiting the accurate identification of elongated or discontinuous cracks; (2) the inability to maintain inter-frame consistency complicates temporally coherent structural measurement; and (3) detection results often require additional modules for tasks such as object counting, which undermines system integration and operational automation.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Transformer-Based Crack Detection Method</title>
<p>The Transformer architecture has demonstrated remarkable advantages in recent visual tasks, owing to its global modeling capacity and powerful self attention mechanism. Since the initial application of Vision Transformer (ViT) in image classification, related approaches have rapidly expanded to domains such as semantic segmentation, object detection, and video modeling.</p>
<p>Segmentation Methods. The application of Vision Transformer (ViT) to segmentation primarily revolves around two types of architectures: pure Transformer models (e.g., SETR [<xref ref-type="bibr" rid="ref-26">26</xref>], SegFormer [<xref ref-type="bibr" rid="ref-27">27</xref>]) and hybrid CNN-Transformer models (e.g., Swin-UNet [<xref ref-type="bibr" rid="ref-28">28</xref>]). In the context of crack segmentation, notable examples include Wang et al.&#x2019;s efficient depthwise separable convolution model (88.08% mIoU using only 20% of the training data) [<xref ref-type="bibr" rid="ref-29">29</xref>] and Zhou et al.&#x2019;s SCDeepLab (inverted residuals combined with Swin Transformer), which outperforms CNN baselines [<xref ref-type="bibr" rid="ref-30">30</xref>]. These advances demonstrate Transformers&#x2019; superior accuracy and robustness for crack analysis.</p>
<p>Detection Methods. Transformers have revolutionized crack detection through their self-attention mechanisms, with several key advancements: Swin Transformers [<xref ref-type="bibr" rid="ref-31">31</xref>] enhance noise robustness via window-based attention, Linformer [<xref ref-type="bibr" rid="ref-32">32</xref>] reduces complexity to O(n) using low rank approximations [<xref ref-type="bibr" rid="ref-33">33</xref>], and Crack-DETR [<xref ref-type="bibr" rid="ref-34">34</xref>] combines high/low frequency features for noise resistant detection. Additional innovations include attention fused encoder decoders [<xref ref-type="bibr" rid="ref-34">34</xref>] for improved accuracy and NMS-free contrastive learning [<xref ref-type="bibr" rid="ref-35">35</xref>] for pavement defects. These developments demonstrate Transformers&#x2019; superiority in handling complex crack detection scenarios while addressing computational challenges.</p>
<p>The Transformer architecture effectively addresses three critical challenges in crack measurement: global context modeling, cross frame identity consistency, and morphological parameter quantification. Recent advances demonstrate its dual capability in both enhancing structural recognition accuracy and enabling automated measurement of defect characteristics (location, width, evolution trends) through superior sequence modeling. While successful in static image analysis, current Transformer applications largely neglect dynamic video requirements, particularly in modeling temporal patterns and quantifiable metrics like width variation trends and target consistency across frames. This highlights a crucial research gap in developing video oriented Transformer architectures for structural measurement tasks.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>This paper proposes an integrated framework for the detection, segmentation, tracking, and counting of pavement crack defects in video sequences. As illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the system is designed to continuously identify crack objects across the spatio-temporal dimension and to automate the quantification of structural parameters. Built upon a Transformer-based architecture, the method integrates spatial structure modeling and temporal state maintenance to overcome limitations of conventional image-level approaches, such as poor cross-frame correlation and repeated counting. The system consists of a front-end crack perception module and a back-end structural measurement module: the former employs the VitSeg-Det network to perform high-precision object detection and pixel-level mask segmentation, while the latter fuses segmentation results with multi-frame features and uses the TransTra-Count module to establish object-level tracking chains for identity preservation and quantity statistics. Additionally, the system integrates mask skeleton extraction and width estimation algorithms, enabling the extraction of crack geometric parameters without manual annotation. Ultimately, the system stably outputs structured measurement results under continuous video input, offering an efficient and reliable vision-based solution for intelligent assessment of surface defects in infrastructure.</p>
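The mask skeleton width estimation mentioned above is described only at a high level in the text. A standard realization measures, at each skeleton pixel, the distance to the nearest background pixel of the mask and converts it to a thickness. The sketch below is a deliberately brute-force illustration of that idea, not the paper&#x2019;s implementation (a production version would use a distance transform, e.g., scipy.ndimage.distance_transform_edt); the width formula 2d &#x2212; 1 is our assumption.

```python
import numpy as np

def crack_widths(mask, skeleton):
    """Estimate crack width (pixels) at each skeleton point of a binary mask.

    width ~= 2*d - 1, where d is the distance from a skeleton pixel to the
    nearest background pixel: a 3-px-thick crack whose skeleton runs along
    the middle row gives d = 2 and width = 3. Brute force for clarity.
    """
    bg = np.argwhere(mask == 0)                  # background coordinates
    widths = []
    for p in np.argwhere(skeleton):
        d = np.sqrt(((bg - p) ** 2).sum(axis=1)).min()
        widths.append(2.0 * d - 1.0)
    return np.array(widths)
```

Per-point widths like these are what the back end smooths over frames before reporting a crack&#x2019;s structural parameters.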
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Framework of the Transformer-based detection, segmentation, tracking and counting algorithm</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-1.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Detection and Segmentation Model: VitSeg-Det</title>
<p>VitSeg-Det is an integrated visual modeling architecture for pavement crack detection and segmentation that we propose to achieve high-precision crack region recognition and structural parameter extraction, providing accurate, detailed input for the subsequent tracking and measurement modules. The network uses EfficientNet-b5 as its backbone feature extractor, exploiting its compound scaling of width, depth, and resolution to build efficient multi-scale receptive fields. On this basis, VitSeg-Det introduces the fine feature refinement module Sampled-ViT and a macro-scale structure modeling module. The former combines channel and spatial attention to build a scoring network, dynamically samples regions rich in crack morphology information, and guides the Transformer encoder to capture fine crack details through a lightweight embedding mechanism; the latter uses dilated convolution to expand the receptive field and extract the topological continuity and irregular distribution patterns of cracks in global space. These two types of features jointly feed the segmentation and detection branches through abstract fusion, unifying pixel-level crack mask generation and bounding box localization. While remaining lightweight and deployable, the architecture significantly improves the model&#x2019;s ability to identify tiny cracks, branching structures, and defects in complex backgrounds, laying an accurate perceptual foundation for robust defect measurement and dynamic structure modeling in video sequences. Its structure is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The detailed network of the Transformer-based pavement crack segmentation and detection (VitSeg-Det)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-2.tif"/>
</fig>
<sec id="s3_1_1">
<label>3.1.1</label>
<title>Abstract Feature Fusion</title>
<p>After feature extraction, we abstract the feature F into the micro-scale feature <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the macro-scale feature <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. In the pavement crack detection task, <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> mainly describes the local details of cracks, such as edge sharpness, microcracks, and surface texture; these features are crucial for distinguishing real cracks from noise (such as stains and shadows). <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> mainly models the continuity, branch structure, and overall trend of cracks, overcoming the fragmentation and false detections that local features (such as edges and textures) suffer in complex scenes. 
<inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> complement each other, and both further reduce the computational load of the model. <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are fused by the following methods to generate the final abstract feature <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, as shown in <xref ref-type="disp-formula" rid="eqn-1">Formula (1)</xref>, where <bold>W</bold><sub><italic>g</italic>1</sub> is a learnable parameter, which is automatically optimized through standard backpropagation.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Sigmoid</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext mathvariant="bold">W</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
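Formula (1) gates the macro-scale feature by a sigmoid of a learnable linear map of the micro-scale feature. A minimal NumPy sketch follows; the tensor shapes and the interpretation of W<sub><italic>g</italic>1</sub> as a channel-mixing matrix are our assumptions, since the paper only gives the equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(f_micro, f_macro, w_g1):
    """Formula (1): F_fuse = Sigmoid(W_g1 . F_micro) (elementwise *) F_macro.
    f_micro, f_macro: (C, H, W) feature maps; w_g1: assumed (C, C) matrix.
    The gate computed from micro-scale detail modulates macro-scale structure."""
    gate = sigmoid(np.einsum("oc,chw->ohw", w_g1, f_micro))
    return gate * f_macro
```

With W<sub><italic>g</italic>1</sub> learned by backpropagation as stated in the text, the gate suppresses macro-scale responses wherever micro-scale evidence for a crack is weak.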
</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Fine Feature Refinement Module: Sampled-ViT</title>
<p>VitSeg-Det contains a Sampled-ViT module that is mainly used to refine features; <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is its output embedding. First, the refined feature <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is obtained through a lightweight scoring network; it is then flattened into patch embeddings using a 16 &#x00D7; 16 convolution, combined with learnable position embeddings, and fed into the ViT encoder, which uses six multi-head self-attention layers.</p>
<p>The lightweight scoring network combines channel attention and spatial attention, where the spatial attention branch uses depthwise separable convolution to further reduce the parameter count. Concretely, channel attention generates channel weights through global average pooling and a fully connected layer, while spatial attention computes a spatial importance score through a depthwise separable convolution (a channel-wise 3 &#x00D7; 3 convolution followed by a 1 &#x00D7; 1 convolution for dimension reduction); the two are finally multiplied to obtain the fused feature score. While maintaining sensitivity to crack regions, this design significantly reduces computational complexity and is well suited to edge-device deployment. The information score S at each spatial position is:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mtext mathvariant="bold">R</mml:mtext></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Sigmoid</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>GAP</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2299;</mml:mo><mml:mrow><mml:mtext>Sigmoid</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>DepthwiseSepConv</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mover><mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Flatten</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>P</mml:mi><mml:mi>E</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
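As a concrete reading of Eqs. (2) and (3), the scoring network can be sketched in NumPy as below. The weight shapes (`W1`, `W2`, a per-channel 3 &#x00D7; 3 depthwise kernel, a pointwise weight vector) and the reduction of the channel weights to a scalar gate, so that S has shape B &#x00D7; 1 &#x00D7; H &#x00D7; W, are illustrative assumptions of ours, not the authors' exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def depthwise_sep_conv(F, dw_kernel, pw_weight):
    # Depthwise 3x3 convolution (one kernel per channel, zero padding),
    # followed by a 1x1 pointwise convolution reducing C channels to 1.
    B, C, H, W = F.shape
    padded = np.pad(F, ((0, 0), (0, 0), (1, 1), (1, 1)))
    dw = np.zeros_like(F)
    for i in range(3):
        for j in range(3):
            dw += dw_kernel[None, :, i, j, None, None] * padded[:, :, i:i + H, j:j + W]
    return np.einsum('bchw,c->bhw', dw, pw_weight)[:, None]   # (B, 1, H, W)

def light_score_net(F, W1, W2, dw_kernel, pw_weight):
    # Eq. (3): S = Sigmoid(MLP(GAP(F))) * Sigmoid(DepthwiseSepConv(F)).
    gap = F.mean(axis=(2, 3))                     # (B, C) global average pooling
    ch = sigmoid(gap @ W1 @ W2)                   # (B, C) channel weights
    gate = ch.mean(axis=1)[:, None, None, None]   # collapsed to a (B,1,1,1) gate
    sp = sigmoid(depthwise_sep_conv(F, dw_kernel, pw_weight))  # (B, 1, H, W)
    return gate * sp                              # S in R^{B x 1 x H x W}, Eq. (2)
```

Both factors are sigmoid outputs, so each position's score stays in (0, 1), consistent with its later use as a modulation factor.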
<p><italic>PE</italic> denotes the position embeddings, and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the Top-N feature set selected, before sampling, by sorting the scores of the scoring network <italic>S</italic>.</p>
<p>Refined feature <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. As shown in <xref ref-type="disp-formula" rid="eqn-4">Formula (4)</xref>, <italic>S</italic> is flattened and sorted to obtain a vector <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mover><mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2192;</mml:mo></mml:mover></mml:math></inline-formula> of length <italic>H</italic>&#x00B7;<italic>W</italic>. From an information-theoretic viewpoint, <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> can be regarded as positively correlated with the information entropy at that feature location, so high-<italic>S</italic> regions carry more task-related information (such as the shape and direction of a crack). We dynamically select the Top-N features before sampling and use the predicted information score <italic>S</italic> as the modulation factor of the fine feature set. Note that N &#x2264; <italic>H</italic>&#x00B7;<italic>W</italic> and N varies with the image content: N &#x003D; [<italic>&#x03C9;</italic>&#x00B7;<italic>H</italic>&#x00B7;<italic>W</italic>], and the Top-N feature locations are selected according to the dynamic ratio <italic>&#x03C9;</italic>, whose value is examined in the later ablation experiments:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>&#x03C9;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>DynamicOmega</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
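The Top-N selection described above (Eqs. (4) and (6)) reduces, in essence, to sorting the flattened score map and keeping the first N &#x003D; [&#x03C9;&#x00B7;H&#x00B7;W] positions. A minimal sketch, with &#x03C9; supplied as a plain number rather than predicted by DynamicOmega:

```python
import numpy as np

def topn_positions(S, omega):
    # Eq. (4)/(6): flatten the (H, W) score map, sort descending, and keep
    # the top N = floor(omega * H * W) positions (at least one).
    H, W = S.shape
    n = max(1, int(omega * H * W))
    order = np.argsort(S.ravel())[::-1][:n]        # highest scores first
    return [tuple(p) for p in np.stack(np.unravel_index(order, (H, W)), axis=1)]
```

For example, on a 2 &#x00D7; 2 score map with &#x03C9; = 0.5, the two highest-scoring positions are kept.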
<p>Macro-scale feature <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Given the long-range and irregular distribution of pavement cracks at the macro scale, we add a macro-scale feature extraction module: a 7 &#x00D7; 7 dilated convolution (dilation rate &#x003D; 2) enlarges the receptive field and captures the global topological characteristics of cracks (such as continuity and branch structure) without sacrificing resolution, outputting the macro-scale feature map <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Compared with stacking several small convolution kernels, a single large dilated kernel achieves the same receptive field with markedly fewer parameters. This module and the fine-feature module are hierarchically complementary.</p>
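The receptive-field claim can be checked arithmetically: a 7 &#x00D7; 7 kernel with dilation rate 2 covers an effective 13 &#x00D7; 13 window, which a stack of 3 &#x00D7; 3 convolutions only reaches after six layers. A small sketch (the 64-channel width in the usage note is an assumed example, not a figure from the paper):

```python
def effective_kernel(k, dilation):
    # Effective receptive field of a k x k convolution with the given
    # dilation: k_eff = k + (k - 1) * (dilation - 1)
    return k + (k - 1) * (dilation - 1)

def conv_params(k, c_in, c_out):
    # Weight count of a single k x k convolution layer (ignoring bias).
    return k * k * c_in * c_out
```

With 64 input and output channels, six stacked 3 &#x00D7; 3 layers need 6 &#x00B7; 9 &#x00B7; 64 &#x00B7; 64 = 221,184 weights, while the single dilated 7 &#x00D7; 7 layer needs 49 &#x00B7; 64 &#x00B7; 64 = 200,704 for the same 13 &#x00D7; 13 receptive field.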
<p>As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the fused feature obtained from <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is divided into two parts: after the final convolution and upsampling operations, one part generates a binary segmentation map, and the other is sent to the Transformer decoder.</p>
<p>The detection network is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. After passing through the 6-layer encoder, the features are fed into the decoder together with the positional encoding and the learned queries. The output of each decoder layer predicts the type and location of defects, in a manner similar to a feature pyramid network.</p>
</sec>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Tracking and Counting Model: TransTra-Count</title>
<p>The use of traditional CNN-based networks for defect detection has become very mature, but for defect counting these detection networks cannot tell whether detections in different frames correspond to the same defect, which leads to repeated counting. Therefore, we developed a Transformer-based defect tracking and counting method, TransTra-Count.</p>
<p>The tracking network is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. In this paper, we propose a pavement crack tracking method, CrackDSF-ULMe, built on DETR. Our core contribution is a data association model that combines the appearance-feature similarity and spatial similarity of cracks, designed around the characteristics of pavement cracks. It handles the small displacements between consecutive frames, adapts to the feature fluctuations caused by illumination changes, and also copes with occlusion and ambiguity. In the long-term memory update module, we introduce the illumination change as a control signal to suppress shadow interference, and propose an adaptive aggregation algorithm that fuses the outputs of two adjacent frames to alleviate occlusion. At the same time, an unsupervised scheme monitors the crack width from the segmentation mask, and a width loss is added to the loss function to constrain the width change rate, avoiding the unreasonable situation in which &#x201C;the same crack changes width abruptly between adjacent frames&#x201D;. The trajectory management module automatically initializes and terminates trajectories according to the matching confidence and historical activity, ensuring that the life cycle of each crack ID is consistent with physical reality. Through the closed-loop process of feature enhancement &#x2192; detection &#x2192; association &#x2192; long-term memory update &#x2192; track management, the whole system realizes structured automatic measurement and outputs the number, location, size, and damage level of crack objects, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, providing a spatio-temporally continuous quantitative basis for pavement disease diagnosis.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The detailed network of transformer-based spatial-feature dual-modal joint enhancement long-term memory method for pavement crack tracking (CrackDSF-LMe)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-3.tif"/>
</fig><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The closed-loop flowchart of CrackDSF-LMe</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-4.tif"/>
</fig>
<p>We feed the results of VitSeg-Det into the tracking model because pavement cracks in video may appear different due to changes in illumination, rain cover, or shooting angle, so the same crack can look like different objects in different frames. We therefore use VitSeg-Det to obtain the multi-scale features of the cracks, as well as the segmentation mask, to improve the robustness of the network.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Spatial-Feature Dual-Modal Data Association: DSFM</title>
<p>In the data association phase, the algorithm combines a spatial IoU measure with an attention-based feature similarity, dynamically balancing their contributions through a learnable weight parameter. The cost matrix is given in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>IoU</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext mathvariant="bold">B</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext mathvariant="bold">B</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>DecoderAttention</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mtext mathvariant="bold">R</mml:mtext></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>DecoderAttention</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">Q</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="bold">K</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">Q</mml:mtext></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mrow><mml:mtext mathvariant="bold">K</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
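Eqs. (7) and (8) can be sketched as follows; the pairwise IoU loop is illustrative, and &#x03BB;<sub>1</sub> is passed as a plain number rather than learned as in the paper:

```python
import numpy as np

def iou(a, b):
    # Intersection over union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def decoder_attention(Q, K):
    # Eq. (8): softmax(Q . K^T / sqrt(d_k)), normalized row-wise over tracks.
    logits = Q @ K.T / np.sqrt(K.shape[1])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost_matrix(boxes_t, boxes_prev, Q, K, lam1):
    # Eq. (7): C_t[i, k] = lam1 * IoU + (1 - lam1) * DecoderAttention.
    spatial = np.array([[iou(b, bp) for bp in boxes_prev] for b in boxes_t])
    return lam1 * spatial + (1.0 - lam1) * decoder_attention(Q, K)
```

A detection and the track it overlaps (spatially and in feature space) receive a larger entry of C than a non-overlapping pair.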
<p><inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is the matching score of the <italic>i</italic>-th detection and the <italic>k</italic>-th track, and the matrix stores the scores of all detection&#x2013;track pairs; the larger the value, the higher the matching probability. <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the spatial weight coefficient (0 &#x2264; <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x2264; 1), which controls the relative importance of location and feature information; <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mrow><mml:mtext mathvariant="bold">B</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the coordinate matrix of the current-frame detection boxes, and <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext mathvariant="bold">B</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the coordinate matrix of the boxes predicted for the previous-frame trajectories; the current-frame position is predicted from the historical trajectory motion model. <bold>Q</bold> and <bold>K</bold> are introduced below. 
<inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 512 is the dimension of <bold>K</bold>.</p>
<p>The association decision <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is used to determine whether the detection box should update the existing track or create a new track, as shown in <xref ref-type="disp-formula" rid="eqn-9">Formula (9)</xref>:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext mathvariant="italic">if</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>arg max</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mtext>&#xA0;and</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>match</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#xA0;otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mrow><mml:mtext>arg max</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> denotes the <italic>j</italic> that maximizes <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, i.e., the most likely historical track for the current detection; <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>match</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the matching threshold, generally between 0.5 and 0.7; in this paper 0.6 is selected to filter out low-quality matches and avoid incorrect associations.</p>
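The decision rule of Formula (9) is a per-detection argmax followed by thresholding; a minimal sketch with &#x03B8;<sub>match</sub> = 0.6 as in the paper:

```python
import numpy as np

def associate(C, theta_match=0.6):
    # Eq. (9): a_t[i, k] = 1 iff k is the argmax of row i of the score
    # matrix C and the matching score exceeds theta_match.
    A = np.zeros_like(C, dtype=int)
    for i in range(C.shape[0]):
        k = int(np.argmax(C[i]))
        if C[i, k] > theta_match:
            A[i, k] = 1
    return A
```

A detection whose best score falls below the threshold is left unmatched and is handled by the track birth logic described next.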
<p>When <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, the system considers detection <italic>i</italic> and track <italic>k</italic> to be continuations of the same crack, and detection <italic>i</italic> updates the state of track <italic>k</italic> (such as its position, width, and memory features); when <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, the system treats the detection as a new crack or a false detection, and detection <italic>i</italic> may initialize a new track (if the birth condition is met).</p>
<p>If the matching scores <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> between a detection object <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> (the feature vector of the <italic>i</italic>-th detection in the current frame) and all existing tracks are all lower than <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (the new-track threshold; <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.5 in this paper), the detection is regarded as a new target, and its feature <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is added to the track feature memory set <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of the current frame. If the proportion of unmatched frames of a track <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> within the latest <italic>T</italic> frames exceeds <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>term</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (the track-termination threshold; <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>term</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.85 in this paper), the target is considered to have left the scene and the track is removed from the set <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. <italic>T</italic> is the size of the time window used to judge termination; <italic>T</italic> &#x003D; 20 in this paper. The track set is updated as follows:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C3;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mover><mml:mi>f</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="italic">if</mml:mtext></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mi 
mathvariant="normal">&#x2216;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext mathvariant="italic">if</mml:mtext></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mfrac><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>g</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x003E;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>term</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Adaptive Appearance Aggregation-Long-Term Memory Update Trade Off Model: ULMeM</title>
<p>Different from most existing methods, the core contribution of CrackDSF-ULMe here is to establish a long-term memory that maintains the long-term temporal characteristics of each crack and effectively injects temporal information into the subsequent tracking process, so that extremely long cracks are not identified repeatedly.</p>
<p>Generally, in a video stream an object changes and moves very little between consecutive frames, so we design a Long-term Memory Retention Update Trade-off Module (ULMeM), as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. We first send the encoder output <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mrow><mml:mover><mml:mi>M</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> into the decoder and then update it according to <xref ref-type="disp-formula" rid="eqn-11">Formula (11)</xref>. The decoder output is <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>; the value of <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is discussed in the experiments.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>M</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>M</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
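The update of Formula (11) is an exponential moving average between the stored memory and the current decoder output; a one-line sketch (the default &#x03BB;<sub>2</sub> = 0.25 is a placeholder of ours, since the paper tunes this value experimentally):

```python
def update_memory(M_hat, O_hat, lam2=0.25):
    # Eq. (11): M_{t+1}^k = (1 - lam2) * M_hat_t^k + lam2 * O_hat_t^k,
    # an exponential moving average that retains long-term memory while
    # absorbing the newest observation.
    return (1.0 - lam2) * M_hat + lam2 * O_hat
```

A small &#x03BB;<sub>2</sub> keeps the memory stable against transient disturbances; a large &#x03BB;<sub>2</sub> lets it follow recent appearance changes more quickly.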
<p>Blurring and occlusion occur frequently in video streams. An intuitive remedy is to use multi-frame features to enhance the single-frame representation, so in ULMeM we use an adaptive aggregation algorithm to fuse the outputs of two adjacent frames. Because of occlusion and blurring, the output embedding of the current frame may be unreliable. Therefore, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, we generate a channel weight <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> for each tracking instance to determine the proportion of track <italic>k</italic> retained in the memory of the current frame, alleviating this problem:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Sigmoid</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2225;</mml:mo><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow></mml:mrow><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mtext>hist</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>hist</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>wherein <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the updated decoder output embedding, which serves as Q; <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is V; and <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msubsup><mml:mrow><mml:mover><mml:mi>M</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is K. <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the inter-frame illumination change, computed from gray-level histograms, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the image of the <italic>t</italic>-th frame. [&#x00B7;&#x2225;&#x00B7;&#x2225;&#x00B7;] denotes vector concatenation. Concatenating <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msubsup><mml:mrow><mml:mover><mml:mi>O</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> preserves the continuity of the crack ID, concatenating <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> supplies the latest state of the crack, and concatenating <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x2113;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> resists interference from sudden brightness changes.
When <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>&#x2113;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> becomes large, the illumination is changing sharply; <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> then tends to 0, so the historical memory is preferred, consistent with Eq. (12). When <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>O</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:math></inline-formula> increases, there is a significant difference between the current detection and the historical track, and the object may be deformed, occluded, or changing state (such as a crack widening); <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> then tends to 1, so the memory is updated to adapt to the object change.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>The structure diagram of Memory Retention-Update Trade-off Module (ULMeM)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-5.tif"/>
</fig>
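<p>The memory update in Eqs. (12)&#x2013;(14) can be sketched as follows. This is a minimal NumPy illustration rather than the released implementation: the MLP is passed in as a stand-in callable, and all function and variable names are ours.</p>

```python
import numpy as np

def histogram_change(img_t, img_prev, bins=32):
    """Delta-l_t in Eq. (14): L2 distance between gray-level histograms."""
    h_t, _ = np.histogram(img_t, bins=bins, range=(0, 255), density=True)
    h_p, _ = np.histogram(img_prev, bins=bins, range=(0, 255), density=True)
    return np.linalg.norm(h_t - h_p)

def update_track_memory(o_prev, o_curr, delta_l, mlp):
    """Eqs. (12)-(13): channel-wise gate between history and current embedding.

    o_prev : previous memory embedding O_hat_{t-1}^k
    o_curr : current decoder output O_t^k
    mlp    : stand-in callable mapping the concatenated vector to per-channel logits
    """
    z = np.concatenate([o_prev, [delta_l], o_curr])   # [O_hat_{t-1} || delta_l || O_t]
    w_g2 = 1.0 / (1.0 + np.exp(-mlp(z)))              # Sigmoid(MLP(.)), Eq. (13)
    return (1.0 - w_g2) * o_prev + w_g2 * o_curr      # Eq. (12)
```

A gate near 0 keeps the historical memory, while a gate near 1 takes the current frame's embedding.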
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Loss Function</title>
<p>Finally, we design a combination of detection, tracking, and width losses. The detection loss dominates initial training to quickly locate cracks, while the tracking and width losses dominate the later stage for fine tracking. Before computing the loss, we first design an unsupervised crack-width calculation method, which interacts with the width loss.</p>
<p>According to the standard for rating the technical condition of highways, the damage degree of a crack is judged by its average width, so in this paper we compute the average crack width with an unsupervised method. The process is shown in Algorithm 1. First, the input binary mask is skeletonized to extract the object's central skeleton. Then the Euclidean distance transform of the mask is computed (i.e., the distance from each foreground pixel to the nearest background pixel); the distance value at each skeleton pixel is the radius of the maximum inscribed circle at that position, i.e., the half-width. Finally, the average of twice the distance over all skeleton pixels (i.e., the full width) is taken as the average width of the object.</p>
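<p>The width computation described above can be sketched with SciPy's Euclidean distance transform. This is an illustrative reconstruction of Algorithm 1, not the authors' code; the skeleton is passed in precomputed (e.g., from skimage.morphology.skeletonize) so the sketch depends only on NumPy and SciPy.</p>

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def average_crack_width(mask, skeleton):
    """Average width per Algorithm 1: mean of 2 * EDT over skeleton pixels.

    mask     : boolean array, crack foreground = True
    skeleton : boolean array, 1-pixel-wide centerline of `mask`
    """
    dist = distance_transform_edt(mask)   # distance to nearest background pixel
    radii = dist[skeleton]                # inscribed-circle radius at the centerline
    return 2.0 * radii.mean() if radii.size else 0.0
```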
<p>To make the width curve smoother and suppress measurement noise, the width is updated gradually, as shown in the following formula:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mrow><mml:mtext>base</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where, <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the width of the current frame after smoothing, <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the width of the previous frame after smoothing, <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the 
detection width of the current frame, and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the adaptive memory decay coefficient. The adaptive decay lets the filter follow dynamically changing scenes: a <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> set too large causes hysteresis when the width changes sharply (such as rapid object deformation) or when measurement noise suddenly increases (such as sensor anomalies), while a <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> set too small admits noise. In this paper, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mrow><mml:mtext>base</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (base decay coefficient) &#x003D; 0.9 and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> (adjustment sensitivity) &#x003D; 0.1. Therefore, when the width changes dramatically, that is, when the <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (width smoothing loss) is large, <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is reduced adaptively so that the filter responds to new data faster.</p>
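<p>The adaptive smoothing of Eqs. (15)&#x2013;(16) reduces to a few lines; the sketch below uses the paper's values of 0.9 and 0.1 for the base decay and sensitivity, with illustrative names of our own:</p>

```python
import math

GAMMA_BASE = 0.9   # base decay coefficient (paper value)
LAMBDA_3 = 0.1     # adjustment sensitivity (paper value)

def smooth_width(w_prev, w_meas):
    """Eqs. (15)-(16): exponential moving average with adaptive decay gamma_t.

    w_prev : smoothed width from the previous frame (w_{t-1}^k)
    w_meas : raw detected width in the current frame (w_t^i)
    Returns the smoothed width w_t^k and the decay used.
    """
    gamma_t = GAMMA_BASE * math.exp(-LAMBDA_3 * abs(w_meas - w_prev))  # Eq. (16)
    return gamma_t * w_prev + (1.0 - gamma_t) * w_meas, gamma_t        # Eq. (15)
```

A small frame-to-frame change keeps the decay near 0.9 (heavy smoothing); a large jump shrinks the decay so the estimate follows the new measurement quickly.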
<p>The specific form of the loss function is shown in <xref ref-type="disp-formula" rid="eqn-17">(17)</xref>. <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the detection-branch loss, implemented with Focal Loss, which optimizes crack detection accuracy; <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the track consistency loss, which ensures the spatio-temporal continuity of crack IDs, and <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is its weight, balancing detection and tracking; <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> suppresses jumps in the width estimate, and <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the weight of the width smoothing loss.
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<fig id="fig-13">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-13.tif"/>
</fig>
<p>In this paper, <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.8, <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.4, and the calculation method of <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is shown in <xref ref-type="disp-formula" rid="eqn-18">Formula (18)</xref>.
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>P</mml:mi><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>P</mml:mi></mml:mrow></mml:munder><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mover><mml:mi>f</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mover><mml:mi>f</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi 
mathvariant="normal">&#x005F;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where, <italic>P</italic> is the successfully associated detection tracking pair, and <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mi>P</mml:mi><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:math></inline-formula> is the effective tracking number of the current frame; <italic>&#x03C4;</italic> is the temperature coefficient less than 1, which can amplify the similarity score of the current matching pair (detection <italic>i</italic> and track <italic>k</italic>) and control the sharpness of the probability distribution. <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msubsup><mml:mover><mml:mi>f</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the current detection and track matching degree. When the video accuracy is high, a smaller <italic>&#x03C4;</italic> such as 0.05 can be selected to distinguish similar cracks more strictly. 
When encountering low-quality video streams, a larger <italic>&#x03C4;</italic> such as 0.5 can be chosen, allowing more relaxed matching. In this paper, <italic>&#x03C4;</italic> &#x003D; 0.1. <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mover><mml:mi>f</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mover><mml:mi>M</mml:mi><mml:mo>&#x2192;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the competitive weight over all candidate tracks, which prevents adjacent crack tracks from interfering with the current match. In summary, <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> encourages the network to learn discriminative features, so that the feature similarity of the same crack across frames is higher than that between different cracks.</p>
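<p>Eq. (18) is an InfoNCE-style contrastive loss. A minimal NumPy sketch (our own illustration, with embeddings as rows and a numerically stable log-softmax) is:</p>

```python
import numpy as np

def track_consistency_loss(f_fuse, m_track, pairs, tau=0.1):
    """Eq. (18): contrastive loss over matched (detection i, track k) pairs.

    f_fuse : (N_det, D) fused detection embeddings
    m_track: (N_trk, D) track memory embeddings
    pairs  : list of matched (i, k) index pairs (the set P)
    tau    : temperature; the paper uses 0.1
    """
    sims = f_fuse @ m_track.T / tau                  # similarity of every det to every track
    sims = sims - sims.max(axis=1, keepdims=True)    # stabilize before exponentiating
    log_p = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))  # log-softmax over tracks
    return -np.mean([log_p[i, k] for i, k in pairs])
```

Correct matches with high relative similarity drive the loss toward zero; mismatched pairs are penalized heavily.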
<p><inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the key term for ensuring the continuity of crack size measurement, and it directly determines the reliability of width-trend analysis. It consists of two parts, an absolute error term and a width-change consistency term, computed as shown in <xref ref-type="disp-formula" rid="eqn-19">Formula (19)</xref>. The first part is the absolute error term, which constrains the temporal smoothness of the width estimate; &#x2225;&#x00B7;&#x2225;<sup>2</sup> is the squared error, which amplifies significant differences. The second part is the width-change consistency term, which prevents abrupt width changes. <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the width of detection <italic>i</italic> in the previous frame, <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> is the time between frames (in seconds), with <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula> &#x003D; 1/30 s in this paper, and <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the weight coefficient of the width change rate, <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.4.
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo>|</mml:mo><mml:mi>P</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>P</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>P</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula></p>
</sec>
<sec id="s3_2_4">
<label>3.2.4</label>
<title>Count Evaluation Module: CoM</title>
<p>In the TransTra-Count system, the counting module plays a critical role in the statistical analysis of crack targets and the evaluation of damage levels. Its core function is to achieve real-time, unique, and non-repetitive counting of crack objects in video streams while outputting structured measurement results. Deeply integrated with the detection, segmentation, and tracking branches, the module forms a closed-loop automated measurement framework tailored for surface defect quantification. The overall workflow is illustrated in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>The closed-loop flowchart of TransTra-Count</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-6.tif"/>
</fig>
<p>First, the system determines whether structural crack targets exist in the current frame based on results from the front-end detection and segmentation subnetworks. If no valid defect region is detected, the system skips to the next frame to avoid unnecessary computation, forming a sparse, efficient, and low-redundancy frame-level processing pipeline. Once a potential crack region is identified, the counting module immediately invokes a multi-object tracking mechanism to perform real-time matching between the current detection and historical trajectories. This matching process relies on multimodal similarity measures incorporating spatial location, segmentation mask, and appearance features. A matching control strategy is then applied to determine the temporal continuity of the target: if it is a new crack instance, the system assigns a unique ID and updates the global defect counter; if it corresponds to an existing target tracked across frames, the original ID and counting state are retained to ensure temporal consistency and avoid duplicate counting.</p>
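<p>The ID-assignment and counting logic described above can be sketched as follows. This is a minimal illustration of the bookkeeping only; the matching itself is delegated to the tracker, and all names are ours:</p>

```python
import itertools

class CrackCounter:
    """Sketch of CoM bookkeeping: assign a unique ID to each new crack,
    keep the ID for re-associated tracks, and count each crack exactly once."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.total = 0              # global defect counter
        self.active = {}            # track_id -> last-seen frame index

    def update(self, frame_idx, matched_ids, n_new):
        """matched_ids: IDs re-associated by the tracker in this frame;
        n_new: detections with no matching track (new crack instances)."""
        for tid in matched_ids:     # existing tracks keep their ID and count state
            self.active[tid] = frame_idx
        new_ids = [next(self._next_id) for _ in range(n_new)]
        for tid in new_ids:         # new instance: unique ID + counter update
            self.active[tid] = frame_idx
            self.total += 1
        return new_ids
```

Re-associating an existing ID leaves the counter untouched, which is what prevents duplicate counting across frames.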
<p>Ultimately, the counting module outputs structured statistical results-including crack quantities and identifiers-providing quantitative, traceable, and engineering-ready decision support for pavement condition management, maintenance prioritization, and lifecycle assessment. By establishing a closed-loop feedback mechanism integrating detection, tracking, and measurement, the module not only enhances counting accuracy and system responsiveness but also embodies an industrial intelligent inspection philosophy oriented toward &#x201C;structured measurement&#x201D;. The system demonstrates strong generalization capability and deployment feasibility in practical applications, particularly in use cases such as road crack maintenance, bridge structural diagnosis, and condition assessment of transportation infrastructure.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Dataset Construction</title>
<p>Our work relies on three public datasets: DeepCrack [<xref ref-type="bibr" rid="ref-36">36</xref>] (a general crack segmentation dataset), UVA-PDD2023 [<xref ref-type="bibr" rid="ref-37">37</xref>] (a pavement disease detection dataset), and UAPD [<xref ref-type="bibr" rid="ref-38">38</xref>]. We also independently collected the RoadDefect-MT dataset, together forming a multi-source, multi-scene pavement disease database. The specific composition is as follows:</p>
<sec id="s4_1">
<label>4.1</label>
<title>Public Dataset</title>
<p>DeepCrack: provides high-resolution (up to 2592 &#x00D7; 1944) fine-grained crack annotations covering various scenes such as walls and floors, and is used for generalized learning of slender crack characteristics.</p>
<p>UVA-PDD2023: includes annotations of common pavement diseases such as cracks at a lower resolution (640 &#x00D7; 480), making it suitable for training the model's small-target detection ability.</p>
<p>UAPD: comprises 3151 images with an original resolution of 7952 &#x00D7; 5304 pixels. It includes six types of road defects with diverse sizes and morphological characteristics: longitudinal cracks (LC), transverse cracks (TC), alligator cracks (AC), oblique cracks (OC), repair marks, and potholes.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Self-Built RoadDefect-MT Dataset</title>
<p>A DJI Mini 3 Pro UAV (equipped with a 4K/60 fps camera) was flown at a height of 2&#x2013;5 m to ensure that pixel-level diseases are visible. 4K video of asphalt roads was captured on roads in Jiangning District, Nanjing, and on the campus of Nanjing University of Aeronautics and Astronautics. The systematic inspection covers different lighting conditions (sunny and cloudy) and shooting angles (overhead and oblique) to ensure data diversity.</p>
<p>The constructed RoadDefect-MT (Measurement &#x0026; Tracking) dataset consists of 33 video clips (each lasting 2&#x2013;18 s, with a resolution of 3840 &#x00D7; 2160 and a frame rate of 30 fps), totaling 3390 frames, each meticulously annotated. The dataset covers four typical types of road defects: transverse cracks, longitudinal cracks, mesh cracks, and crack patches. Their distribution is not uniform: crack patches are the most prevalent, while the remaining three categories are roughly evenly distributed, giving an overall ratio of approximately 1 (transverse): 1 (longitudinal): 1 (mesh): 2 (patches). This distribution reflects the real-world situation in which repaired areas are commonly encountered in road networks. To suit the dynamic detection task, the annotation follows the MOT (Multi-Object Tracking) format, and the annotated objects include defect categories, bounding boxes, and motion tracks. The video capture mode covers five UAV motion states, namely forward shooting, <italic>in situ</italic> rotation, forward rotation, and horizontal- and vertical-screen switching, to better meet the needs of different scenes. <xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows annotated samples from the dataset.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Some samples of RoadDefect-MT dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-7.tif"/>
</fig>
<p>The RoadDefect-MT dataset has high resolution, providing pixel-level defect details that support the accurate localization of small targets (such as fine cracks), and its multi-mode UAV motion simulates real inspection scenes, reducing the model&#x2019;s dependence on a fixed shooting angle.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiment</title>
<p>In this study, high-performance computing equipment was used to train the deep learning models. The hardware configuration comprised a 12th-generation Intel&#x00AE; Core&#x2122; i9-12900K CPU, an NVIDIA GeForce RTX 3090 GPU, and 64 GB of memory. The software environment was based on the Python 3.8 programming language, with the PyTorch framework used for model development and the CUDA 11.1 acceleration library used to optimize GPU computing performance.</p>
<p>A phased training strategy was adopted in this system to ensure stable and efficient learning. The entire process consists of two core stages:</p>
<p>In the first stage, the focus is on training the VitSeg-Det integrated detection and segmentation network. This network employs an end-to-end joint training approach, with a backbone shared by both detection and segmentation tasks. During training, the model simultaneously receives sample data annotated with bounding boxes and segmentation masks. In the forward pass, the network computes both detection loss and segmentation loss in parallel. These two losses are combined into a total loss function via weighted summation, and then the weights of the shared backbone and the two task-specific heads are updated synchronously through backpropagation. This design enables the detection and segmentation tasks to optimize collaboratively and mutually reinforce each other, allowing the shared backbone to learn more general and robust feature representations.</p>
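<p>The weighted summation described above can be sketched as follows; the helper name and the default weight values are illustrative assumptions, not the settings used in this work:</p>

```python
def joint_loss(det_loss, seg_loss, w_det=1.0, w_seg=1.0):
    """Combine the two task losses into a single scalar.

    Backpropagating this total loss updates the shared backbone and
    both task-specific heads synchronously, as described in the first
    training stage. The weight values here are illustrative only.
    """
    return w_det * det_loss + w_seg * seg_loss


# Example: detection loss 0.8, segmentation loss 0.4, with the
# segmentation term down-weighted by 0.5.
total = joint_loss(0.8, 0.4, w_det=1.0, w_seg=0.5)
```

<p>In practice the loss weights are treated as hyperparameters that balance the two tasks during joint optimization.</p>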
<p>In the second stage, after the VitSeg-Det network is fully trained, all of its weights are frozen, and the TransTra-Count tracking and counting network is trained on this basis. Specifically, consecutive video frames are fed into the frozen VitSeg-Det network to obtain precise bounding boxes, category confidence scores, and corresponding deep appearance features for all cracks in each frame. This information (boxes &#x002B; features), together with the tracking trajectories from the previous frame, forms training sample pairs that are input into the TransTra-Count network. The core task of this network is to learn data association, i.e., to determine which target in the current frame corresponds to which track in the previous frame for the same crack entity, and through training to develop the ability to maintain consistent target IDs in complex scenarios.</p>
<p>For model performance evaluation, this research constructed a multi-dimensional evaluation system covering the crack segmentation, detection, and tracking tasks. For crack segmentation, we use the mean Intersection over Union (mIoU) as the core metric, evaluating pixel-level localization accuracy from the overlap between predicted and ground-truth regions. For crack detection, we take mean Average Precision (mAP) as the main metric, combined with detection performance at specific IoU thresholds (such as AP50) and supplemented by precision, recall, F1 score, floating-point operations (FLOPs), and frames per second (FPS), so as to evaluate both detection accuracy and efficiency. For the more challenging crack tracking task, we introduce higher-order metrics: HOTA (Higher Order Tracking Accuracy) comprehensively measures the combined performance of detection and association, DetA (Detection Accuracy) and AssA (Association Accuracy) quantify pure detection accuracy and pure association accuracy respectively, and the IDF1 score evaluates identity-preservation capability. In addition, to analyze tracking quality comprehensively, we introduce the area under the curve (AUC) to assess tracking stability, use a P-norm to compute the trajectory prediction error, and combine these with the accuracy metrics to verify the reliability of the tracking results.</p>
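<p>For reference, the two core overlap measures used above (pixel-level IoU for segmentation and box-level IoU for detection) can be computed as in the following sketch; the function names are illustrative:</p>

```python
def binary_iou(pred, target):
    """Pixel-level IoU of two flat binary masks (lists of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0  # both masks empty -> perfect

def mean_iou(mask_pairs):
    """mIoU: mean of the per-pair IoU scores."""
    scores = [binary_iou(p, t) for p, t in mask_pairs]
    return sum(scores) / len(scores)

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

<p>Detection metrics such as AP50 count a prediction as correct when its box IoU with a ground-truth box exceeds 0.5.</p>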
<sec id="s5_1">
<label>5.1</label>
<title>Sampled-ViT Ablation Experiment</title>
<p>In <xref ref-type="sec" rid="s3_1_2">Section 3.1.2</xref>, we designed a lightweight scoring network to obtain <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. When extracting <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <italic>&#x03C9;</italic> adaptively changes, thus controlling the sampling number N of the feature vector <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mover><mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2192;</mml:mo></mml:mover></mml:math></inline-formula>. The generation of dynamic &#x03C9; directly depends on the statistical characteristics of the scoring matrix <italic>S</italic>, which is specifically shown as follows:
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mi>&#x03C9;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a dynamic <italic>&#x03C9;</italic>-generating function, which can be a statistical calculation or a lightweight network (such as an MLP). If <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denotes the statistics-based method and <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> the MLP-based method, then
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>std</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>GAP</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>&#x03B1;</italic> and <italic>&#x03B2;</italic> are learnable parameters, <italic>&#x03BC;</italic>(<italic>S</italic>) is the mean of the scores, std(<italic>S</italic>) is their standard deviation, and GAP denotes global average pooling.</p>
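<p>A minimal sketch of the statistics-based variant in Eq. (21) is given below; the sigmoid bounds the raw statistic before it is rescaled into [&#x03C9;<sub>min</sub>, &#x03C9;<sub>max</sub>]. The default values of &#x03B1;, &#x03B2;, &#x03C9;<sub>min</sub>, and &#x03C9;<sub>max</sub> are illustrative assumptions (in this work &#x03B1; and &#x03B2; are learnable):</p>

```python
import math

def stat_omega(scores, alpha=1.0, beta=1.0, w_min=0.3, w_max=0.7):
    """Eq. (21): omega_1 = sigmoid(alpha*mu(S) + beta*std(S)),
    rescaled into [w_min, w_max]. alpha/beta are learnable in the
    paper; they are fixed constants here for illustration."""
    n = len(scores)
    mu = sum(scores) / n                          # mean score mu(S)
    var = sum((s - mu) ** 2 for s in scores) / n
    std = math.sqrt(var)                          # std(S)
    sig = 1.0 / (1.0 + math.exp(-(alpha * mu + beta * std)))
    return sig * (w_max - w_min) + w_min
```

<p>Because the sigmoid output lies in (0, 1), the returned &#x03C9; can never leave [&#x03C9;<sub>min</sub>, &#x03C9;<sub>max</sub>], consistent with the bounded fluctuation range reported for Stat-&#x03C9; in Table 1.</p>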
<p>In <xref ref-type="table" rid="table-1">Table 1</xref>, the performance of the two dynamic &#x03C9; generation methods is verified on the pavement crack segmentation task. The table shows that Stat-<italic>&#x03C9;</italic> is slightly better than MLP-<italic>&#x03C9;</italic> (&#x002B;0.59% mIoU), because the statistics are more stable and avoid the MLP overfitting to noise. The FPS of Stat-<italic>&#x03C9;</italic> is close to that of a fixed <italic>&#x03C9;</italic>, because it requires only simple tensor operations and no additional trainable parameters, whereas MLP-<italic>&#x03C9;</italic> requires an extra forward pass and parameter updates, which increase latency and memory occupation and reduce FPS by 23%. The <italic>&#x03C9;</italic> range of MLP-<italic>&#x03C9;</italic> is larger, but some extreme values (such as <italic>&#x03C9;</italic> &#x003C; 0.3) lead to missed detections.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Evaluation of dynamic &#x03C9; generation mode</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center" width="33mm"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>mIoU (%)</th>
<th>F1-score</th>
<th>FPS</th>
<th>Video memory occupation (MB)</th>
<th>&#x03C9; fluctuation range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fixed &#x03C9; &#x003D; 0.5</td>
<td>42.12</td>
<td>0.535</td>
<td>38</td>
<td>1240</td>
<td>[0.5, 0.5]</td>
</tr>
<tr>
<td>Fixed &#x03C9; &#x003D; 0.3</td>
<td>38.51</td>
<td>0.535</td>
<td>45</td>
<td>1180</td>
<td>[0.3, 0.3]</td>
</tr>
<tr>
<td>Fixed &#x03C9; &#x003D; 0.7</td>
<td>43.82</td>
<td>0.528</td>
<td>32</td>
<td>1320</td>
<td>[0.7, 0.7]</td>
</tr>
<tr>
<td>Stat-&#x03C9;</td>
<td>44.12</td>
<td>0.558</td>
<td>36</td>
<td>1260</td>
<td>[0.32, 0.68]</td>
</tr>
<tr>
<td>MLP-&#x03C9;</td>
<td>43.53</td>
<td>0.549</td>
<td>28</td>
<td>1350</td>
<td>[0.25, 0.75]</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In this study, the statistics-based dynamic &#x03C9;-generating function is used to achieve optimal performance. When deploying on edge devices, to balance computational efficiency and accuracy, a fixed <italic>&#x03C9;</italic> value can be selected according to the characteristics of the scene: <italic>&#x03C9;</italic> &#x003D; 0.5 is recommended for conventional scenarios; <italic>&#x03C9;</italic> &#x003D; 0.3 can be chosen for sparse-crack scenes (such as expressways) to improve computational efficiency; and <italic>&#x03C9;</italic> &#x003D; 0.7 is recommended for areas with dense cracks (such as old pavement) to enhance robustness. On servers, the dynamic <italic>&#x03C9;</italic> mechanism is retained to ensure the highest detection accuracy. This hierarchical strategy achieves an adaptive balance between computing resources and detection accuracy.</p>
<p>As shown in <xref ref-type="table" rid="table-2">Table 2</xref>, the statistical approach demonstrates greater stability and effectively prevents the MLP from overfitting to noise. Specifically, Gaussian noise at different levels (variances &#x03C3;<sup>2</sup> &#x003D; 0.01, 0.02, and 0.03) was added to the test set images, and the performance of the three methods (fixed &#x03C9; &#x003D; 0.5, Stat-&#x03C9;, and MLP-&#x03C9;) was evaluated accordingly.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Noise sensitivity test of dynamic &#x03C9; generation methods</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Gaussian noise (&#x03C3;<sup>2</sup>)</th>
<th>Method</th>
<th>mIoU (%)</th>
<th>&#x0394;mIoU</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Noise-free</td>
<td>Fixed &#x03C9; &#x003D; 0.5</td>
<td>42.12</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Stat-&#x03C9;</td>
<td>44.12</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>MLP-&#x03C9;</td>
<td>43.53</td>
<td>&#x2013;</td>
</tr>
<tr>
<td rowspan="3">0.01</td>
<td>Fixed &#x03C9; &#x003D; 0.5</td>
<td>40.15</td>
<td>&#x2212;1.97</td>
</tr>
<tr>
<td>Stat-&#x03C9;</td>
<td>42.78</td>
<td>&#x2212;1.34</td>
</tr>
<tr>
<td>MLP-&#x03C9;</td>
<td>40.89</td>
<td>&#x2212;2.64</td>
</tr>
<tr>
<td rowspan="3">0.02</td>
<td>Fixed &#x03C9; &#x003D; 0.5</td>
<td>38.21</td>
<td>&#x2212;3.91</td>
</tr>
<tr>
<td>Stat-&#x03C9;</td>
<td>40.95</td>
<td>&#x2212;3.17</td>
</tr>
<tr>
<td>MLP-&#x03C9;</td>
<td>37.12</td>
<td>&#x2212;6.41</td>
</tr>
<tr>
<td rowspan="3">0.03</td>
<td>Fixed &#x03C9; &#x003D; 0.5</td>
<td>35.04</td>
<td>&#x2212;7.08</td>
</tr>
<tr>
<td>Stat-&#x03C9;</td>
<td>38.26</td>
<td>&#x2212;5.86</td>
</tr>
<tr>
<td>MLP-&#x03C9;</td>
<td>33.58</td>
<td>&#x2212;9.95</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results in <xref ref-type="table" rid="table-2">Table 2</xref>, under different noise levels, show that the decrease in mIoU (&#x0394;mIoU) of Stat-&#x03C9; is consistently smaller than that of MLP-&#x03C9;. For example, at &#x03C3;<sup>2</sup> &#x003D; 0.03, the performance of Stat-&#x03C9; declines by 5.86%, while MLP-&#x03C9; exhibits a decrease of nearly 10%. This quantitatively demonstrates that MLP-&#x03C9;, due to its learnable parameters, is more prone to learning and amplifying noise present in the training data (i.e., overfitting), leading to a sharp performance degradation on noisy inputs. In contrast, Stat-&#x03C9; relies on simple statistical measures and lacks learning capacity, making it insensitive to noise and thus exhibiting stronger generalization robustness.</p>

<p>The results in <xref ref-type="table" rid="table-3">Table 3</xref> show that &#x03BC;(S) and std(S) have a significant synergistic effect in the Stat-&#x03C9; method. When only &#x03BC;(S) is used, mIoU is 43.1%, indicating that relying solely on the global statistic leads to insufficient perception of local saliency; when only std(S) is used, mIoU is 42.8%, reflecting that focusing only on local variation reduces sensitivity to the overall structure. When &#x03BC;(S) and std(S) are used together, mIoU increases to 44.12%, demonstrating that the combination effectively balances global features and local details.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Evaluation of the Stat-<italic>&#x03C9;</italic> method</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Statistic combination</th>
<th>mIoU (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Only &#x03BC;(S)</td>
<td>43.1</td>
</tr>
<tr>
<td>Only std(S)</td>
<td>42.8</td>
</tr>
<tr>
<td>&#x03BC;(S) &#x002B; std(S)</td>
<td>44.12</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To systematically evaluate the contribution of key modules in the proposed VitSeg-Det framework, a series of ablation studies was conducted on the UAPD and UVA-PDD2023 datasets. The experiments incrementally incorporate each submodule, with results summarized in <xref ref-type="table" rid="table-4">Table 4</xref>. All experiments employed EfficientNet-b5 as the backbone network and identical training parameter settings to ensure fair and reliable comparisons. In the experimental design, the fusion of fine-grained and macro-scale features was implemented using the joint &#x03BC;(S) &#x002B; std(S) statistical method.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Contribution analysis of submodules in VitSeg-det for segmentation and detection performance</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th><inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>Dataset</th>
<th>mIoU</th>
<th>mAP</th>
<th>AP50</th>
<th>APs</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2"></td>
<td rowspan="2"></td>
<td>UAPD</td>
<td>44.6</td>
<td>43.2</td>
<td>66.4</td>
<td>22.4</td>
</tr>
<tr>
<td>UVA-PDD2023</td>
<td>&#x2013;</td>
<td>45.3</td>
<td>64.7</td>
<td>31.6</td>
</tr>
<tr>
<td rowspan="2">&#x00D6;</td>
<td rowspan="2"></td>
<td>UAPD</td>
<td>46.1</td>
<td>46.1</td>
<td>67.3</td>
<td>25.1</td>
</tr>
<tr>	
<td>UVA-PDD2023</td>
<td>&#x2013;</td>
<td>47.3</td>
<td>66.4</td>
<td>34.1</td>
</tr>
<tr>
<td rowspan="2"></td>
<td rowspan="2">&#x00D6;</td>
<td>UAPD</td>
<td>46.6</td>
<td>46.4</td>
<td>68.0</td>
<td>24.0</td>
</tr>
<tr>
<td>UVA-PDD2023</td>
<td>&#x2013;</td>
<td>46.9</td>
<td>65.2</td>
<td>34.0</td>
</tr>
<tr>
<td rowspan="2">&#x00D6;</td>
<td rowspan="2">&#x00D6;</td>
<td>UAPD`</td>
<td>47.0</td>
<td>48.0</td>
<td>69.3</td>
<td>25.8</td>
</tr>
<tr>
<td>UVA-PDD2023</td>
<td>&#x2013;</td>
<td>47.9</td>
<td>67.7</td>
<td>35.0</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-4fn1" fn-type="other">
<p>Note: &#x2713; indicates that the corresponding module is selected.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The experimental results demonstrate that both the fine-grained feature module and the macro-scale feature module in the VitSeg-Det model contribute significantly to performance. When used in combination, the model achieves its best overall performance, with segmentation mIoU rising to 47.0 and detection mAP reaching 48.0 on the UAPD dataset. The fine-grained feature module notably enhances small-target detection (reflected in a significant improvement in APs), while the macro-scale feature module more substantially improves segmentation accuracy and the detection of regular-sized objects. The two modules are complementary, and their synergy yields optimal results across all evaluation metrics; this consistent trend is observed on both the UAPD and UVA-PDD2023 datasets.</p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>CrackDSF-LMe Ablation Experiment</title>
<p>In this section, we study several components of the model: spatial-feature dual-modal data association, adaptive appearance aggregation, light gating, and long-term memory. The main contribution is the ability to accurately identify cracks across the video stream and to avoid repeated counting.</p>
<p>In this paper, we propose a data association method, DSFM (spatial-feature dual-modal data association), which controls the fusion ratio between the IoU measurement and the attention-based feature-similarity calculation through the adjustable parameter <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. To verify the effectiveness of this method, we designed four groups of comparative experiments, with results shown in <xref ref-type="table" rid="table-5">Table 5</xref>. Type 1 (<inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 1), the baseline, uses only the IoU matching strategy. Although its computation is fastest (average processing time 28 ms/frame), making it suitable for high-frame-rate scenes, its AssA is only 58.7% owing to the lack of feature information, and frequent ID switching occurs. Type 2 (<inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0) depends entirely on feature-similarity matching and performs well under occlusion: AssA improves markedly to 70.2%, but DetA decreases by 8.0% because some blurred crack regions are falsely associated. Type 3 adopts a hybrid strategy (<inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.6): while maintaining real-time operation (45 ms/frame), it achieves absolute gains of 15.9% in HOTA and 18.3% in IDF1, significantly improving identity retention. Although type 4 (<inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.8) raises DetA slightly by 3.1%, its overall performance is inferior to type 3, indicating that a moderate feature-fusion ratio (about 0.6) achieves the best balance between detection accuracy and association accuracy. The experimental results show that DSFM effectively overcomes the limitations of a single association strategy by dynamically adjusting the fusion ratio.</p>
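<p>The fusion controlled by &#x03BB;<sub>1</sub> can be sketched as a convex combination of an IoU cost and an appearance cost; the exact fusion form used inside DSFM may differ, so the following is an illustrative sketch under that assumption:</p>

```python
def dsfm_cost(iou, feat_sim, lam1=0.6):
    """Fused association cost for one detection-track pair.

    lam1 = 1 reduces to pure IoU matching (the type 1 baseline);
    lam1 = 0 reduces to pure feature-similarity matching (type 2).
    Both inputs are assumed normalized to [0, 1]; the linear
    fusion form is an illustrative assumption.
    """
    return lam1 * (1.0 - iou) + (1.0 - lam1) * (1.0 - feat_sim)
```

<p>A lower cost indicates a better match; the cost matrix over all detection-track pairs would then feed a standard assignment step (e.g., Hungarian matching).</p>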
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Influence based on different <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> values in DSFM</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>HOTA&#x2191;</th>
<th>AssA&#x2191;</th>
<th>IDF1&#x2191;</th>
<th>DetA&#x2191;</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula></td>
<td>62.3</td>
<td>58.7</td>
<td>65.1</td>
<td>67.2</td>
<td>35</td>
</tr>
<tr>
<td>Only features <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></td>
<td>68.4</td>
<td>70.2</td>
<td>71.5</td>
<td>61.8</td>
<td>28</td>
</tr>
<tr>
<td>Mixture <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.6</mml:mn></mml:math></inline-formula></td>
<td>72.2</td>
<td>73.9</td>
<td>77.0</td>
<td>69.8</td>
<td>22</td>
</tr>
<tr>
<td>Mixture <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.8</mml:mn></mml:math></inline-formula></td>
<td>71.8</td>
<td>71.5</td>
<td>76.6</td>
<td>70.3</td>
<td>20</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="sec" rid="s3_2_2">Section 3.2.2</xref>, we designed the ULMeM module, which dynamically fuses object features from adjacent frames and models inter-frame lighting changes to avoid the impact of sharp illumination variation. We decompose this structure in <xref ref-type="table" rid="table-6">Table 6</xref>: the first row uses only single-frame features as the baseline, the second replaces the adaptive weights with fixed weights, the third removes illumination compensation, and the last is the complete model. As <xref ref-type="table" rid="table-6">Table 6</xref> shows, compared with W<sub><italic>g2</italic></sub> &#x003D; 0.5, the adaptive weight raises HOTA by about 12.3% (65.7 &#x2192; 73.8), indicating that dynamically adjusting the memory-retention ratio is necessary for occluded or blurred scenes; IDF1 also increases markedly (71.8 &#x2192; 78.7), verifying that adjusting the memory weight in response to appearance differences significantly reduces ID-switching errors. Compared with the complete model, removing illumination compensation lowers DetA by 2.9% (71.4 &#x2192; 68.5), indicating a higher false-detection rate under sudden illumination changes; histogram moment matching effectively improves tracking stability. Relative to the baseline, the complete model&#x2019;s FPS drops from 36 to 28 because multi-frame computation adds cost, but accuracy improves substantially (HOTA &#x002B;11.5), making it suitable for high-precision scenarios. All indicators of the complete model are optimal, showing that the joint modeling of illumination, adaptive appearance aggregation, and memory updating comprehensively addresses blur, occlusion, and brightness change.</p>
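<p>A hedged sketch of the adaptive aggregation idea follows: instead of the fixed W<sub><italic>g2</italic></sub> &#x003D; 0.5 baseline of Table 6, the memory-retention weight reacts to the appearance difference between the stored memory and the current frame feature. Using cosine similarity as the gate, and retaining more memory when the new frame looks unreliable, are illustrative assumptions, not the exact ULMeM formulation:</p>

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two feature vectors (plain lists)."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na and nb else 0.0

def adaptive_update(memory, feature):
    """Blend the stored appearance memory with the current feature.

    When the new feature is dissimilar to the memory (e.g., under
    occlusion or blur), more of the old memory is retained; when it
    is similar, the memory tracks the new observation. This gating
    rule is an illustrative assumption.
    """
    w_keep = 1.0 - max(0.0, cosine_sim(memory, feature))
    return [w_keep * m + (1.0 - w_keep) * f
            for m, f in zip(memory, feature)]
```

<p>With an identical feature the memory simply follows the observation, while an orthogonal (dissimilar) feature leaves the memory unchanged, mimicking occlusion handling.</p>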
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Contributions of the individual components in the ULMeM</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>HOTA&#x2191;</th>
<th>AssA&#x2191;</th>
<th>IDF1&#x2191;</th>
<th>DetA&#x2191;</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>62.3</td>
<td>63.1</td>
<td>68.5</td>
<td>61.2</td>
<td>36</td>
</tr>
<tr>
<td>Fixed weights <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.5</td>
<td>65.7</td>
<td>66.6</td>
<td>71.8</td>
<td>64.9</td>
<td>33</td>
</tr>
<tr>
<td>No illumination compensation</td>
<td>69.9</td>
<td>70.4</td>
<td>74.5</td>
<td>68.5</td>
<td>31</td>
</tr>
<tr>
<td>Full adaptive aggregation</td>
<td>73.8</td>
<td>74.6</td>
<td>78.7</td>
<td>71.4</td>
<td>28</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The long-term memory proposed in <xref ref-type="sec" rid="s3_2_2">Section 3.2.2</xref> exploits longer-range temporal information and injects it into subsequent track embeddings to enhance the object features. We studied the performance of this long-term memory. As shown in <xref ref-type="table" rid="table-7">Table 7</xref>, as <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> increases gradually from 0.1 to 0.4, the model attains its highest HOTA score at <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.3, while the DetA score decreases slightly. This indicates that moderate memory updating achieves a balance between feature stability and adaptability. Based on the experimental results, we suggest treating <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> as a tunable hyperparameter to be optimized for the needs of specific scenarios.</p>
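<p>The effect of &#x03BB;<sub>2</sub> can be illustrated as an exponential-moving-average update of the track embedding; the EMA form is a hedged sketch, not necessarily the exact update rule of the ULMeM long-term memory:</p>

```python
def update_long_memory(embed_old, feat_new, lam2=0.3):
    """Long-term memory update sketched as an EMA.

    A small lam2 keeps the stored track embedding stable; a large
    lam2 lets new frame features overwrite it quickly. Table 7 finds
    lam2 = 0.3 gives the best HOTA in these experiments. The EMA
    form itself is an illustrative assumption.
    """
    return [(1.0 - lam2) * e + lam2 * f
            for e, f in zip(embed_old, feat_new)]
```

<p>This makes the stability/adaptability trade-off explicit: the stored embedding decays toward the newest observation at a rate set by &#x03BB;<sub>2</sub>.</p>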
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Influence of different <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> values in the ULMeM</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>HOTA&#x2191;</th>
<th>AssA&#x2191;</th>
<th>IDF1&#x2191;</th>
<th>DetA&#x2191;</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.1</td>
<td>60.3</td>
<td>61.8</td>
<td>65.7</td>
<td>59.4</td>
<td>&#x2013;</td>
</tr>
<tr>
<td><inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.15</td>
<td>66.7</td>
<td>67.6</td>
<td>72.8</td>
<td>65.9</td>
<td>&#x2013;</td>
</tr>
<tr>
<td><inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.2</td>
<td>70.8</td>
<td>71.2</td>
<td>76.6</td>
<td>69.4</td>
<td>&#x2013;</td>
</tr>
<tr>
<td><inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.3</td>
<td>73.8</td>
<td>74.6</td>
<td>78.7</td>
<td>71.4</td>
<td><inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.3</td>
</tr>
<tr>
<td><inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.4</td>
<td>71.1</td>
<td>70.3</td>
<td>76.7</td>
<td>71.9</td>
<td><inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.4</td>
</tr>
</tbody>
</table>
</table-wrap>
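<p>The <inline-formula id="ieqn-112a"><mml:math id="mml-ieqn-112a"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> ablated above acts as the update rate of the long-term memory in the ULMeM. As an illustrative sketch only (the function and variable names here are hypothetical, and the exact formulation is given by the module equations, not this simplification), the update can be viewed as an exponential moving average of each track&#x2019;s appearance feature:</p>

```python
import numpy as np

def update_memory(memory: np.ndarray, feature: np.ndarray, lam2: float = 0.3) -> np.ndarray:
    """EMA update of a track's long-term appearance memory.

    lam2 is the update rate: larger values weight the newest frame more,
    smaller values make the memory more conservative.
    """
    # Keep both vectors on the unit sphere so cosine similarity is meaningful.
    feature = feature / (np.linalg.norm(feature) + 1e-12)
    memory = (1.0 - lam2) * memory + lam2 * feature
    return memory / (np.linalg.norm(memory) + 1e-12)

# A crack whose appearance is stable pulls the memory toward itself,
# even when each frame's feature is noisy.
rng = np.random.default_rng(0)
target = rng.normal(size=128)
target /= np.linalg.norm(target)
mem = rng.normal(size=128)
mem /= np.linalg.norm(mem)
for _ in range(50):
    mem = update_memory(mem, target + 0.05 * rng.normal(size=128), lam2=0.3)
print(float(np.dot(mem, target)) > 0.9)  # → True
```

<p>This view also matches the ablation trend: too small a rate adapts slowly to appearance changes, while too large a rate lets single-frame noise corrupt the memory.</p>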
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Detection and Segmentation Qualitative and Quantitative Analysis</title>
<p>Since our segmentation-detection model is a multi-task model with Sampled-ViT as a shared encoder for global feature extraction, we unified the datasets to suit multi-task training. First, the training and validation sets of the DeepCrack dataset were merged and bounding-box annotations were generated; each image was then expanded into a 30-frame video sequence through translation transformations, producing frame-by-frame segmentation masks and bounding-box annotations. In addition, we manually segmented and annotated 100 crack images from the UVA-PDD2023 dataset and generated 30-frame video data with the corresponding multi-task annotations using translation and scaling transformations. In the test phase, the UVA-PDD2023 test set was used to comprehensively evaluate the generalization performance of the model.</p>
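<p>The translation-based expansion described above can be sketched as follows (an illustrative reconstruction, not our exact pipeline: <monospace>translate</monospace>, <monospace>make_sequence</monospace>, and the step size are hypothetical, and clipping of boxes at the image border is omitted):</p>

```python
import numpy as np

def translate(img: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an image dx pixels right and dy pixels down, zero-padding
    the border exposed by the shift."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out

def make_sequence(image, mask, box, n_frames=30, step=(3, 1)):
    """Expand one annotated still into an n_frames pseudo-video: frame t
    is the image shifted by t*step, with the segmentation mask shifted
    identically and the box corners offset to match."""
    x0, y0, x1, y1 = box
    frames = []
    for t in range(n_frames):
        dx, dy = t * step[0], t * step[1]
        frames.append((translate(image, dx, dy),
                       translate(mask, dx, dy),
                       (x0 + dx, y0 + dy, x1 + dx, y1 + dy)))
    return frames

img = np.arange(100, dtype=np.uint8).reshape(10, 10)
msk = (img > 50).astype(np.uint8)
seq = make_sequence(img, msk, box=(1, 1, 4, 4), n_frames=5, step=(2, 0))
print(len(seq), seq[3][2])  # 5 frames; frame 3's box is shifted by (6, 0)
```

<p>Because the image, mask, and box receive the same shift per frame, segmentation masks and detection boxes stay aligned by construction across the synthetic sequence.</p>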
<p>For the performance analysis of Sampled-ViT, we conducted systematic comparative experiments on all test images, as illustrated in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. The benchmark models were selected according to the following principles: (1) Task relevance: the selected models (e.g., Mask2Former, DPT) are advanced representatives of semantic or general segmentation, so their design objectives are highly relevant to our crack segmentation task. (2) Architectural comparability: we specifically included Transformer-based models (e.g., SegGPT, DPT) to enable fair comparisons of efficiency (parameter count, FLOPs) and accuracy across similar architectures. (3) Influence and versatility: as a pioneering work in context-aware segmentation, SegGPT provides an important performance reference. The results show that Sampled-ViT with the EfficientNetB5 backbone exhibits the strongest fine-grained segmentation capability in the road crack detection task: it generates finer, more continuous crack segments and significantly improves the detection rate of microcracks.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Segmentation results: from left to right, the original image, SegGPT, Mask2Former, DPT, and ours</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-8.tif"/>
</fig>
<p>As shown in <xref ref-type="table" rid="table-8">Table 8</xref>, the proposed method strikes an excellent performance balance in video segmentation tasks. Compared with mainstream models, our method achieves 45.2% mIoU with only 91 M parameters (ViT-Base backbone), surpassing SegGPT (41.2%) in accuracy, while its computational cost (157 G FLOPs) is significantly lower than that of DPT and SegGPT, which use similar ViT-Base architectures. Although Mask2Former (Swin Transformer) performs best with 52.6% mIoU, its 219 M parameters and 520 G FLOPs entail a much higher computing cost. These results show that the proposed method is lightweight while maintaining competitive segmentation accuracy, making it particularly suitable for real-time video analysis scenarios with limited computing resources.</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Comparison of Transformer- and CNN-based segmentation networks</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>mIoU</th>
<th>Params</th>
<th>FLOPs (G)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SegGPT</td>
<td>41.2</td>
<td>103</td>
<td>203</td>
</tr>
<tr>
<td>Mask2Former</td>
<td>52.6</td>
<td>219</td>
<td>520</td>
</tr>
<tr>
<td>DPT</td>
<td>48.2</td>
<td>102</td>
<td>206</td>
</tr>
<tr>
<td>CrackFormer</td>
<td>43.8</td>
<td>4.96</td>
<td>72</td>
</tr>
<tr>
<td>UNet</td>
<td>42.4</td>
<td>45</td>
<td>131</td>
</tr>
<tr>
<td>PidiNet</td>
<td>38.1</td>
<td>0.59</td>
<td>13</td>
</tr>
<tr>
<td>PIDNet</td>
<td>49.4</td>
<td>110</td>
<td>178</td>
</tr>
<tr>
<td>RIND</td>
<td>41.9</td>
<td>59</td>
<td>101</td>
</tr>
<tr>
<td>DPT</td>
<td>50.9</td>
<td>308</td>
<td>530</td>
</tr>
<tr>
<td>Ours</td>
<td>45.2</td>
<td>91</td>
<td>157</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For model validation, we conducted a comprehensive performance evaluation of the trained model on the standard test set, using accuracy, precision, recall, and F1 score to compare against CNN-based methods. The results show that the Transformer-optimized detection network achieves a significant improvement: as shown in <xref ref-type="table" rid="table-9">Table 9</xref>, our method reaches an F1 score of 74.13 and a precision of 70.20%, which are 23.4% and 9.6% higher, respectively, than the best CNN model (YOLOv8). From the per-algorithm comparison in <xref ref-type="table" rid="table-10">Table 10</xref>, under the same input size (31, 333, 800), our method delivers strong detection accuracy while maintaining a real-time processing speed of more than 16 fps. Although its parameter count (41.55 M) is slightly higher than some comparison methods, its computational cost (81.63 G FLOPs) is lower than that of DETR (86.93 G FLOPs) and far below that of traditional CNN methods (e.g., Faster RCNN, 0.178 T FLOPs). As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, our method and DETR detect the various defect types stably, whereas SSD and the YOLO series are faster but perform poorly on this dataset, and Faster RCNN exhibits obvious false and missed detections. These results confirm the advantages of the proposed method in both accuracy and efficiency; in future work we will further improve real-time performance through code optimization to meet high-precision real-time detection requirements.</p>
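<p>For reference, the precision, recall, and F1 values reported here are computed from matched-detection tallies in the standard way; the counts in the sketch below are illustrative only, not the paper&#x2019;s:</p>

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from matched-detection counts: a
    prediction counts as a TP when its IoU with a ground-truth box
    exceeds the chosen matching threshold (e.g., 0.5)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only, not the paper's tallies.
m = detection_metrics(tp=82, fp=18, fn=18)
print({k: round(v, 2) for k, v in m.items()})  # → {'precision': 0.82, 'recall': 0.82, 'f1': 0.82}
```
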
<table-wrap id="table-9">
<label>Table 9</label>
<caption>
<title>The comparison of CNN-based and Transformer-based detection network (a)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Backbone</th>
<th>Epochs</th>
<th>mAP</th>
<th>AP50</th>
<th>Precision</th>
<th>Recall</th>
<th>F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ours</td>
<td>R50</td>
<td>150</td>
<td>45.2</td>
<td>78.5</td>
<td>70.20</td>
<td>82.14</td>
<td>74.13</td>
</tr>
<tr>
<td>DETR</td>
<td>R50</td>
<td>150</td>
<td>42.0</td>
<td>76.3</td>
<td>63.80</td>
<td>76.93</td>
<td>69.75</td>
</tr>
<tr>
<td>Faster RCNN</td>
<td>R50</td>
<td>12</td>
<td>34.1</td>
<td>58.7</td>
<td>54.81</td>
<td>58.80</td>
<td>55.73</td>
</tr>
<tr>
<td>SSD</td>
<td>SSDVGG</td>
<td>12</td>
<td>10.8</td>
<td>28.1</td>
<td>66.40</td>
<td>22.57</td>
<td>33.69</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>R50</td>
<td>300</td>
<td>18.4</td>
<td>43.5</td>
<td>63.41</td>
<td>49.80</td>
<td>56.78</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-10">
<label>Table 10</label>
<caption>
<title>The comparison of CNN-based and Transformer-based detection network (b)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Backbone</th>
<th>Epochs</th>
<th>FLOPS (G)</th>
<th>Params(M)</th>
<th>Input</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ours</td>
<td>R50</td>
<td>150</td>
<td>81.63</td>
<td>41.55</td>
<td>(31, 333, 800)</td>
</tr>
<tr>
<td>DETR</td>
<td>R50</td>
<td>150</td>
<td>86.93</td>
<td>43.70</td>
<td>(31, 333, 800)</td>
</tr>
<tr>
<td>Faster RCNN</td>
<td>R50</td>
<td>12</td>
<td>0.178 T</td>
<td>41.37</td>
<td>(31, 333, 800)</td>
</tr>
<tr>
<td>SSD</td>
<td>SSDVGG</td>
<td>12</td>
<td>0.345 T</td>
<td>25.12</td>
<td>(31, 333, 800)</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>R50</td>
<td>12</td>
<td>0.224 T</td>
<td>54.15</td>
<td>(31, 333, 800)</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Visual comparison of test results</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-9.tif"/>
</fig>
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Tracking Segmentation Qualitative and Quantitative Analysis</title>
<p>To verify the performance of the tracking model, we used the UVA-PDD dataset to generate simulated video sequences, expanding each image into a 30-frame sequence through translation transformations to mimic continuous frame input in real scenes. In the test phase, the RoadDefect-MT dataset was used for generalization evaluation. Experimental results show that the proposed tracking network achieves an average accuracy of 97.1% and an F1 score of 0.84 on the test set. As shown in <xref ref-type="fig" rid="fig-10">Fig. 10</xref>, the precision-recall (PR) curve of pavement defect detection is close to the upper-right corner of the coordinate system, indicating that the network tracks crack targets stably and accurately.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Precision recall (PR) curve and F1 score performance analysis</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-10.tif"/>
</fig>
<p>To verify the advancement of the proposed TransTra-Count method, we conducted a comprehensive comparison between it and current mainstream multi-object tracking algorithms on the RoadDefect-MT dataset. As a classic paradigm, DeepSORT integrates motion models and appearance features, serving as the foundation for many subsequent studies and thus adopted here as a performance baseline. ByteTrack focuses on motion models as its core and significantly enhances association robustness by effectively utilizing low-score detection boxes, representing the advanced level of the current technical route in this field. BoT-SORT and OC-SORT, respectively, introduce camera motion compensation and trajectory smoothing optimization based on ByteTrack, together forming a collection of various advanced online tracking strategies. All the aforementioned trackers have open-source and stable implementations, facilitating reproducible and fair comparisons. The experimental results are presented in <xref ref-type="table" rid="table-11">Table 11</xref>.</p>
<table-wrap id="table-11">
<label>Table 11</label>
<caption>
<title>Comparison of different detector and tracker combinations on the RoadDefect-MT dataset</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Detection head</th>
<th>Tracker</th>
<th>HOTA</th>
<th>MOTA</th>
<th>IDF1</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>VitSeg-Det</td>
<td>DeepSORT [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>60.5</td>
<td>75.5</td>
<td>73.5</td>
<td>35</td>
</tr>
<tr>
<td>VitSeg-Det</td>
<td>ByteTrack [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>64.5</td>
<td>79.5</td>
<td>78.5</td>
<td>39</td>
</tr>
<tr>
<td>VitSeg-Det</td>
<td>BoT-SORT [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>64.0</td>
<td>79.0</td>
<td>78.0</td>
<td>37</td>
</tr>
<tr>
<td>VitSeg-Det</td>
<td>OC-SORT [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>63.0</td>
<td>77.5</td>
<td>76.5</td>
<td>36</td>
</tr>
<tr>
<td>VitSeg-Det</td>
<td>TransTra-Count (Ours)</td>
<td>67.0</td>
<td>81.1</td>
<td>80.2</td>
<td>34</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>DeepSORT</td>
<td>59.0</td>
<td>74.0</td>
<td>72.0</td>
<td>38</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>ByteTrack</td>
<td>63.0</td>
<td>78.0</td>
<td>77.0</td>
<td>42</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>BoT-SORT</td>
<td>62.5</td>
<td>77.5</td>
<td>76.5</td>
<td>40</td>
</tr>
<tr>
<td>YOLOv8</td>
<td>OC-SORT</td>
<td>61.5</td>
<td>76.0</td>
<td>75.0</td>
<td>39</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A controlled-variable approach was adopted: two detectors, the proposed VitSeg-Det and the widely used YOLOv8, were fixed and paired with each of the tracking methods to evaluate the performance of different combinations in multi-object tracking tasks.</p>
<p>First, in terms of detector performance, the proposed VitSeg-Det demonstrates significant advantages: when paired with the same tracker, it consistently outperforms YOLOv8 on all core tracking metrics (HOTA, MOTA, IDF1). Taking the combination with ByteTrack as an example, VitSeg-Det yields a 1.5-point improvement in HOTA (64.5 vs. 63.0), MOTA (79.5 vs. 78.0), and IDF1 (78.5 vs. 77.0), fully demonstrating its higher accuracy and robustness as a detection module.</p>
<p>Second, in the horizontal comparison of trackers, we observed that for the same detector, ByteTrack and BoT-SORT delivered the best and closely matched performance, followed by OC-SORT, while the classic DeepSORT algorithm performed relatively weakly. This trend reflects the continuous progress of multi-object tracking in the robustness of data association. Notably, the proposed TransTra-Count tracker achieved the best performance among all combinations: paired with VitSeg-Det, it reached the highest tracking performance (HOTA: 67.0, MOTA: 81.1, IDF1: 80.2), significantly outperforming the other compared algorithms and strongly validating the effectiveness of TransTra-Count&#x2019;s design.</p>
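<p>For intuition, the low-score recovery idea that distinguishes ByteTrack can be reduced to a two-stage greedy IoU association. The sketch below is a deliberately simplified stand-in, not the TransTra-Count implementation, which additionally uses appearance features and long-term memory:</p>

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, dets, score_thr=0.5, iou_thr=0.3):
    """Two-stage greedy association in the spirit of ByteTrack: match
    confident detections first, then let the remaining tracks claim
    low-score detections instead of being dropped."""
    high = [d for d in dets if d["score"] >= score_thr]
    low = [d for d in dets if d["score"] < score_thr]
    matches, free = [], list(tracks)
    for pool in (high, low):          # stage 1: high scores; stage 2: low scores
        for det in pool:
            best = max(free, key=lambda t: iou(t["box"], det["box"]), default=None)
            if best is not None and iou(best["box"], det["box"]) >= iou_thr:
                matches.append((best["id"], det))
                free.remove(best)
    return matches

tracks = [{"id": 1, "box": (0, 0, 10, 10)}, {"id": 2, "box": (20, 0, 30, 10)}]
dets = [{"box": (1, 0, 11, 10), "score": 0.9},
        {"box": (21, 1, 31, 11), "score": 0.2}]   # low score, still recovered
print([tid for tid, _ in associate(tracks, dets)])  # → [1, 2]
```

<p>Without the second stage, the occluded or blurred detection with score 0.2 would be discarded and track 2 would be lost, which is exactly the failure mode the low-score association avoids.</p>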
<p>Finally, in terms of computational efficiency, combinations using YOLOv8 achieved the highest frame rates (ranging from 38 to 42 FPS), benefiting from the computational efficiency of its CNN architecture. Although the Transformer-based VitSeg-Det combinations resulted in slightly lower frame rates (ranging from 34 to 39 FPS), they offered a favorable trade-off between accuracy and efficiency due to their superior precision. It is particularly noteworthy that the proposed TransTra-Count method maintained a real-time processing speed of 34 FPS while achieving leading accuracy metrics, successfully balancing precision and efficiency.</p>
<p>In conclusion, the experimental results demonstrate that both VitSeg-Det as a detection module and TransTra-Count as a tracking module deliver outstanding performance. Their combination forms a powerful and efficient solution for road defect detection and tracking.</p>
<p>To comprehensively evaluate the overall performance of the proposed method, we conducted a comparative analysis from a more macroscopic method paradigm perspective. The experimental results are shown in <xref ref-type="table" rid="table-12">Table 12</xref>.</p>
<table-wrap id="table-12">
<label>Table 12</label>
<caption>
<title>Comprehensive performance comparison of different paradigm frameworks on the RoadDefect-MT</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Type</th>
<th>mAP</th>
<th>mIoU</th>
<th>HOTA</th>
<th>MOTA</th>
<th>IDF1</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>YOLOv8 &#x002B; ByteTrack</td>
<td>Decoupled</td>
<td>46.1</td>
<td></td>
<td>63.0</td>
<td>78.0</td>
<td>77.0</td>
<td>42</td>
</tr>
<tr>
<td>YOLOv11 &#x002B; ByteTrack</td>
<td>Decoupled</td>
<td>46.7</td>
<td></td>
<td>64.5</td>
<td>79.0</td>
<td>78.0</td>
<td>40</td>
</tr>
<tr>
<td>TrackFormer</td>
<td>End-to-End</td>
<td>45.5</td>
<td></td>
<td>65.1</td>
<td>79.2</td>
<td>78.5</td>
<td>18</td>
</tr>
<tr>
<td>VitSeg-Det &#x002B; TransTra-Count (Ours)</td>
<td>Jointly Optimized</td>
<td>47.9</td>
<td>45.2</td>
<td>67.0</td>
<td>81.1</td>
<td>80.2</td>
<td>34</td>
</tr>
<tr>
<td>Mask2Former</td>
<td>Segmentation</td>
<td></td>
<td>52.6</td>
<td></td>
<td></td>
<td></td>
<td>22</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The experimental results show that the proposed joint-optimization framework achieves the best performance among the detection-tracking frameworks across all key indicators. Specifically, it reaches an mAP of 47.9% in the detection task and an mIoU of 45.2% in the segmentation task, and significantly outperforms the other frameworks in multi-object tracking, with a HOTA of 67.0, MOTA of 81.1, and IDF1 of 80.2. This indicates that the joint-optimization mechanism effectively promotes collaboration and information complementarity between the detection and tracking tasks, thereby achieving global performance optimization.</p>
<p>By comparing different paradigms, it can be observed that the end-to-end paradigm (TrackFormer) outperforms the decoupled method in tracking metrics, demonstrating certain potential of end-to-end learning. However, its detection accuracy (mAP 45.5) is relatively low, and its inference speed is the slowest (18 FPS), reflecting the limitations of this paradigm in terms of efficiency and flexibility. The decoupled paradigm, leveraging its architectural advantages, performs best in terms of speed (40&#x2013;42 FPS) and has become a widely adopted efficient solution in practical applications. However, its accuracy ceiling still falls short of the joint optimization method.</p>
<p>The reference pure segmentation model Mask2Former achieved an mIoU of 52.6% in semantic segmentation tasks, but it lacks tracking capabilities and cannot output multi-object tracking metrics. Moreover, its inference speed (22 FPS) is lower than that of most detection and tracking models.</p>
</sec>
<sec id="s5_5">
<label>5.5</label>
<title>TransTra-Count Analysis</title>
<p>To verify the effectiveness of the Transformer-based pavement defect detection and counting model, we conducted comprehensive tests on multiple datasets, including: (1) a publicly available benchmark dataset enhanced with translation augmentation; and (2) our self-built RoadDefect-MT dataset of 3840 &#x00D7; 2160 @ 30 fps videos. During the experiments, we strictly maintained the original resolution of each dataset to ensure the authenticity of the test conditions. The experimental results are shown in <xref ref-type="fig" rid="fig-11">Figs. 11</xref> and <xref ref-type="fig" rid="fig-12">12</xref>.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Defect tracking count (sequence 1)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-11.tif"/>
</fig><fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Defect tracking count (sequence 2)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_70563-fig-12.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-11">Figs. 11</xref> and <xref ref-type="fig" rid="fig-12">12</xref>, our system achieves accurate defect detection and stable tracking on the aforementioned datasets. The experimental results demonstrate that our method performs well on test data with different resolutions and frame rates: it not only accurately detects various types of pavement defects but also achieves stable target tracking and precise counting. Notably, in the 4K high-definition video tests, despite the significant increase in input data volume, the model maintained stable detection accuracy and counting precision, preliminarily demonstrating the algorithm&#x2019;s adaptability to high-resolution scenarios.</p>
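<p>The non-repetitive counting behaviour follows directly from the tracking formulation: a defect is counted once per track identity, not once per frame. A minimal sketch (the <monospace>min_hits</monospace> confirmation rule is an illustrative simplification of the full life-cycle management):</p>

```python
from collections import defaultdict

def count_defects(frame_tracks, min_hits=3):
    """Count each defect once across the whole video: a track ID
    contributes to the count only after it has been observed in at
    least min_hits frames, which filters one-frame false positives."""
    hits = defaultdict(int)
    for tracks in frame_tracks:          # one list of track IDs per frame
        for tid in tracks:
            hits[tid] += 1
    return sum(1 for n in hits.values() if n >= min_hits)

# Defect 7 persists, defect 9 flickers for one frame, defect 3 appears late.
video = [[7], [7, 9], [7], [7, 3], [3], [3]]
print(count_defects(video))  # → 2
```

<p>Counting distinct confirmed identities is what prevents the same crack, seen in dozens of consecutive frames, from inflating the total.</p>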
<p>At the same time, we fully recognize a limitation of the current research: the self-collected dataset relied on for primary validation is insufficient in terms of scale (only 33 videos) and scene diversity (sourced from a single collection location). This may affect the model&#x2019;s generalization ability in broader and more complex real-world environments. Although supplementary experiments on public datasets such as DeepCrack and UVA-PDD further support the effectiveness of the method, we acknowledge that these datasets constructed from static images cannot fully replicate the challenges posed by real dynamic video scenarios. Therefore, the conclusions of this paper should be regarded as a strong validation of the effectiveness of the proposed TransTra-Count model within the scope of our existing datasets. These results provide valuable references and a foundation for practical engineering applications, but the universality of the model still needs further verification in the future through larger-scale and more diverse real-world video datasets.</p>
</sec>
<sec id="s5_6">
<label>5.6</label>
<title>Embedded Device Verification</title>
<p>To comprehensively evaluate the deployment feasibility and resource adaptability of the proposed method in actual industrial scenarios, this section focuses on the edge inference platform and systematically analyzes the performance of this method under lightweight deployment conditions.</p>
<p>The experiments were conducted on the embedded edge device Jetson AGX Orin (32 GB). Before deployment, FP16 precision optimization was performed with the TensorRT toolchain to match the acceleration schemes used in real embedded scenarios; the input image size was uniformly set to 1024 &#x00D7; 1024 and the batch size to 1 to simulate a continuous single-frame processing flow, ensuring the consistency and reproducibility of the evaluation.</p>
<p>The performance tests cover four key indicators: model size (MB), maximum frame rate (FPS), peak memory usage (MB), and single-frame latency (ms). As shown in <xref ref-type="table" rid="table-13">Table 13</xref>, the experimental results show that the proposed method, while integrating detection and segmentation, still maintains low resource usage and high inference efficiency. The maximum frame rate reaches 20 FPS, and the single-frame processing latency is 49.8 ms. This performance demonstrates the excellent engineering applicability of this method in high-frequency industrial defect detection and quality monitoring tasks, providing a feasible lightweight solution for actual industrial applications.</p>
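<p>The relation between the latency and frame-rate figures above (1000/49.8 &#x2248; 20 FPS) comes from a standard batch-1 timing loop of this shape; this is a generic sketch, with <monospace>fake_infer</monospace> standing in for the TensorRT engine call:</p>

```python
import time

def benchmark(infer, frame, warmup=10, iters=100):
    """Average single-frame latency (ms) and implied max FPS for a
    batch-1 inference callable; warm-up iterations are excluded so
    one-off initialization cost does not skew the average."""
    for _ in range(warmup):
        infer(frame)
    t0 = time.perf_counter()
    for _ in range(iters):
        infer(frame)
    latency_ms = (time.perf_counter() - t0) / iters * 1e3
    return latency_ms, 1e3 / latency_ms

# Stand-in model: ~5 ms of busy-waiting per "frame".
def fake_infer(frame):
    time.sleep(0.005)

lat, fps = benchmark(fake_infer, frame=None, warmup=2, iters=20)
print(lat > 4.5 and fps < 220)  # → True (~5 ms/frame caps FPS near 200)
```

<p>Peak memory would be read separately (e.g., from the device's memory counters) since it is a property of the whole run, not of a single timed frame.</p>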
<table-wrap id="table-13">
<label>Table 13</label>
<caption>
<title>Verification of embedded devices comparison</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Para (MB)</th>
<th>Max FPS</th>
<th>Memory peak (MB)</th>
<th>Time/Frame (ms)</th>
<th>Task count</th>
</tr>
</thead>
<tbody>
<tr>
<td>VitSeg-Det</td>
<td>91</td>
<td>20</td>
<td>1535</td>
<td>49.8</td>
<td>3</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion and Limitations Discussion</title>
<p>Focusing on the requirements of automatic road crack identification and structural parameter quantification, this research has constructed a multi-task integrated crack measurement system based on the Transformer architecture, aiming to overcome the repeated-counting errors, structural information loss, and unexplainable measurement outputs that existing methods face in complex dynamic scenes. The system realizes end-to-end closed-loop modeling from detection and segmentation to tracking and counting, and advances the crack recognition task from &#x201C;image-level classification&#x201D; to &#x201C;structured measurement&#x201D;. Its innovative contributions to visual perception and quantitative analysis are mainly reflected in the following four aspects:</p>
<p>First, to address the difficulty of segmenting microcracks against low-contrast backgrounds, the VitSeg-Det integrated detection and segmentation network is designed. By combining a channel-spatial attention fusion scoring mechanism with macro-scale dilated-convolution feature modeling, it accurately extracts the edge structure and global topology of cracks, effectively supporting the subsequent measurement of structural parameters.</p>
<p>Second, to address identity drift and repeated counting of the same crack object in dynamic video caused by illumination and viewpoint changes, the TransTra-Count module is proposed. Based on spatial-feature bimodal association and a long-term memory update mechanism, it maintains crack object identities and manages their life cycles. On many real videos, the system produces accurate, non-repetitive counting outputs and maintains measurement stability under interference such as occlusion and blur.</p>
<p>Third, for crack structure measurement, this paper proposes an unsupervised width estimation method based on the mask skeleton, and designs a smooth update mechanism and a width-change-rate constraint loss to ensure the temporal continuity and physical plausibility of the estimates.</p>
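<p>The geometric core of this width estimate can be sketched in a few lines: the Euclidean distance transform of the crack mask equals half the local width along the centreline. The sketch below (assuming NumPy/SciPy; the per-column ridge is a simplified stand-in for full skeletonization, and the temporal smoothing and width-change-rate loss are omitted) illustrates the idea on a synthetic 5-pixel-wide crack:</p>

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def estimate_mean_width(mask: np.ndarray) -> float:
    """Estimate mean crack width (in pixels) from a binary mask: the
    distance transform gives, for every crack pixel, the distance to
    the nearest background pixel, which is half the local width on the
    centreline."""
    edt = distance_transform_edt(mask.astype(bool))
    # Approximate the skeleton by the per-column ridge of the distance
    # map (adequate for roughly horizontal cracks; real pipelines would
    # use a full skeletonization).
    ridge = edt.max(axis=0)
    ridge = ridge[ridge > 0]
    if ridge.size == 0:
        return 0.0
    return float(np.mean(2.0 * ridge - 1.0))

# Synthetic crack: a 5-pixel-thick horizontal bar.
mask = np.zeros((32, 64), dtype=np.uint8)
mask[10:15, :] = 1
print(estimate_mean_width(mask))  # → 5.0
```

<p>In the video setting, the per-frame estimate would then be smoothed over time, which is the role of the smooth update mechanism and the width-change-rate constraint.</p>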
<p>Fourth, to provide more comprehensive experimental support for system verification and measurement evaluation, this paper constructs a self-built high-resolution dynamic crack dataset, RoadDefect-MT, which covers a variety of typical pavement distress patterns under different illumination and occlusion conditions and is supplemented with average-width labels and track IDs, filling the gaps in temporal consistency and measurement annotation in existing public datasets. The experimental results show that the proposed method significantly outperforms existing mainstream methods on this dataset, exhibiting stronger measurement stability, statistical accuracy, and engineering adaptability.</p>
<p>Overall, this paper has made several key breakthroughs in visual measurement modeling, dynamic target statistics, and structural parameter extraction, verifying the effectiveness and scalability of the Transformer architecture in multi-task visual measurement systems. However, several limitations remain to be explored:</p>
<p>Firstly, the current validation mainly focuses on road surface scenarios. The applicability of this method in vertical or elevated surfaces such as bridges and tunnels remains an open question. This is mainly due to the differences in crack morphology under different perspectives and the influence of gravity, as well as the challenges of shadow and perspective distortion in imaging of vertical surfaces. Secondly, existing research also has limitations such as a single target type and limited measurement granularity. The current system focuses on the measurement of linear cracks, and the generalization ability for other common disease types such as potholes and network cracks has not been verified. At the same time, the parameter measurement is still concentrated on macroscopic indicators such as average width, and the quantitative ability for three-dimensional attributes such as depth and volume needs to be expanded. Additionally, the system faces challenges in extreme environments such as heavy rain, severe stains, and intense vibrations during actual deployment. Although this method performs robustly under common interference conditions, its generalization ability in unseen extreme scenarios still needs to be verified through more extensive datasets.</p>
<p>In response to these limitations, future work will focus on the following aspects. First, extending the detection range from road surfaces to bridge facades, tunnel walls, and other vertical surfaces to verify and enhance the model&#x2019;s generalization across scenarios and viewpoints. Second, developing a universal measurement framework applicable to potholes, network cracks, and other structural diseases. Third, constructing a new generation of three-dimensional road and bridge disease datasets that integrate multi-view imaging, depth information, and calibration parameters to support fine-grained structural measurement; in parallel, exploring Transformer-based segment-level structural modeling to achieve differentiated measurement and evaluation of cracks and other diseases, providing more precise data support for road and bridge maintenance decisions. Finally, further research will address robustness enhancement that integrates physical priors and adaptive learning mechanisms to improve the system&#x2019;s performance stability under extreme conditions and promote the effective transfer of these results into practical engineering applications.</p>
</sec>
</body>
<back>
<ack>
<p>We would like to express our gratitude to all those who contributed to the completion of this research. Their insights, discussions, and support greatly enhanced the quality and depth of this work.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported in part by the Natural Science Foundation of Shaanxi Province of China under Grant 2024JC-YBQN-0695.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: methodology, Langyue Zhao; software, Langyue Zhao and Yubin Yuan; validation, Langyue Zhao, Yubin Yuan and Yiquan Wu; formal analysis, Langyue Zhao and Yubin Yuan; investigation, Langyue Zhao; resources, Langyue Zhao; data curation, Langyue Zhao; writing&#x2014;original draft preparation, Langyue Zhao and Yubin Yuan; writing&#x2014;review and editing, Langyue Zhao and Yubin Yuan; visualization, Langyue Zhao and Yubin Yuan; project administration, Langyue Zhao and Yubin Yuan. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Data sharing not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2017</year>;<volume>39</volume>(<issue>6</issue>):<fpage>1137</fpage>&#x2013;<lpage>49</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2016.2577031</pub-id>; <pub-id pub-id-type="pmid">27295650</pub-id></mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Redmon</surname> <given-names>J</given-names></string-name>, <string-name><surname>Farhadi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>YOLOv3: an incremental improvement</article-title>. <comment>arXiv:1804.02767</comment>. <year>2018</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Carion</surname> <given-names>N</given-names></string-name>, <string-name><surname>Massa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Synnaeve</surname> <given-names>G</given-names></string-name>, <string-name><surname>Usunier</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kirillov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zagoruyko</surname> <given-names>S</given-names></string-name></person-group>. <article-title>End-to-end object detection with transformers</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision; 2020 Aug 23&#x2013;28</conf-name>; <publisher-loc>Glasgow, UK</publisher-loc>. p. <fpage>213</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58452-8_13</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Pavement defect detection with deep learning: a comprehensive survey</article-title>. <source>IEEE Trans Intell Veh</source>. <year>2024</year>;<volume>9</volume>(<issue>3</issue>):<fpage>4292</fpage>&#x2013;<lpage>311</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tiv.2023.3326136</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>D</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Matthews</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Transformer-optimized generation, detection, and tracking network for images with drainage pipeline defects</article-title>. <source>Comput Aided Civil Eng</source>. <year>2023</year>;<volume>38</volume>(<issue>15</issue>):<fpage>2109</fpage>&#x2013;<lpage>27</lpage>. doi:<pub-id pub-id-type="doi">10.1111/mice.12970</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vasan</surname> <given-names>V</given-names></string-name>, <string-name><surname>Sridharan</surname> <given-names>NV</given-names></string-name>, <string-name><surname>Balasundaram</surname> <given-names>RJ</given-names></string-name>, <string-name><surname>Vaithiyanathan</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Ensemble-based deep learning model for welding defect detection and classification</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>136</volume>:<fpage>108961</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2024.108961</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Loney</surname> <given-names>J</given-names></string-name>, <string-name><surname>Visentin</surname> <given-names>A</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Road pavement health monitoring system using smartphone sensing with a two-stage machine learning model</article-title>. <source>Autom Constr</source>. <year>2024</year>;<volume>167</volume>:<fpage>105664</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2024.105664</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Bridge defect detection using small sample data with deep learning and hyperspectral imaging</article-title>. <source>Autom Constr</source>. <year>2025</year>;<volume>170</volume>:<fpage>105900</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2024.105900</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dung</surname> <given-names>CV</given-names></string-name>, <string-name><surname>Anh</surname> <given-names>LD</given-names></string-name></person-group>. <article-title>Autonomous concrete crack detection using deep fully convolutional neural network</article-title>. <source>Autom Constr</source>. <year>2019</year>;<volume>99</volume>(<issue>4</issue>):<fpage>52</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2018.11.028</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Automatic crack detection and measurement of concrete structure using convolutional encoder-decoder network</article-title>. <source>IEEE Access</source>. <year>2020</year>;<volume>8</volume>:<fpage>134602</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2020.3011106</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Long</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shelhamer</surname> <given-names>E</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Fully convolutional networks for semantic segmentation</article-title>. In: <conf-name>Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7&#x2013;12</conf-name>; <publisher-loc>Boston, MA, USA</publisher-loc>. p. <fpage>3431</fpage>&#x2013;<lpage>40</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2015.7298965</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>LC</given-names></string-name>, <string-name><surname>Papandreou</surname> <given-names>G</given-names></string-name>, <string-name><surname>Kokkinos</surname> <given-names>I</given-names></string-name>, <string-name><surname>Murphy</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yuille</surname> <given-names>AL</given-names></string-name></person-group>. <article-title>Semantic image segmentation with deep convolutional nets and fully connected CRFs</article-title>. <comment>arXiv:1412.7062</comment>. <year>2014</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ronneberger</surname> <given-names>O</given-names></string-name>, <string-name><surname>Fischer</surname> <given-names>P</given-names></string-name>, <string-name><surname>Brox</surname> <given-names>T</given-names></string-name></person-group>. <article-title>U-Net: convolutional networks for biomedical image segmentation</article-title>. In: <conf-name>Proceedings of the Medical Image Computing and Computer-Assisted Intervention&#x2014;MICCAI 2015; 2015 Oct 5&#x2013;9</conf-name>; <publisher-loc>Munich, Germany</publisher-loc>. p. <fpage>234</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-24574-4_28</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>B</given-names></string-name></person-group>. <article-title>DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2022</year>;<volume>23</volume>(<issue>10</issue>):<fpage>18392</fpage>&#x2013;<lpage>403</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2022.3158670</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Benipal</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Gopal</surname> <given-names>DL</given-names></string-name>, <string-name><surname>Cha</surname> <given-names>YJ</given-names></string-name></person-group>. <article-title>Hybrid pixel-level concrete crack segmentation and quantification across complex backgrounds using deep learning</article-title>. <source>Autom Constr</source>. <year>2020</year>;<volume>118</volume>(<issue>4</issue>):<fpage>103291</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2020.103291</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ali</surname> <given-names>R</given-names></string-name>, <string-name><surname>Cha</surname> <given-names>YJ</given-names></string-name></person-group>. <article-title>Attention-based generative adversarial network with internal damage segmentation using thermography</article-title>. <source>Autom Constr</source>. <year>2022</year>;<volume>141</volume>:<fpage>104412</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2022.104412</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kang</surname> <given-names>DH</given-names></string-name>, <string-name><surname>Cha</surname> <given-names>YJ</given-names></string-name></person-group>. <article-title>Efficient attention-based deep encoder and decoder for automatic crack segmentation</article-title>. <source>Struct Health Monit</source>. <year>2022</year>;<volume>21</volume>(<issue>5</issue>):<fpage>2190</fpage>&#x2013;<lpage>205</lpage>. doi:<pub-id pub-id-type="doi">10.1177/14759217211053776</pub-id>; <pub-id pub-id-type="pmid">36039173</pub-id></mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Gkioxari</surname> <given-names>G</given-names></string-name>, <string-name><surname>Doll&#x00E1;r</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Mask R-CNN</article-title>. In: <conf-name>Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22&#x2013;29</conf-name>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>2980</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>DR</given-names></string-name>, <string-name><surname>Chiu</surname> <given-names>WM</given-names></string-name></person-group>. <article-title>Road crack detection using Gaussian mixture model for diverse illumination images</article-title>. In: <conf-name>Proceedings of the 2020 30th International Telecommunication Networks and Applications Conference (ITNAC); 2020 Nov 25&#x2013;27</conf-name>; <publisher-loc>Melbourne, VIC, Australia</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ITNAC50341.2020.9315113</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>BF</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>GY</given-names></string-name>, <string-name><surname>Luan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>D</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Pavement distress detection based on faster R-CNN and morphological operations</article-title>. <source>China J Highw Transp</source>. <year>2021</year>;<volume>34</volume>(<issue>9</issue>):<fpage>181</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.19721/j.cnki.1001-7372.2021.09.015</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Anguelov</surname> <given-names>D</given-names></string-name>, <string-name><surname>Erhan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Szegedy</surname> <given-names>C</given-names></string-name>, <string-name><surname>Reed</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>CY</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>SSD: single shot MultiBox detector</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2016; 2016 Oct 11&#x2013;14</conf-name>; <publisher-loc>Amsterdam, The Netherlands</publisher-loc>. p. <fpage>21</fpage>&#x2013;<lpage>37</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-46448-0_2</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Automated asphalt highway pavement crack detection based on deformable single shot multi-box detector under a complex environment</article-title>. <source>IEEE Access</source>. <year>2021</year>;<volume>9</volume>:<fpage>150925</fpage>&#x2013;<lpage>38</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2021.3125703</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>N</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>H</given-names></string-name>, <string-name><surname>Han</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Vision image monitoring on transportation infrastructures: a lightweight transfer learning approach</article-title>. <source>IEEE Trans Intell Transport Syst</source>. <year>2023</year>;<volume>24</volume>(<issue>11</issue>):<fpage>12888</fpage>&#x2013;<lpage>99</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2022.3150536</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Road damage detection using UAV images based on multi-level attention mechanism</article-title>. <source>Autom Constr</source>. <year>2022</year>;<volume>144</volume>:<fpage>104613</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2022.104613</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>D</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Automatic detection and counting system for pavement cracks based on PCGAN and YOLO-MF</article-title>. <source>IEEE Trans Intell Transport Syst</source>. <year>2022</year>;<volume>23</volume>(<issue>11</issue>):<fpage>22166</fpage>&#x2013;<lpage>78</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2022.3161960</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers</article-title>. In: <conf-name>Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20&#x2013;25</conf-name>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>6877</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.00681</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Anandkumar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Alvarez</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>P</given-names></string-name></person-group>. <article-title>SegFormer: simple and efficient design for semantic segmentation with transformers</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2021</year>;<volume>34</volume>:<fpage>12077</fpage>&#x2013;<lpage>90</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Q</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Swin-Unet: Unet-like pure transformer for medical image segmentation</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2022; 2022 Oct 23&#x2013;27</conf-name>; <publisher-loc>Tel Aviv, Israel</publisher-loc>. p. <fpage>205</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-25066-8_9</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Su</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Automatic concrete crack segmentation model based on transformer</article-title>. <source>Autom Constr</source>. <year>2022</year>;<volume>139</volume>:<fpage>104275</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2022.104275</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network</article-title>. <source>Comput Aided Civ Infrastruct Eng</source>. <year>2023</year>;<volume>38</volume>(<issue>17</issue>):<fpage>2491</fpage>&#x2013;<lpage>510</lpage>. doi:<pub-id pub-id-type="doi">10.1111/mice.13003</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>F</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Pavement crack detection based on transformer network</article-title>. <source>Autom Constr</source>. <year>2023</year>;<volume>145</volume>:<fpage>104646</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2022.104646</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>BZ</given-names></string-name>, <string-name><surname>Khabsa</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Linformer: self-attention with linear complexity</article-title>. <comment>arXiv:2006.04768</comment>. <year>2020</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Su</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Attention-based convolutional neural network for pavement crack detection</article-title>. <source>Adv Mater Sci Eng</source>. <year>2021</year>;<volume>2021</volume>:<fpage>5520515</fpage>. doi:<pub-id pub-id-type="doi">10.1155/2021/5520515</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Crack-DETR: complex pavement crack detection by multifrequency feature extraction and fusion</article-title>. <source>IEEE Sens J</source>. <year>2025</year>;<volume>25</volume>(<issue>9</issue>):<fpage>16349</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/jsen.2025.3549121</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>R</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>L</given-names></string-name></person-group>. <article-title>CL-PSDD: contrastive learning for adaptive generalized pavement surface distress detection</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2025</year>;<volume>26</volume>(<issue>4</issue>):<fpage>5211</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2024.3525193</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>DeepCrack: learning hierarchical convolutional features for crack detection</article-title>. <source>IEEE Trans Image Process</source>. <year>2019</year>;<volume>28</volume>(<issue>3</issue>):<fpage>1498</fpage>&#x2013;<lpage>512</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2018.2878966</pub-id>; <pub-id pub-id-type="pmid">30387731</pub-id></mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>UAV-PDD2023: a benchmark dataset for pavement distress detection based on UAV images</article-title>. <source>Data Brief</source>. <year>2023</year>;<volume>51</volume>:<fpage>109692</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.dib.2023.109692</pub-id>; <pub-id pub-id-type="pmid">38020429</pub-id></mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>T</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Pavement distress detection using convolutional neural networks with images captured via UAV</article-title>. <source>Autom Constr</source>. <year>2022</year>;<volume>133</volume>:<fpage>103991</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.autcon.2021.103991</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pujara</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bhamare</surname> <given-names>M</given-names></string-name></person-group>. <article-title>DeepSORT: real time &#x0026; multi-object detection and tracking with YOLO and TensorFlow</article-title>. In: <conf-name>Proceedings of the 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS); 2022 Nov 24&#x2013;26</conf-name>; <publisher-loc>Trichy, India</publisher-loc>. p. <fpage>456</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICAISS55157.2022.10011018</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>P</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Weng</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Z</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>ByteTrack: multi-object tracking by associating every detection box</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2022; 2022 Oct 23&#x2013;27</conf-name>; <publisher-loc>Tel Aviv, Israel</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-20047-2_1</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Aharon</surname> <given-names>N</given-names></string-name>, <string-name><surname>Orfaig</surname> <given-names>R</given-names></string-name>, <string-name><surname>Bobrovsky</surname> <given-names>BZ</given-names></string-name></person-group>. <article-title>BoT-SORT: robust associations multi-pedestrian tracking</article-title>. <comment>arXiv:2206.14651</comment>. <year>2022</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Cao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Weng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Khirodkar</surname> <given-names>R</given-names></string-name>, <string-name><surname>Kitani</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Observation-centric SORT: rethinking SORT for robust multi-object tracking</article-title>. <comment>arXiv:2203.14360</comment>. <year>2022</year>.</mixed-citation></ref>
</ref-list>
</back></article>