<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="review-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">76652</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2026.076652</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Review</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>3D Single Object Tracking in Point Clouds: A Review</article-title>
<alt-title alt-title-type="left-running-head">3D Single Object Tracking in Point Clouds: A Review</alt-title>
<alt-title alt-title-type="right-running-head">3D Single Object Tracking in Point Clouds: A Review</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Kuang</surname><given-names>Yihao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Hong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Jiaqi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Jin</surname><given-names>Lingyu</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Huang</surname><given-names>Bo</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>huangbo0326@cqu.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>College of Optoelectronic Engineering, Chongqing University</institution>, <addr-line>Chongqing</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University</institution>, <addr-line>Chongqing</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Bo Huang. Email: <email>huangbo0326@cqu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>9</day><month>4</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>3</issue>
<elocation-id>4</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors. Published by Tech Science Press.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>The Authors</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_76652.pdf"></self-uri>
<abstract>
<p>3D single object tracking (SOT) based on point clouds is a fundamental task for environmental perception in autonomous driving and dynamic scene understanding in robotics. Recent technological advancements in this field have significantly bolstered the environmental interaction capabilities of intelligent systems. This field faces persistent challenges, including feature degradation induced by point cloud sparsity, representation drift caused by non-rigid deformation, and occlusion in complex scenarios. Traditional appearance matching methods, particularly those relying on Siamese networks, are severely constrained by point cloud characteristics, often failing under rapid motions or structural ambiguities among similar objects. In response, the research paradigm has progressively evolved toward motion-centric modeling approaches. These emerging frameworks utilize spatio-temporal joint modeling and geometric shape completion to attain notable performance gains. Furthermore, the incorporation of attention mechanisms and State Space Models (SSMs) has enabled more effective multi-scale spatio-temporal feature association, which is particularly beneficial for long-term tracking scenarios. To the best of our knowledge, this is the first comprehensive survey dedicated to 3D single object tracking in point clouds. We provide a detailed analysis of current tracking methods, scrutinizing their limitations regarding multi-object interference and analyzing the trade-off between accuracy and computational efficiency. Finally, we discuss potential future directions, including the development of lightweight models for edge deployment and the integration of cross-modal fusion strategies.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Point cloud</kwd>
<kwd>3D single object tracking</kwd>
<kwd>autonomous driving</kwd>
<kwd>sparsity</kwd>
<kwd>spatio-temporal feature</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62306049</award-id>
<award-id>92471207</award-id>
<award-id>W2421089</award-id>
</award-group>
<award-group id="awg2">
<funding-source>General Program of Chongqing Natural Science Foundation</funding-source>
<award-id>CSTB2023NSCQ-MSX0665</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Fundamental Research Funds for the Central Universities</funding-source>
<award-id>2024CDJXY008</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<sec id="s1_1">
<label>1.1</label>
<title>Motivation and Background</title>
<p>3D SOT, a core task in computer vision and autonomous driving, aims to continuously localize target objects across consecutive point cloud frames using their initial states. The proliferation of 3D sensing technologies (e.g., LiDAR) has established point clouds as a critical data source for object tracking, primarily due to their illumination invariance and inherent geometric information. However, inherent challenges including sparsity, spatial disorder, and non-uniform density distribution fundamentally complicate point cloud processing. Furthermore, object deformation during motion, dynamic occlusion, and interference from inter-class similarity present ongoing tracking challenges.</p>
<p>In practical scenarios, sparsity manifests acutely. For instance, long-distance targets (e.g., pedestrians 100 m away) may contain only 20&#x2013;50 points, making it difficult to extract stable features, and in rainy or foggy weather LiDAR point cloud density can drop by 60%, blurring target contours. Spatial disorder, meaning that point clouds have no fixed ordering, renders traditional 2D feature extraction methods that depend on grid structures inapplicable and necessitates specialized order-invariant architectures. Non-uniform density distribution further causes near-dense and far-sparse biases in feature extraction, affecting global localization accuracy. The remaining challenges have equally concrete failure modes: deformation leads to mismatches between templates and current frames; occlusion may cause target point clouds to disappear entirely; and inter-class similarity easily triggers mistracking in complex traffic flows.</p>
<p>Early approaches predominantly adapted 2D visual similarity matching paradigms, determining object positions through template-search region correspondences [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. However, 2D methods rely heavily on texture information, which is scarce in point clouds, leading to inherent mismatches with 3D point cloud characteristics. For example, 2D cross-correlation operations fail to handle point cloud disorder, resulting in unstable matching in sparse scenarios. Although conceptually straightforward, these methods were fundamentally limited by point cloud sparsity, exhibiting significant difficulty in discriminating geometrically similar distractors. Moreover, the reliance on handcrafted or computationally intensive bounding box estimation techniques&#x2014;such as Kalman filtering [<xref ref-type="bibr" rid="ref-3">3</xref>]&#x2014;introduces significant computational overhead. Traditional Kalman filtering-based methods often require <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>n</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> matrix operations per frame, which poses a major bottleneck for real-time deployment in resource-constrained environments.</p>
<p>With the rapid growth of point cloud data and advancements in computational resources, end-to-end learning frameworks have made significant strides by introducing collaborative voting schemes for object center candidate generation [<xref ref-type="bibr" rid="ref-4">4</xref>]. For instance, VoteNet [<xref ref-type="bibr" rid="ref-5">5</xref>] aggregates sparse points into object centers through point-wise voting, effectively addressing the sparse-to-dense localization bottleneck. This approach notably improves both localization accuracy and computational efficiency, thereby reinforcing the prevalence of matching paradigms. Subsequent refinements have further enhanced feature matching by employing voxel-based representations [<xref ref-type="bibr" rid="ref-6">6</xref>] and augmenting local geometric features [<xref ref-type="bibr" rid="ref-7">7</xref>]. Despite these advances, fundamental challenges remain, particularly in mitigating tracking drift caused by inherent point cloud sparsity. Even state-of-the-art methods exhibit a 30% higher drift rate in extremely sparse scenarios (<inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo>&#x2264;</mml:mo></mml:math></inline-formula>30 points) compared to dense scenarios.</p>
<p>To overcome the constraints of matching paradigms, motion-centric modeling has emerged as a promising alternative. These techniques explicitly model inter-frame object motion dynamics, shifting focus from appearance similarity to motion patterns&#x2014;often through techniques such as pose transformation estimation via foreground point segmentation [<xref ref-type="bibr" rid="ref-8">8</xref>]. This methodology exhibits superior robustness under occlusion and point cloud sparsity conditions. However, dependence on local motion cues from neighboring frames fundamentally limits long-term motion pattern capture. Moreover, segmentation inaccuracies propagate pose estimation errors, leading to cumulative drift. Recent breakthroughs have begun to mitigate these issues by integrating multimodal data fusion [<xref ref-type="bibr" rid="ref-9">9</xref>], temporal context modeling [<xref ref-type="bibr" rid="ref-10">10</xref>], and Transformer architectures [<xref ref-type="bibr" rid="ref-11">11</xref>], which significantly enhance feature extraction and correspondence accuracy. Simultaneously, progress in point cloud representation learning and lightweight network design enables novel accuracy-efficiency tradeoffs [<xref ref-type="bibr" rid="ref-12">12</xref>].</p>
<p>To establish a rigorous research boundary, we adopt the formal problem definition proposed by Hoda&#x0148; et al. [<xref ref-type="bibr" rid="ref-13">13</xref>]. Mathematically, the 3D SOT task aims to estimate the optimal bounding box <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> of a target in the current frame <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>P</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>, given the initial state <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>B</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> and a sequence of historical observations <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mrow><mml:mi>&#x1D4AB;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>:</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, by maximizing the posterior probability <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. Within this theoretical framework, this review explicitly restricts its scope to &#x2018;complex dynamic scenarios&#x2019; in autonomous driving, which are characterized by high sparsity (objects with fewer than 50 points), severe occlusion, and non-rigid deformation. 
Consequently, we focus exclusively on tracking algorithms designed for unstructured and sparse LiDAR point clouds, extending the discussion from foundational Siamese-based frameworks to emerging Motion-Centric and SSM approaches.</p>
<p>Despite the rapid growth of this field, a systematic survey dedicated to 3D SOT is notably absent. Existing reviews primarily focus on 3D object detection or 2D tracking, often overlooking the specific challenges of temporal coherence in sparse point clouds. Previous surveys on multi-object tracking also differ significantly in problem formulation. Therefore, a comprehensive analysis of 3D SOT algorithms is urgently needed.</p>
</sec>
<sec id="s1_2">
<label>1.2</label>
<title>Literature Search and Selection Criteria</title>
<p>To ensure a rigorous and objective consolidation, we implemented a systematic literature search and filtering methodology. The retrieval was conducted across major academic databases, including IEEE Xplore, Web of Science, Google Scholar, and the Computer Vision Foundation Open Access, covering the period from January 2019 to January 2026. The search strategy utilized Boolean logic with keyword combinations such as &#x201C;Point Cloud,&#x201D; &#x201C;LiDAR,&#x201D; &#x201C;3D,&#x201D; &#x201C;Single Object Tracking,&#x201D; and &#x201C;Deep Learning.&#x201D; This initial phase yielded over 300 candidate entries.</p>
<p>To isolate the most significant contributions, we applied a strict multi-stage screening process. First, the scope was restricted exclusively to 3D SOT, excluding studies focused solely on 2D tracking or Multi-Object Tracking unless they offered specific contributions to the SOT domain. Second, priority was given to peer-reviewed articles published in high-impact journals (e.g., IEEE TPAMI, TITS) and top-tier conferences (e.g., CVPR, ICCV, ECCV, AAAI), with a requirement that the methods be validated on standard benchmarks. Finally, to capture the rapid evolution of emerging architectures like Transformers and SSMs, we emphasized timeliness, with approximately 85% of the selected studies published between 2021 and 2026. Based on these criteria, 77 key publications were ultimately selected for in-depth analysis in this review.</p>
</sec>
<sec id="s1_3">
<label>1.3</label>
<title>Organization of the Paper</title>
<p>This survey provides a comprehensive and systematic analysis of the technological evolution in point cloud object tracking. Over the past few years, the field has transitioned from early appearance-based models to more advanced architectures that incorporate motion reasoning and spatio-temporal context modeling. To ground the discussion, a representative pipeline commonly adopted in modern 3D tracking systems is illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. It highlights the classical Siamese network structure, which forms the basis for many current methods, along with its typical extensions in feature extraction, similarity matching, and template updating.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>General Siamese network framework for 3D point cloud object tracking.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-1.tif"/>
</fig>
<p>Corresponding to the general Siamese-based tracking framework, this paper is organized as follows: (1) Backbone Networks: This section reviews the development of feature extraction architectures, from early hand-engineered frameworks and voxel-based methods to modern Transformer and Mamba-based models, analyzing their capabilities in capturing geometric structures and spatio-temporal relationships. (2) Decision Mechanisms: We categorize and compare three core paradigms (similarity matching, motion modeling, and attention mechanisms), elucidating their respective strengths in handling appearance variations, motion dynamics, and long-range dependencies. (3) Model Update Strategies: This part focuses on dynamic template optimization, including shape compensation, temporal memory management, and adaptive parameter adjustment, exploring how these strategies mitigate tracking drift under occlusion and sparsity. (4) Experiments: A detailed analysis of mainstream metrics and datasets (KITTI [<xref ref-type="bibr" rid="ref-14">14</xref>], nuScenes [<xref ref-type="bibr" rid="ref-15">15</xref>]) is provided to establish a standardized performance comparison framework. (5) Discussion: This section synthesizes the technological evolution of tracking paradigms and analyzes persistent bottlenecks. (6) Conclusion: This section summarizes the systematic review of 3D single object tracking and outlines future directions.</p>
<p>To the best of our knowledge, this is the first comprehensive survey dedicated to point cloud object tracking. Its key contributions are fourfold: (1) Systematic Taxonomy: We establish a unified classification of tracking paradigms, clarifying the evolutionary logic from 2D adaptation to 3D-specific innovations. (2) In-Depth Technical Analysis: By dissecting core modules (feature extraction, decision-making, updates), we reveal the theoretical limitations of each method and their performance bottlenecks under real-world constraints. (3) Benchmark Synthesis: A cross-dataset comparison of state-of-the-art methods is conducted to quantify the effectiveness of different strategies under varying scenarios. (4) Research Outlook: By synthesizing theoretical advances and empirical results, this survey aims to serve as a foundational reference for researchers and engineers in the field, facilitating the development of next-generation point cloud tracking technologies. As a review, this paper does not propose a novel tracking algorithm; it provides a taxonomy and performance analysis of existing methods and consolidates recent developments to guide future research.</p>
</sec>
</sec>
<sec id="s2">
<label>2</label>
<title>Backbone</title>
<p>Point clouds are a fundamental representation of 3D geometric data, and processing them demands efficient backbone networks, because the ability to extract discriminative features directly determines the performance of downstream tasks such as object tracking, segmentation, and reconstruction. Unlike structured 2D images, point clouds are inherently unordered, sparse, and irregularly distributed, posing unique challenges for feature learning, including maintaining permutation invariance, capturing local geometric patterns, and balancing computational efficiency with representation capacity [<xref ref-type="bibr" rid="ref-16">16</xref>]. The evolution of 3D point cloud backbones has thus been driven by addressing these challenges, with distinct stages reflecting advancements in modeling paradigms, as visualized in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The evolution process of the backbone.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-2.tif"/>
</fig>
<sec id="s2_1">
<label>2.1</label>
<title>Regular Grid</title>
<p>Early approaches addressed the unstructured nature of point clouds by leveraging voxelization, a process that maps 3D points to regular grid spaces [<xref ref-type="bibr" rid="ref-17">17</xref>]. This enabled the adaptation of 2D CNN architectures for spatio-temporal motion modeling, offering a pragmatic solution to data disorder [<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>]. Nevertheless, voxelization inherently introduced quantization artifacts, degrading geometric fidelity and compromising the representation of fine-grained motion trajectories&#x2014;limitations that motivated subsequent innovations.</p>
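<p>The quantization effect described above can be illustrated with a minimal sketch, assuming a uniform cubic grid and a centroid-per-voxel summary. The function names <monospace>voxelize</monospace> and <monospace>voxel_centroids</monospace> and the voxel size of 0.5 are illustrative assumptions and are not taken from any cited method.</p>
<code language="python"><![CDATA[
```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel that contains them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def voxel_centroids(voxels):
    """Summarise each occupied voxel by the mean of its points."""
    return {
        key: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for key, pts in voxels.items()
    }


# Illustrative input: the first two points fall into the same voxel.
points = [(0.10, 0.10, 0.10), (0.30, 0.20, 0.40), (1.20, 0.10, 0.10)]
grid = voxelize(points, voxel_size=0.5)
centroids = voxel_centroids(grid)
```
]]></code>
<p>In this toy input, two distinct points share one voxel and are replaced by a single centroid, which is the loss of fine-grained geometry that the paragraph above refers to.</p>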
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Direct Point Processing</title>
<p>The transition to direct point processing represented a paradigm shift, operating on raw coordinates without an intermediate grid. PointNet [<xref ref-type="bibr" rid="ref-20">20</xref>] pioneered this approach by applying a shared multi-layer perceptron (MLP) to every point and aggregating the results with a symmetric max-pooling function, which yields permutation invariance, though it failed to capture local structures. PointNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-21">21</xref>] addressed this by introducing a hierarchical framework with sampling and grouping, establishing a standard for local feature extraction. However, its recursive sampling creates computational bottlenecks.</p>
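<p>The permutation invariance of a PointNet-style encoder can be checked with a short sketch. It assumes a single randomly initialised linear layer with ReLU as a stand-in for a trained shared MLP, followed by channel-wise max pooling; the names <monospace>shared_mlp</monospace> and <monospace>global_feature</monospace> are hypothetical and the code is not the reference implementation.</p>
<code language="python"><![CDATA[
```python
import random


def shared_mlp(point, weights, bias):
    """Apply the same linear layer + ReLU to one point (x, y, z)."""
    return [
        max(0.0, sum(w * c for w, c in zip(row, point)) + b)
        for row, b in zip(weights, bias)
    ]


def global_feature(points, weights, bias):
    """Max-pool per-point features channel-wise; max does not depend on order."""
    per_point = [shared_mlp(p, weights, bias) for p in points]
    return [max(channel) for channel in zip(*per_point)]


rng = random.Random(0)
# Assumed toy layer mapping 3 input coordinates to 4 feature channels.
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
bias = [0.0] * 4

cloud = [(0.0, 1.0, 2.0), (1.0, 0.5, -1.0), (-2.0, 0.3, 0.7)]
shuffled = cloud[:]
rng.shuffle(shuffled)

feat_a = global_feature(cloud, weights, bias)
feat_b = global_feature(shuffled, weights, bias)
```
]]></code>
<p>Because every point passes through the same function and the pooling is symmetric, <monospace>feat_a</monospace> and <monospace>feat_b</monospace> are identical for any ordering of the input. The same pooling discards neighbourhood structure, which is the limitation that hierarchical grouping was designed to address.</p>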
<p>Subsequent works focused on refining local feature aggregation. PointCNN [<xref ref-type="bibr" rid="ref-22">22</xref>] introduced X-transformation to enable convolution on unordered points. To enhance geometric reasoning, Relation-Shape CNN [<xref ref-type="bibr" rid="ref-23">23</xref>] and PointWeb [<xref ref-type="bibr" rid="ref-24">24</xref>] modeled inter-point relations and contextual interactions. FoldingNet [<xref ref-type="bibr" rid="ref-25">25</xref>] extended this line of work by introducing folding operations to generate point clouds from latent spaces, achieving state-of-the-art (SOTA) performance in shape completion tasks. Meanwhile, PointConv [<xref ref-type="bibr" rid="ref-26">26</xref>] and PAConv [<xref ref-type="bibr" rid="ref-27">27</xref>] improved computational efficiency by learning dynamic weight functions for irregular point densities.</p>
<p>In summary, point-based backbones process raw coordinates directly, preserving fine-grained geometric details essential for precise localization. However, their reliance on iterative sampling and grouping often creates a computational bottleneck.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Graph Structure</title>
<p>Graph neural networks are a natural fit for modeling point cloud topology, given their capacity to capture pairwise relationships. DGCNN [<xref ref-type="bibr" rid="ref-28">28</xref>] pioneered this direction with dynamic graph convolution, constructing k-NN graphs in feature space and learning edge representations via EdgeConv [<xref ref-type="bibr" rid="ref-29">29</xref>]. While this significantly enhanced local geometric modeling, dynamic graph updates resulted in quadratic computational complexity as point counts increased. Subsequent works mitigated this issue: SpiderCNN [<xref ref-type="bibr" rid="ref-30">30</xref>] designed adaptive neighborhood expansion strategies, and Point-GNN [<xref ref-type="bibr" rid="ref-31">31</xref>] refined graph-based message passing to better capture structural dependencies.</p>
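<p>A minimal sketch of the graph construction step is given below, assuming a brute-force k-NN search and the common edge input formed by concatenating a centre feature with the neighbour offset. The learned edge function and the aggregation step are omitted, and the names <monospace>knn_graph</monospace> and <monospace>edge_features</monospace> are illustrative.</p>
<code language="python"><![CDATA[
```python
from math import dist


def knn_graph(features, k):
    """For each point, return the indices of its k nearest neighbours (self excluded)."""
    graph = []
    for i, fi in enumerate(features):
        order = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: dist(fi, features[j]),
        )
        graph.append(order[:k])
    return graph


def edge_features(features, graph):
    """Build edge inputs: centre feature concatenated with (neighbour - centre)."""
    return [
        [tuple(fi) + tuple(b - a for a, b in zip(fi, features[j])) for j in nbrs]
        for fi, nbrs in zip(features, graph)
    ]


# Toy features; in a dynamic graph these would be recomputed at every layer.
features = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
graph = knn_graph(features, k=2)
edges = edge_features(features, graph)
```
]]></code>
<p>The brute-force search evaluates every pair of points, so its cost grows quadratically with the number of points, and rebuilding the graph in feature space at each layer repeats that cost.</p>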
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Attention Mechanism</title>
<p>Attention-based architectures revolutionized point cloud processing by enabling adaptive feature weighting. Point-BERT [<xref ref-type="bibr" rid="ref-32">32</xref>] introduced a novel pretraining paradigm through Masked Point Modeling.</p>
<p>Transformers [<xref ref-type="bibr" rid="ref-33">33</xref>], with their capacity to model global context, were adapted to point clouds by PCT [<xref ref-type="bibr" rid="ref-34">34</xref>], the first work to integrate standard Transformer architectures. To handle the unordered nature of point sets, PCT [<xref ref-type="bibr" rid="ref-34">34</xref>] introduced coordinate embedding (injecting 3D positional information via encoding) and offset attention (dynamically adjusting geometric offsets in attention calculations), thereby strengthening the perception of spatial relationships. Swin3D [<xref ref-type="bibr" rid="ref-35">35</xref>] optimized Transformer efficiency for 3D scenes by adopting window-based attention (inspired by 2D vision): it partitions point clouds into spatial windows, computes self-attention within windows, and uses window shifting for cross-window interaction, balancing local geometric accuracy and computational cost. Point Transformer [<xref ref-type="bibr" rid="ref-36">36</xref>] further reduced complexity by restricting attention to overlapping windows, thereby striking a finer balance between performance and efficiency.</p>
</sec>
<sec id="s2_5">
<label>2.5</label>
<title>Sequence Modeling</title>
<p>Sequence-based methods transformed unordered point clouds into ordered structures, thereby facilitating spatio-temporal modeling. Point2Sequence [<xref ref-type="bibr" rid="ref-37">37</xref>] pioneered this paradigm by converting points into sequences to enhance the modeling of dynamic scene continuity.</p>
<p>Recent advances have leveraged state-space models: PointMamba [<xref ref-type="bibr" rid="ref-38">38</xref>] introduced Mamba architectures to point cloud processing by using Z-order curve partitioning to map unordered points to spatially localized sequences. Combined with bidirectional selective scanning, it maintained spatio-temporal modeling capabilities while reducing GPU memory consumption by over 60%. Its MAE-style pretraining further enhanced the representation of geometric structures. SMamba [<xref ref-type="bibr" rid="ref-39">39</xref>] optimized this framework for sparse point clouds by employing adaptive sparsity-aware mechanisms to boost the efficiency of long-range feature modeling. Additionally, submanifold sparse convolutions mitigated the &#x201C;near-dense, far-sparse&#x201D; bias, which is critical for long-range tracking tasks.</p>
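<p>Serialization along a Z-order curve can be sketched as follows, assuming uniform quantization to a cubic grid and 10 bits per axis. This illustrates the general space-filling-curve idea only; the exact serialization used by any cited model may differ, and the names <monospace>morton_code</monospace> and <monospace>z_order_sort</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integer grid coordinates."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order_sort(points, cell_size, bits=10):
    """Quantise points to a grid and sort them along the Z-order curve."""
    mins = [min(p[d] for p in points) for d in range(3)]

    def key(p):
        ix, iy, iz = (int((p[d] - mins[d]) / cell_size) for d in range(3))
        return morton_code(ix, iy, iz, bits)

    return sorted(points, key=key)


cloud = [(3.0, 3.0, 3.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
sequence = z_order_sort(cloud, cell_size=1.0)
```
]]></code>
<p>Points that are close in space tend to receive nearby codes, so the resulting 1D sequence largely preserves spatial locality, which is what allows a sequence model to scan an unordered point set.</p>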
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Decision Mechanisms</title>
<p>Driven by breakthroughs in deep learning, the decision paradigms of 3D SOT have gradually evolved from traditional similarity matching to temporal attention, thus giving rise to diverse technical pathways. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates the classification of decision mechanisms. This section focuses on the decision mechanisms of tracking algorithms, systematically analyzing mainstream methods from the perspectives of similarity matching, motion modeling, and Transformer architectures. Through experimental data, it reveals the advantages and limitations of each paradigm.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Classification of decision mechanisms.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-3.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Similarity Matching</title>
<p>In the field of 3D object tracking, decision mechanisms based on Similarity Matching (often referred to in the literature as Siamese-based tracking or Appearance-based tracking) have long occupied a central position. Fundamentally, these methods achieve localization by modeling the feature similarity between the template object and the search region. Such approaches typically adopt a Siamese network architecture: weight-sharing feature extraction branches process the template point cloud and search region point cloud, respectively, with efficient similarity calculation modules designed to achieve object matching [<xref ref-type="bibr" rid="ref-40">40</xref>].</p>
<p>As a foundational work in this direction, SC3D [<xref ref-type="bibr" rid="ref-41">41</xref>] was the first to introduce Siamese networks into the field of point cloud tracking. The method generates candidate object proposals via Kalman filtering [<xref ref-type="bibr" rid="ref-3">3</xref>], uses an encoder to map the template and candidate point clouds to a compact latent space, and then performs object matching via cosine similarity. However, its reliance on iterative optimization for proposal generation made end-to-end training difficult and introduced significant bottlenecks in computational efficiency.</p>
<sec id="s3_1_1">
<label>3.1.1</label>
<title>End-to-End Proposal Generation</title>
<p>P2B [<xref ref-type="bibr" rid="ref-42">42</xref>] was the first to propose an end-to-end tracking framework, abandoning traditional proposal generation strategies. Its core innovation lies in a two-stage mechanism of object center localization <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mo stretchy="false">&#x2192;</mml:mo></mml:math></inline-formula> proposal generation: first, it localizes potential object centers in the search region via a Hough voting mechanism [<xref ref-type="bibr" rid="ref-5">5</xref>]; second, it generates object proposals via neighborhood point feature aggregation. To enhance the discriminative capability between objects and backgrounds, it designed an object-specific feature enhancement module. By performing point-wise cosine similarity calculations between template seed points and search region points, this module embeds template semantic information into the feature space of the search region, effectively enhancing object recognition capability in complex scenarios. Subsequent research, such as BAT [<xref ref-type="bibr" rid="ref-43">43</xref>], further advanced geometric prior modeling by proposing the BoxCloud representation to explicitly encode geometric relationships between points and object bounding boxes (e.g., point-to-corner distances). It also guided feature interaction between templates and search regions via a box-aware feature fusion module, significantly improving the perceptual accuracy of local object structures.</p>
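<p>The point-wise similarity step described above can be sketched in simplified form. The sketch assumes that each search point is augmented only with its best cosine similarity to any template seed, which is a deliberate simplification and not the feature enhancement module of any cited tracker; the names <monospace>similarity_map</monospace> and <monospace>augment_search_features</monospace> are illustrative.</p>
<code language="python"><![CDATA[
```python
from math import sqrt


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors, stabilised by eps."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def similarity_map(template_feats, search_feats):
    """Rows index search points; columns index template seeds."""
    return [[cosine(s, t) for t in template_feats] for s in search_feats]


def augment_search_features(template_feats, search_feats):
    """Append each search point's best template similarity to its feature vector."""
    sim = similarity_map(template_feats, search_feats)
    return [list(s) + [max(row)] for s, row in zip(search_feats, sim)]


# Toy 2D features standing in for learned point features.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[2.0, 0.0], [-1.0, 0.0], [1.0, 1.0]]
augmented = augment_search_features(template, search)
```
]]></code>
<p>Search points whose features align with a template seed receive a score near 1, while dissimilar points receive a lower score, giving later stages such as voting a target-aware signal to work with.</p>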
</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Spatio-Temporal and Multimodal Fusion</title>
<p>To tackle point cloud sparsity and object appearance variations, CAT [<xref ref-type="bibr" rid="ref-10">10</xref>] proposed a spatio-temporal context learning framework. On the spatial dimension, it dynamically expands the template range based on point cloud density via an adaptive template generation (ATG) strategy, which incorporates environmental information around the target into feature modeling. On the temporal dimension, a cross-frame aggregation (CFA) module is designed to retrieve relevant template features from historical frames via feature similarity, which fuses temporal information to enhance the robustness of object representations. MMF-Track [<xref ref-type="bibr" rid="ref-44">44</xref>] constructed a cross-modal interaction framework for geometric and visual texture features by converting RGB image texture information into pseudo-point features aligned with the point cloud space via a spatial alignment module. This method first back-projects image pixels to 3D space using intrinsic and extrinsic parameters of depth cameras, generating pseudo-point clouds with color information. It then achieves multi-scale fusion of texture and geometric features via a feature interaction module, and finally jointly optimizes bimodal similarity metrics in a similarity fusion module. GLT-T [<xref ref-type="bibr" rid="ref-45">45</xref>] achieves region-focused similarity calculation by designing a centricity loss function to weight and select seed points. This method first identifies key seed points of the target via density peak clustering, and then devises a centricity metric based on spatio-temporal motion consistency of point clouds, assigning higher weights to points in the target&#x2019;s core region with stable motion patterns. This dynamic weighting strategy enables similarity calculation to concentrate on regions with structural stability and motion coherence, thereby significantly enhancing feature matching accuracy in occluded scenarios.</p>
</sec>
<sec id="s3_1_3">
<label>3.1.3</label>
<title>Efficiency-Oriented Architectures</title>
<p>3D-SiamRPN [<xref ref-type="bibr" rid="ref-46">46</xref>], inspired by the region proposal network (RPN) in 2D tracking [<xref ref-type="bibr" rid="ref-47">47</xref>], pioneered real-time point cloud tracking. This model devised two cross-correlation modules (point cloud-level and point-level) to efficiently integrate template and search region features and introduced a bin-based regression loss to enhance localization accuracy, thereby maintaining high precision while satisfying real-time requirements. V2B [<xref ref-type="bibr" rid="ref-48">48</xref>] introduced a voxel-to-bird&#x2019;s-eye-view (BEV) conversion framework tailored for sparse scenarios. It generates dense BEV feature maps via voxelization and <italic>z</italic>-axis max pooling, regresses object centers in 2D space to circumvent proposal quality issues induced by sparse point clouds, and further bolsters feature discriminability by generating dense target point clouds through a shape-aware learning module. GPT [<xref ref-type="bibr" rid="ref-49">49</xref>] devised a bipartite graph model that uses edge convolution to propagate geometric shape relationships between template points and search points. By integrating Hough voting to generate object center cues, it runs in real time at 38 FPS while sustaining high tracking accuracy.</p>
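<p>The voxelization and <italic>z</italic>-axis max pooling used to obtain a BEV map can be sketched in a few lines of Python. The sketch assumes one scalar feature per point and a uniform voxel size; actual trackers pool learned multi-channel voxel features, and the function names <monospace>voxelize</monospace> and <monospace>bev_max_pool</monospace> are hypothetical.</p>
<code language="python">
```python
def voxelize(points, feats, voxel_size, origin):
    """Assign points to integer voxel indices, keeping the max feature per voxel."""
    voxels = {}
    for (x, y, z), f in zip(points, feats):
        idx = (
            int((x - origin[0]) // voxel_size),
            int((y - origin[1]) // voxel_size),
            int((z - origin[2]) // voxel_size),
        )
        voxels[idx] = max(voxels.get(idx, f), f)
    return voxels


def bev_max_pool(voxels, nx, ny):
    """Collapse the z-axis by max pooling; empty BEV cells stay at 0.0."""
    bev = [[0.0] * ny for _ in range(nx)]
    filled = set()
    for (ix, iy, _iz), f in voxels.items():
        if ix in range(nx) and iy in range(ny):
            bev[ix][iy] = max(bev[ix][iy], f) if (ix, iy) in filled else f
            filled.add((ix, iy))
    return bev
```
</code>
<p>Points stacked at different heights over the same ground cell collapse into one BEV cell, which is why the resulting map is denser than the original voxel grid.</p>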
<p>From early exploratory efforts reliant on iterative optimization to breakthroughs in end-to-end architectures, and from single geometric feature modeling to the deep integration of spatio-temporal context and multimodal information, similarity matching methodologies have continuously evolved in tackling key challenges such as point cloud sparsity, object deformation, and real-time performance constraints. These investigations have not only propelled advancements in tracking accuracy but have also, through technological innovations such as geometric prior embedding and feature space alignment, furnished diverse solutions for 3D object tracking in complex dynamic scenarios.</p>
</sec>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Motion Modeling</title>
<p>In contrast to appearance-based matching strategies, motion modeling effectively mitigates the adverse effects of point cloud sparsity, occlusion, and sensor noise on tracking robustness by explicitly characterizing the motion dynamics of objects in spatio-temporal sequences, thus exhibiting unique advantages in complex dynamic scenarios.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Rigid Motion Regression</title>
<p><inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msup><mml:mi>M</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>] first constructed a tracking framework centered on relative motion regression. Free from the limitations of traditional appearance matching, this method reframes the tracking task as a temporal prediction problem of object motion parameters: first, it separates the target point set from point cloud sequences via a spatio-temporal joint segmentation algorithm and predicts the coarse-grained positional transformation of the target under the rigid motion assumption; then, it introduces a motion-aided shape completion module to generate a more complete 3D shape representation by fusing target point cloud information from adjacent frames, thereby refining bounding box estimation.</p>
<p>MTM-Tracker [<xref ref-type="bibr" rid="ref-50">50</xref>] proposed a motion-to-matching hybrid paradigm, aimed at balancing the stability of motion modeling and the flexibility of appearance matching. The method first constructs a motion dynamics model using historical frame bounding box sequences, predicts the coarse-grained position range of the target via Kalman filtering [<xref ref-type="bibr" rid="ref-3">3</xref>] to form prior constraints on the search region; then, it performs feature matching within the constrained region, achieving fine-grained object localization through joint similarity metrics of point cloud normal vectors and spatial coordinates. BEVTrack [<xref ref-type="bibr" rid="ref-51">51</xref>] projected 3D tracking tasks into the BEV space, thereby simplifying the tracking process through 2D motion regression. It first performs voxelization on the input point cloud, generates high-density BEV feature maps via max pooling along the <italic>z</italic>-axis, and then designs a motion regression head in the 2D space to directly predict the displacement vector and size variation of the target center.</p>
<p>DMT [<xref ref-type="bibr" rid="ref-52">52</xref>] presented a two-stage processing framework: &#x201C;initial localization via motion prediction and refinement via MLP&#x201D;. Specifically, in the motion prediction stage, it leverages historical bounding box sequences to model spatio-temporal motion patterns of objects under rigid body assumptions, generating coarse-grained spatial priors to constrain the search region&#x2014;thereby reducing the computational overhead of subsequent feature matching by approximately 40% compared to global search strategies. In the MLP refinement stage, the algorithm integrates local geometric descriptors of point clouds with motion cues from the prediction stage, iteratively refining localization errors through a lightweight feedforward network. This hierarchical optimization strategy not only maintains real-time performance (achieving 107 FPS on edge devices) but also enhances localization precision by 6.2% in sparse point cloud scenarios (with <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mo>&#x2264;</mml:mo></mml:math></inline-formula>50 points per target) compared to single-stage methods, thereby achieving a balance between tracking efficiency and localization accuracy in dynamic traffic environments.</p>
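<p>The idea of using historical boxes to produce a coarse spatial prior can be sketched as follows. The sketch assumes a constant-velocity extrapolation of the box center and a circular search region in the ground plane; both are illustrative assumptions rather than the motion model of any specific tracker, and the function names are hypothetical.</p>
<code language="python">
```python
import math


def predict_next_center(history):
    """Extrapolate the next box center from the last two centers (constant velocity)."""
    if not history:
        raise ValueError("history must contain at least one center")
    if len(history) == 1:
        return tuple(history[-1])
    prev, last = history[-2], history[-1]
    return tuple(2.0 * b - a for a, b in zip(prev, last))


def crop_search_region(points, center, radius):
    """Keep points whose ground-plane distance to the predicted center is within radius."""
    return [
        p for p in points
        if radius >= math.hypot(p[0] - center[0], p[1] - center[1])
    ]
```
</code>
<p>Restricting later processing to the cropped points is what reduces the matching cost relative to a global search.</p>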
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Part-Level and Hybrid Modeling</title>
<p>P2P [<xref ref-type="bibr" rid="ref-53">53</xref>] proposed a part-level motion modeling framework that explicitly models the spatial transformation relationships of object components between adjacent frames through dual-branch feature representation in the point domain and voxel domain. Specifically, it achieves cross-frame alignment and difference fusion of component spatial structures via feature concatenation and dynamic weight assignment, enabling the model to capture subtle changes in object local motion. Experimental data shows that while maintaining a real-time inference speed of 107 FPS, it achieves a 6.7% improvement in localization accuracy compared to <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>] on the KITTI [<xref ref-type="bibr" rid="ref-14">14</xref>] pedestrian tracking task, validating the critical role of local motion cues in tracking complex objects.</p>
<p>SiamMo [<xref ref-type="bibr" rid="ref-54">54</xref>] further explored the deep integration of Siamese network architectures with motion modeling. The model employs a dual-branch feature extractor to process point clouds from adjacent frames, respectively, captures object motion trajectories across multi-resolution scales via a spatio-temporal feature aggregation (STFA) module, and innovatively introduces a bounding box-aware feature encoding (BFE) mechanism. This mechanism embeds prior information such as object size and orientation into the motion feature space, enhancing the modeling capacity for the dynamic characteristics of rigid body motion.</p>
<p>From early coarse-grained position prediction based on rigid motion assumptions, to fine-grained modeling of part-level motion features, and then to the deep integration of Siamese architectures with motion dynamics, the evolution of motion modeling methods has consistently centered on the core issue of &#x201C;how to more accurately characterize the spatio-temporal motion characteristics of an object&#x201D;. Such methods have not only overcome the constraints of point cloud data quality on tracking performance but also provided new technical pathways for addressing challenges like object occlusion, deformation, and fast motion in complex scenarios through the deep mining of temporal information.</p>
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Attention</title>
<p>The introduction of Transformer architectures [<xref ref-type="bibr" rid="ref-33">33</xref>] marked a pivotal shift in feature modeling paradigms. By virtue of the self-attention mechanism&#x2019;s capability to efficiently model long-range dependencies, Transformer architectures broke through the limitations of traditional Siamese networks that rely on fixed matching modules, thereby exhibiting superior feature fusion and object localization capabilities in complex dynamic scenarios.</p>
<p>Note that the inter-frame attention discussed here differs from the intra-frame attention in <xref ref-type="sec" rid="s2_4">Section 2.4</xref>. While <xref ref-type="sec" rid="s2_4">Section 2.4</xref> focused on intra-frame attention for extracting local geometric features within the backbone, this section addresses inter-frame attention as a decision mechanism to model the global correlation between the template and the search region.</p>
<p>PTTR [<xref ref-type="bibr" rid="ref-11">11</xref>], as a pioneering work in this field, first constructed a point-relation Transformer framework. It strengthened local feature interaction between the template and search region via self-attention mechanisms, and achieved global-scale feature matching and alignment through cross-attention mechanisms. This end-to-end decision paradigm abandoned the traditional approach of manually designed similarity metrics, enabling the model to adaptively learn geometric correlation features of inter-frame point clouds. PTTR&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-55">55</xref>] further optimized the architecture design by proposing an innovative Point-BEV dual-branch Transformer fusion strategy. This method extracts fine geometric features via MLPs in the point cloud domain, while generating dense feature maps through voxelization and max pooling in the BEV domain. It achieves complementary enhancement of point cloud details and BEV global structures via cross-branch attention mechanisms, thereby realizing multi-scale feature integration. CXTrack [<xref ref-type="bibr" rid="ref-56">56</xref>] introduced an object-centric Transformer, which propagates contextual information and object cues from historical frames to the current frame through attention, effectively mitigating feature degradation caused by object occlusion. SyncTrack [<xref ref-type="bibr" rid="ref-57">57</xref>] completely abandoned the Siamese structure and proposed a single-branch synchronous architecture. By concatenating template and search region point clouds and feeding them into the Transformer, it naturally integrates feature extraction and matching processes using a dynamic attention matrix, significantly improving accuracy while maintaining real-time performance (45 FPS).</p>
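<p>The cross-attention operation that these trackers build on can be written as a short Python sketch. It assumes that search-region features act as queries and template features act as keys and values, and it omits the learned projection matrices, multiple heads, and positional encodings of a full Transformer layer. The function names are hypothetical.</p>
<code language="python">
```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query gathers a weighted sum of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out
```
</code>
<p>Every search point attends to every template point, so the matching weights are learned from data instead of being fixed by a hand-designed similarity metric.</p>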
<p>To address the feature sampling challenge caused by point cloud sparsity, PTTR [<xref ref-type="bibr" rid="ref-11">11</xref>] proposed a relationship-aware sampling strategy. By dynamically selecting highly correlated points based on attention response values between the template and search region, it significantly reduced geometric information loss induced by random sampling. SyncTrack [<xref ref-type="bibr" rid="ref-57">57</xref>] further built upon this optimization by designing an attention-point sampling Transformer, which guides key point selection via attention weights. This approach preserves more discriminative features while maintaining computational efficiency.</p>
<p>Through continuous innovation in Transformer architectures, these methods have driven a paradigm shift in 3D object tracking from manual feature matching to adaptive spatiotemporal modeling. Their technological evolution not only reflects exploratory efforts in multi-scale feature fusion but also demonstrates the ongoing optimization of the balance between real-time performance and tracking accuracy.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Model Update Strategy</title>
<p>Model update strategies serve as a core technical component for addressing object shape variations, scene occlusion, and point cloud sparsity. As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, these mechanisms dynamically adjust tracking model parameters or feature representations to ensure the algorithm maintains stable tracking performance despite temporal variations in object appearance. The evolution of these mechanisms centers on achieving robust adaptive model updates in unstructured point cloud data scenarios.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Function of model update strategy.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-4.tif"/>
</fig>
<sec id="s4_1">
<label>4.1</label>
<title>Shape Compensation and Template Enhancement</title>
<p>To address point cloud sparsity and object occlusion, the methods in this category update the tracking model dynamically through geometric structure repair and feature representation optimization. Their technical core is to improve object representation under sparse point clouds through point cloud geometric completion, dynamic template optimization, and virtual feature generation. The objective is to maintain the template&#x2019;s capability to precisely characterize the target&#x2019;s geometric properties and appearance features throughout the tracking process.</p>
<sec id="s4_1_1">
<label>4.1.1</label>
<title>Geometric Completion and Generation</title>
<p>SC3D [<xref ref-type="bibr" rid="ref-41">41</xref>] pioneered an autoencoder-based shape completion network, which generates complete object geometry by learning the latent representation of sparse point clouds. This network constrains the similarity between generated point clouds and real shapes via Chamfer distance [<xref ref-type="bibr" rid="ref-58">58</xref>], enabling the template to maintain precise characterization of the target&#x2019;s geometric properties during tracking. V2B [<xref ref-type="bibr" rid="ref-48">48</xref>] further developed this approach by proposing a shape-aware feature learning module: it enhances foreground object features through a gating mechanism, generates dense point clouds by combining global-local branches, and compresses spatial information using voxelization and BEV view transformation, thereby significantly improving object localization accuracy in sparse scenarios. Such methods embed shape completion as a regularization term into the tracking framework, ensuring that the template update process balances appearance matching and geometric consistency.</p>
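<p>The Chamfer distance used as a shape constraint can be computed as in the following sketch. It uses the symmetric variant that averages squared nearest-neighbor distances in both directions; other variants (for example, unsquared distances) exist, and the choice here is an assumption. The brute-force search is quadratic in the number of points and is intended only for illustration.</p>
<code language="python">
```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not set_a or not set_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(set_a, set_b) + one_way(set_b, set_a)
```
</code>
<p>The distance is zero when every point of each set coincides with a point of the other set, and it grows as the completed shape departs from the reference shape.</p>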
</sec>
<sec id="s4_1_2">
<label>4.1.2</label>
<title>Cross-Modal and Feature Enhancement</title>
<p>Facing severe point cloud loss caused by occlusion or long-distance observation, MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>] proposed a virtual cue projection mechanism, which uses a lightweight 2D segmentation model to generate instance masks and back-projects image pixels to 3D space to create virtual point clouds. The fused input of these virtual points and original LiDAR point clouds enables the target template to retain highly discriminative features in extreme scenarios. Experiments show that this method increases the target point cloud density by 45%, particularly improving the tracking stability of small targets (e.g., pedestrians, bicycles). Boosting 3D SOT [<xref ref-type="bibr" rid="ref-60">60</xref>] uses a target-aware projection module to map 3D point clouds to the 2D image space, where pre-trained 2D trackers supply mature matching knowledge. It subsequently achieves cross-modal knowledge transfer through an IoU-weighted distillation loss, allowing the 3D tracker to learn robust matching capabilities even with limited point cloud data. This cross-domain knowledge transfer strategy effectively mitigates the problem of insufficient training data in 3D point cloud tracking scenarios.</p>
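<p>The back-projection of masked image pixels into virtual 3D points can be sketched with the standard pinhole camera model. The sketch assumes that a depth value is already available for each pixel and returns points in the camera frame; the extrinsic transform to the LiDAR frame is omitted. The function name and parameter layout are hypothetical.</p>
<code language="python">
```python
def back_project(pixels, depths, mask, fx, fy, cx, cy):
    """Lift masked pixels (u, v) with depth d to camera-frame points (X, Y, Z).

    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    Pixels outside the instance mask or with non-positive depth are skipped.
    """
    points = []
    for (u, v), d, keep in zip(pixels, depths, mask):
        if keep and d > 0.0:
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points
```
</code>
<p>The virtual points produced in this way can be concatenated with the LiDAR points of the target after both are expressed in a common coordinate frame.</p>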
<p>SCVTrack [<xref ref-type="bibr" rid="ref-61">61</xref>] proposed a synthetic object representation method based on generative adversarial networks. This approach introduces a point cloud quality assessment module to perform structural integrity detection on input sparse point clouds, and then synthesizes missing geometric surfaces using a conditional GAN [<xref ref-type="bibr" rid="ref-62">62</xref>]. Its key innovation lies in designing a multi-scale Chamfer distance and normal vector consistency joint loss function, ensuring that the synthesized point clouds not only approximate real objects in spatial position but also retain accurate surface normal information, thereby constructing high-precision target templates. PillarTrack [<xref ref-type="bibr" rid="ref-63">63</xref>] focused on robust encoding of point cloud features, proposing pyramid-encoded pillar features. By leveraging a hierarchical coordinate encoding strategy to balance numerical differences across the dimensions of point cloud coordinates, it significantly enhances the invariance of features to 3D spatial transformations. Combined with a modality-aware Transformer backbone network, this method strengthens geometric information encoding in the early stages of feature extraction and improves template discriminability in dynamic scenarios through rational allocation of computational resources. This optimization approach at the fundamental level of feature representation provides effective support for stable template updates in complex environments.</p>
<p>The above methods generally adopt incremental update strategies, fusing reliable point cloud information from high-confidence detection results frame by frame. Through this selective integration approach, they avoid template drift caused by single-frame error accumulation. Experimental results demonstrate that shape compensation and template enhancement mechanisms have shown significant effectiveness in challenging scenarios such as long-range tracking and complex occlusion on mainstream datasets.</p>
</sec>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Temporal Memory and Model Update</title>
<p>Temporal memory and model update strategies have provided key technical pathways for addressing challenges such as object occlusion, deformation, and template drift in long-term tracking by exploiting spatiotemporal correlation information from historical frames. These methods rely on constructing efficient historical information management mechanisms, which enable adaptive updates of tracking models to accommodate object motion patterns through dynamic selection and fusion of historical frame features or templates. Their technological evolution has been prominently reflected in the paradigm shift from independent single-frame processing to temporal dependency modeling.</p>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Explicit Template Management</title>
<p>TAT [<xref ref-type="bibr" rid="ref-64">64</xref>] proposed a template selection mechanism based on 3D IoU scoring. This method designed a confidence evaluation network to select high-quality templates exhibiting strong geometric matching with the current target from historical frames, thereby constructing a dynamic template set. Building on this, it enhanced cross-template feature interaction using a linear attention mechanism and performed weighted fusion of template information in chronological order via a recurrent neural network, effectively suppressing interference from low-quality early templates in current decisions. This temporal selective memory strategy enables the model to maintain tracking cues through historical high-quality templates even during temporary object occlusion, significantly improving trajectory continuity in complex interactive scenarios. MBPTrack [<xref ref-type="bibr" rid="ref-65">65</xref>] introduced an external memory bank architecture, which separately stores point cloud features and object mask information from historical frames. It processed geometric feature propagation and object semantic cue propagation separately via a decoupled Transformer module. This design avoided mutual interference between different modalities of information during fusion, achieving efficient reuse of historical object states. When the target reappears from occlusion, the complete geometric features stored in the memory bank can quickly restore the object&#x2019;s representation, effectively addressing tracking interruption issues in traditional methods caused by single-frame information loss.</p>
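<p>Score-based template selection can be sketched as follows. The sketch assumes that each historical frame already carries a scalar quality score (for example, a predicted 3D IoU) and keeps the highest-scoring templates in chronological order; the tuple layout, the threshold, and the function name are hypothetical and do not reproduce the selection network of any specific tracker.</p>
<code language="python">
```python
def select_templates(history, k, min_score=0.0):
    """Pick up to k templates with the highest scores, returned oldest first.

    history: list of (frame_index, score, template) tuples.
    """
    good = [h for h in history if h[1] >= min_score]
    top = sorted(good, key=lambda h: h[1], reverse=True)[:k]
    return sorted(top, key=lambda h: h[0])
```
</code>
<p>Returning the selected templates in temporal order keeps them suitable for a recurrent or attention-based fusion step that depends on chronology.</p>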
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Sequence and Streaming Modeling</title>
<p>SeqTrack3D [<xref ref-type="bibr" rid="ref-66">66</xref>] integrated the sequence-to-sequence paradigm into 3D object tracking, constructing an end-to-end prediction framework based on historical point cloud and bounding box sequences. The method&#x2019;s encoder extracted multi-scale motion cues via a spatiotemporal feature pyramid, while the decoder generated target position queries conditioned on historical bounding box sequences. By leveraging self-attention mechanisms, it explicitly modeled the temporal continuity of trajectories. Its core innovation lies in transforming object motion constraints into physical law constraints of bounding box sequences, thereby compelling the network to learn position transition patterns consistent with rigid motion characteristics. This enables precise prediction of object motion trends in long-term sequence tracking. StreamTrack [<xref ref-type="bibr" rid="ref-67">67</xref>], targeting real-time requirements, designed a lightweight streaming processing architecture. It inputs only the current frame&#x2019;s point cloud into the backbone network and interacts with key features from historical frames through memory bank indexing. Adopting a hybrid attention mechanism, this method captures temporal changes in the target&#x2019;s geometric structure at the local level and models dynamic characteristics of cross-frame motion at the global level, effectively balancing computational efficiency and long-range dependency modeling capability.</p>
<p>Facing the challenge of tracking drift in occluded scenarios, STMD-Tracker [<xref ref-type="bibr" rid="ref-68">68</xref>] constructed a bidirectional cross-frame memory module, creatively incorporating future frame information to contribute to feature compensation for the current frame. In the forward process, the model enhanced the feature representation of the current frame using the complete object information from future frames; in the backward process, it corrected the current state through historical memory, forming a closed-loop update of spatiotemporal information. Combined with a Gaussian spatial masking mechanism, this method weights features based on the spatial distribution of target centers, effectively suppressing the influence of background interfering point clouds. It maintains precise localization of the core region even when the target is partially occluded. M3SOT [<xref ref-type="bibr" rid="ref-69">69</xref>] enhanced model robustness at the feature representation level by constructing a multi-solution space in the intermediate layers of the Transformer, simultaneously performing masked point prediction and target center localization tasks. This progressive training mechanism compelled the network to learn multi-dimensional feature representations of the target. Even with sparse point cloud inputs, it can infer the complete spatial position of the target through temporal context reasoning from historical frames, significantly improving adaptability to extremely sparse scenarios.</p>
<p>Essentially, these methods have continuously optimized the memory-reasoning-update closed loop in object tracking. From early rule-based template selection, to data-driven sequential modeling, and further to bidirectional compensation mechanisms integrating multimodal information, such methods have gradually overcome the limitations of single-frame processing by explicitly modeling the temporal dependency of object motion.</p>
</sec>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Tracking Failure Detection and Relocalization</title>
<p>Addressing the issue of trajectory interruption caused by object occlusion and point cloud sparsity in complex dynamic scenarios, recent studies have significantly improved the recovery capability of tracking systems under extreme conditions by constructing precise failure detection metrics and efficient relocalization strategies.</p>
<p>DeepPCT [<xref ref-type="bibr" rid="ref-70">70</xref>] systematically constructed a detection-relocalization closed-loop mechanism. The core innovation of this method lies in proposing the center intersection-over-union (CIoU) metric, which extends the traditional IoU by introducing a normalized center distance. This effectively addresses the challenge of quantifying spatial relationships for non-overlapping bounding boxes. Compared to traditional metrics that only focus on region overlap, CIoU can more accurately characterize the relative positional deviation between predicted and ground truth boxes, making it particularly suitable for object localization evaluation in sparse point cloud scenarios. Upon detecting that the CIoU value falls below a predefined confidence threshold, the system automatically activates the relocalization module. This module invokes a pre-trained 3D object detector to perform high-density point cloud reconstruction and object proposal generation in the global search region, thereby achieving geometric feature matching and position correction for drifted targets. Experimental data shows that this mechanism increases the success rate by 18 percentage points in long-term tracking tasks on the KITTI [<xref ref-type="bibr" rid="ref-14">14</xref>] dataset. Notably, in extremely sparse scenarios where the number of single-object points is less than 150, the relocalization response time is improved by 40% compared to baseline methods, significantly reducing the risk of tracking failure caused by feature loss.</p>
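<p>The role of a center-distance term for non-overlapping boxes can be illustrated with the sketch below. The exact CIoU formulation is not given here, so the sketch assumes axis-aligned 3D boxes and subtracts from the IoU a squared center distance normalized by the squared diagonal of the smallest enclosing box; this specific form, the box layout, and the function names are assumptions made for illustration.</p>
<code language="python">
```python
def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def center_aware_iou(a, b):
    """IoU minus a normalized squared center distance (assumed penalty form)."""
    inter = 1.0
    for i in range(3):
        inter *= max(min(a[i + 3], b[i + 3]) - max(a[i], b[i]), 0.0)
    union = volume(a) + volume(b) - inter
    iou = inter / union if union > 0.0 else 0.0
    center_sq = sum(
        ((a[i] + a[i + 3]) / 2.0 - (b[i] + b[i + 3]) / 2.0) ** 2 for i in range(3)
    )
    diag_sq = sum((max(a[i + 3], b[i + 3]) - min(a[i], b[i])) ** 2 for i in range(3))
    penalty = center_sq / diag_sq if diag_sq > 0.0 else 0.0
    return iou - penalty


def needs_relocalization(score, threshold):
    """Trigger relocalization when the score drops below the threshold."""
    return threshold > score
```
</code>
<p>Plain IoU is zero for every pair of disjoint boxes, whereas this score keeps decreasing as the boxes move apart, which gives the failure detector a graded signal.</p>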
<p>In the field of implicit anti-degradation research, LTTR [<xref ref-type="bibr" rid="ref-71">71</xref>] constructed a Transformer-based encoder-decoder framework, enhancing feature robustness in sparse scenarios by explicitly modeling point cloud region relationships. This method breaks through the local neighborhood limitation of traditional point cloud processing, capturing long-range dependencies between any two points using a multi-head self-attention mechanism while retaining the prior knowledge of spatial distribution of point clouds through a positional encoding module. STNet [<xref ref-type="bibr" rid="ref-72">72</xref>] and CorpNet [<xref ref-type="bibr" rid="ref-73">73</xref>] represent technical pathways for enhancing robustness via network structure optimization. STNet [<xref ref-type="bibr" rid="ref-72">72</xref>] designed an iterative cross-self-feature enhancement module: during the feature interaction phase, it embeds the global geometric information of the template object into the current frame&#x2019;s feature space via cross-attention, then aggregates locally semantically similar points using a self-attention mechanism, forming a multi-round progressive feature optimization process. This mechanism effectively suppresses error accumulation in short-term object occlusion scenarios, maintaining tracking trajectory continuity through iterative calibration of historical frame features. CorpNet [<xref ref-type="bibr" rid="ref-73">73</xref>] focused on multi-scale feature encoding, constructing a correlation pyramid structure with 512/256/128-point hierarchies. It simultaneously preserves high-level semantic information and low-level geometric details during feature extraction. Through a cross-layer feature fusion strategy, this method significantly enhances structural awareness capability in sparse point cloud regions, reducing the probability of tracking failure caused by local feature loss by 23%.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiments</title>
<sec id="s5_1">
<label>5.1</label>
<title>Comparison of All Methods</title>
<p>To ensure a comprehensive and rigorous evaluation of 3D single object tracking methods, we conducted systematic comparisons on two benchmark datasets, KITTI [<xref ref-type="bibr" rid="ref-14">14</xref>] and NuScenes [<xref ref-type="bibr" rid="ref-15">15</xref>], which are widely recognized for their representativeness in autonomous driving scenarios. Specifically, the KITTI dataset comprises 21 video sequences for training and 29 for testing, with each sequence containing synchronized LiDAR point clouds and camera images captured in urban, rural, and highway environments. Because the official test set labels are not accessible, we further split the training set into three non-overlapping subsets (train/val/test) following standard protocols, ensuring unbiased performance validation. NuScenes, a larger-scale dataset with richer scene diversity, includes 1000 scenes covering various weather conditions (e.g., sunny, rainy, foggy) and time periods (day and night), which are partitioned into 700 training scenes, 150 validation scenes, and 150 test scenes to support comprehensive model generalization analysis.</p>
<p>In terms of evaluation categories, we focused on typical object classes critical to autonomous driving: Car, Pedestrian, Van, and Cyclist on KITTI, and Car, Pedestrian, Truck, Trailer, and Bus on NuScenes. These categories were selected based on their high frequency in traffic scenarios and the distinct challenges they pose (e.g., small size for Pedestrians, large volume for Trucks/Trailers), ensuring the evaluation covers diverse tracking scenarios.</p>
<p>The quantitative evaluation results of state-of-the-art algorithms on the KITTI and NuScenes datasets are presented in <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref>, respectively (&#x201C;&#x2013;&#x201D; indicates that no result is reported for the method; the best and second-best results are highlighted in red and blue, respectively). Each cell reports Success/Precision. To ensure a fair comparison, we adopt Success and Precision as defined in one-pass evaluation (OPE) [<xref ref-type="bibr" rid="ref-75">75</xref>,<xref ref-type="bibr" rid="ref-76">76</xref>] as the evaluation metrics. Success is the Area Under the Curve (AUC) of the fraction of frames in which the IoU between the predicted box and the ground-truth box exceeds a threshold, as the threshold varies from 0 to 1. Precision is the AUC of the fraction of frames in which the distance between the centers of the predicted box and the ground-truth box is within a threshold, as the threshold varies from 0 to 2 m.</p>
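<p>As a minimal sketch of how these two metrics can be computed, the following Python code takes per-frame IoU values and center distances as given and integrates the resulting curves. The function names, the choice of 21 evenly spaced thresholds, and the scaling to the 0 to 100 range are assumptions made for illustration; the 3D IoU computation itself is omitted.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def _normalized_auc(curve, thresholds):
    """Trapezoidal area under the curve, divided by the threshold range."""
    area = ((curve[:-1] + curve[1:]) / 2.0 * np.diff(thresholds)).sum()
    return area / (thresholds[-1] - thresholds[0])


def success_auc(ious, num_thresholds=21):
    """Success: AUC of the fraction of frames whose IoU exceeds each threshold."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)  # assumed sampling
    curve = np.array([(ious > t).mean() for t in thresholds])
    return 100.0 * _normalized_auc(curve, thresholds)


def precision_auc(center_dists, max_dist=2.0, num_thresholds=21):
    """Precision: AUC of the fraction of frames whose center error is within each threshold."""
    dists = np.asarray(center_dists, dtype=float)
    thresholds = np.linspace(0.0, max_dist, num_thresholds)  # 0 to 2 m
    curve = np.array([(dists <= t).mean() for t in thresholds])
    return 100.0 * _normalized_auc(curve, thresholds)


# Toy per-frame values for a four-frame sequence (illustrative only).
print(success_auc([0.8, 0.6, 0.7, 0.1]))
print(precision_auc([0.1, 0.3, 0.2, 2.5]))
```
]]></code>
<p>Under these assumptions, a tracker whose center error is zero in every frame scores a Precision of 100, while a tracker whose boxes never overlap the ground truth scores a Success of 0.</p>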
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison on KITTI dataset.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Method</th>
<th align="center">Source</th>
<th>Car [6424]</th>
<th>Pedestrian [6088]</th>
<th align="center">Van [1248]</th>
<th>Cyclist [308]</th>
<th>Mean [14,068]</th>
</tr>
</thead>
<tbody>
<tr>
<td>SC3D [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>CVPR19</td>
<td>41.3/57.9</td>
<td>18.2/37.8</td>
<td>40.4/47.0</td>
<td>41.5/70.4</td>
<td>31.2/48.5</td>
</tr>
<tr>
<td>P2B [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>CVPR20</td>
<td>56.2/72.8</td>
<td>28.7/49.6</td>
<td>40.8/48.4</td>
<td>32.1/44.7</td>
<td>42.4/60.0</td>
</tr>
<tr>
<td>3D-SiamRPN [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>JSEN20</td>
<td>58.2/76.2</td>
<td>35.2/56.2</td>
<td>45.7/52.9</td>
<td>36.2/49.0</td>
<td>46.7/64.9</td>
</tr>
<tr>
<td>PTT [<xref ref-type="bibr" rid="ref-74">74</xref>]</td>
<td>IROS21</td>
<td>67.8/81.8</td>
<td>44.9/72.0</td>
<td>43.6/52.5</td>
<td>37.2/47.3</td>
<td>55.1/74.2</td>
</tr>
<tr>
<td>LTTR [<xref ref-type="bibr" rid="ref-71">71</xref>]</td>
<td>BMVC21</td>
<td>65.0/77.1</td>
<td>33.2/56.8</td>
<td>35.8/45.6</td>
<td>66.2/89.9</td>
<td>48.7/65.8</td>
</tr>
<tr>
<td>BAT [<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>ICCV21</td>
<td>60.5/77.7</td>
<td>42.1/70.1</td>
<td>52.4/67.0</td>
<td>33.7/45.4</td>
<td>51.2/72.8</td>
</tr>
<tr>
<td>V2B [<xref ref-type="bibr" rid="ref-48">48</xref>]</td>
<td>NeurIPS21</td>
<td>70.5/81.3</td>
<td>48.3/73.5</td>
<td>50.1/58.0</td>
<td>40.8/49.7</td>
<td>58.4/75.2</td>
</tr>
<tr>
<td>GPT [<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>AAAI22</td>
<td>59.1/75.6</td>
<td>35.2/63.6</td>
<td>49.6/60.6</td>
<td>34.3/46.3</td>
<td>47.4/68.4</td>
</tr>
<tr>
<td>STNet [<xref ref-type="bibr" rid="ref-72">72</xref>]</td>
<td>ECCV22</td>
<td>72.1/84.0</td>
<td>49.9/77.2</td>
<td>58.0/70.6</td>
<td>73.5/93.7</td>
<td>61.3/80.1</td>
</tr>
<tr>
<td><inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>CVPR22</td>
<td>65.5/80.8</td>
<td>61.5/88.2</td>
<td>53.8/70.7</td>
<td>73.2/93.5</td>
<td>62.9/83.4</td>
</tr>
<tr>
<td>PTTR [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>CVPR22</td>
<td>65.2/77.4</td>
<td>50.9/81.6</td>
<td>52.5/61.8</td>
<td>65.1/90.5</td>
<td>57.9/78.2</td>
</tr>
<tr>
<td>TAT [<xref ref-type="bibr" rid="ref-64">64</xref>]</td>
<td>ACCV22</td>
<td>72.2/83.3</td>
<td>57.4/84.4</td>
<td>58.9/69.2</td>
<td>74.2/93.9</td>
<td>64.7/82.8</td>
</tr>
<tr>
<td>GLT-T [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>AAAI23</td>
<td>68.2/82.1</td>
<td>52.4/78.8</td>
<td>52.6/62.9</td>
<td>68.9/92.1</td>
<td>60.1/79.3</td>
</tr>
<tr>
<td>CAT [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>TNNLS23</td>
<td>66.6/81.8</td>
<td>51.6/77.7</td>
<td>53.1/69.8</td>
<td>67.0/90.1</td>
<td>58.9/79.1</td>
</tr>
<tr>
<td>BEVTrack [<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>arXiv23</td>
<td>74.9/86.5</td>
<td>69.5/94.3</td>
<td>66.0/77.2</td>
<td>77.0/94.7</td>
<td><styled-content style-type="color" style="color: blue;">71.8</styled-content>/89.2</td>
</tr>
<tr>
<td>DMT [<xref ref-type="bibr" rid="ref-52">52</xref>]</td>
<td>TITS23</td>
<td>66.4/79.4</td>
<td>48.1/77.9</td>
<td>53.3/65.6</td>
<td>70.4/92.6</td>
<td>55.1/75.8</td>
</tr>
<tr>
<td>CorpNet [<xref ref-type="bibr" rid="ref-73">73</xref>]</td>
<td>CVPR23</td>
<td>73.6/84.1</td>
<td>55.6/82.4</td>
<td>58.7/66.5</td>
<td>74.3/94.2</td>
<td>64.5/82.0</td>
</tr>
<tr>
<td>CXTrack [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>CVPR23</td>
<td>69.1/81.6</td>
<td>67.0/91.5</td>
<td>60.0/71.8</td>
<td>74.2/94.3</td>
<td>67.5/85.3</td>
</tr>
<tr>
<td>SyncTrack [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>ICCV23</td>
<td>73.3/85.0</td>
<td>54.7/80.5</td>
<td>60.3/70.0</td>
<td>73.1/93.8</td>
<td>64.1/81.9</td>
</tr>
<tr>
<td>MBPTrack [<xref ref-type="bibr" rid="ref-65">65</xref>]</td>
<td>ICCV23</td>
<td>73.4/84.8</td>
<td>68.6/93.9</td>
<td>61.3/72.7</td>
<td>76.7/94.3</td>
<td>70.3/87.9</td>
</tr>
<tr>
<td>MMF-Track [<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>TIV23</td>
<td>73.9/84.1</td>
<td>61.7/89.3</td>
<td>59.3/72.5</td>
<td>75.9/<styled-content style-type="color" style="color: red;">95.0</styled-content></td>
<td>67.4/85.6</td>
</tr>
<tr>
<td>MTM-Tracker [<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>RAL23</td>
<td>73.1/84.5</td>
<td><styled-content style-type="color" style="color: blue;">70.4</styled-content>/<styled-content style-type="color" style="color: red;">95.1</styled-content></td>
<td>60.8/74.2</td>
<td>76.7/94.6</td>
<td>70.9/88.4</td>
</tr>
<tr>
<td>SCVTrack [<xref ref-type="bibr" rid="ref-61">61</xref>]</td>
<td>AAAI24</td>
<td>68.7/81.9</td>
<td>62.0/89.1</td>
<td>58.6/72.8</td>
<td>77.4/94.4</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>StreamTrack [<xref ref-type="bibr" rid="ref-67">67</xref>]</td>
<td>AAAI24</td>
<td>72.6/83.7</td>
<td><styled-content style-type="color" style="color: red;">70.5</styled-content>/<styled-content style-type="color" style="color: blue;">94.7</styled-content></td>
<td>61.0/76.9</td>
<td><styled-content style-type="color" style="color: blue;">78.1</styled-content>/94.6</td>
<td>70.8/88.1</td>
</tr>
<tr>
<td>SeqTrack3D [<xref ref-type="bibr" rid="ref-66">66</xref>]</td>
<td>ICRA24</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PTTR&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>TPAMI24</td>
<td>73.4/84.5</td>
<td>55.2/84.7</td>
<td>55.1/62.2</td>
<td>71.6/92.8</td>
<td>63.9/82.8</td>
</tr>
<tr>
<td>VoxelTrack [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>arXiv24</td>
<td>72.5/84.7</td>
<td>67.8/92.6</td>
<td><styled-content style-type="color" style="color: blue;">69.8</styled-content>/<styled-content style-type="color" style="color: blue;">83.6</styled-content></td>
<td>75.1/94.7</td>
<td>70.4/88.3</td>
</tr>
<tr>
<td>MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
<td>arXiv24</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SiamMo [<xref ref-type="bibr" rid="ref-54">54</xref>]</td>
<td>arXiv24</td>
<td><styled-content style-type="color" style="color: red;">76.3</styled-content>/<styled-content style-type="color" style="color: red;">88.1</styled-content></td>
<td>68.6/93.9</td>
<td>67.9/80.5</td>
<td><styled-content style-type="color" style="color: red;">78.5</styled-content>/94.8</td>
<td><styled-content style-type="color" style="color: red;">72.3</styled-content>/<styled-content style-type="color" style="color: red;">90.1</styled-content></td>
</tr>
<tr>
<td>M3SOT [<xref ref-type="bibr" rid="ref-69">69</xref>]</td>
<td>AAAI24</td>
<td><styled-content style-type="color" style="color: blue;">75.9</styled-content>/<styled-content style-type="color" style="color: blue;">87.4</styled-content></td>
<td>66.6/92.5</td>
<td>59.4/74.7</td>
<td>70.3/93.4</td>
<td>70.3/88.6</td>
</tr>
<tr>
<td>MemDisst [<xref ref-type="bibr" rid="ref-60">60</xref>]</td>
<td>ECCV24</td>
<td>74.1/85.6</td>
<td>69.1/94.1</td>
<td>66.6/79.3</td>
<td>77.2/94.7</td>
<td>71.3/88.9</td>
</tr>
<tr>
<td>PillarTrack [<xref ref-type="bibr" rid="ref-63">63</xref>]</td>
<td>arXiv24</td>
<td>74.2/85.1</td>
<td>59.7/84.7</td>
<td>61.0/69.2</td>
<td>78.0/<styled-content style-type="color" style="color: red;">95.0</styled-content></td>
<td>66.8/83.7</td>
</tr>
<tr>
<td>STMD-Tracker [<xref ref-type="bibr" rid="ref-68">68</xref>]</td>
<td>arXiv24</td>
<td>73.7/85.2</td>
<td>68.9/94.2</td>
<td>61.4/72.7</td>
<td>76.9/<styled-content style-type="color" style="color: blue;">94.9</styled-content></td>
<td>70.6/88.2</td>
</tr>
<tr>
<td>P2P [<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>IJCV25</td>
<td>73.6/85.7</td>
<td>69.6/94.0</td>
<td><styled-content style-type="color" style="color: red;">70.3</styled-content>/<styled-content style-type="color" style="color: red;">83.9</styled-content></td>
<td>75.5/94.6</td>
<td>71.7/<styled-content style-type="color" style="color: blue;">89.4</styled-content></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-1fn1" fn-type="other">
<p>Note: Each cell reports Success/Precision. The best and second-best results are highlighted in <styled-content style-type="color" style="color: red;"><bold>red</bold></styled-content> and <styled-content style-type="color" style="color: blue;"><bold>blue</bold></styled-content>, respectively. &#x201C;&#x2013;&#x201D; denotes that the result is not available.</p>
</fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison on NuScenes dataset.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Source</th>
<th>Car [64,159]</th>
<th>Pedestrian [33,227]</th>
<th>Truck [13,587]</th>
<th>Trailer [3352]</th>
<th>Bus [2953]</th>
<th>Mean [117,278]</th>
</tr>
</thead>
<tbody>
<tr>
<td>SC3D [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>CVPR19</td>
<td>22.31/21.93</td>
<td>11.29/12.65</td>
<td>30.67/27.73</td>
<td>35.28/28.12</td>
<td>29.35/24.08</td>
<td>20.70/20.20</td>
</tr>
<tr>
<td>P2B [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>CVPR20</td>
<td>38.81/43.18</td>
<td>28.39/52.24</td>
<td>42.95/41.59</td>
<td>48.96/40.05</td>
<td>32.95/27.41</td>
<td>36.48/45.08</td>
</tr>
<tr>
<td>3D-SiamRPN [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>JSEN20</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PTT [<xref ref-type="bibr" rid="ref-74">74</xref>]</td>
<td>IROS21</td>
<td>41.22/45.26</td>
<td>19.33/32.03</td>
<td>50.23/48.56</td>
<td>51.70/46.50</td>
<td>39.40/36.70</td>
<td>36.33/41.72</td>
</tr>
<tr>
<td>LTTR [<xref ref-type="bibr" rid="ref-71">71</xref>]</td>
<td>BMVC21</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>BAT [<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>ICCV21</td>
<td>40.73/43.29</td>
<td>28.83/53.32</td>
<td>45.34/42.58</td>
<td>52.59/44.89</td>
<td>35.44/28.01</td>
<td>38.10/45.71</td>
</tr>
<tr>
<td>V2B [<xref ref-type="bibr" rid="ref-48">48</xref>]</td>
<td>NeurIPS21</td>
<td>54.40/59.70</td>
<td>30.10/55.40</td>
<td>53.70/54.50</td>
<td>54.90/51.44</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>GPT [<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>AAAI22</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>STNet [<xref ref-type="bibr" rid="ref-72">72</xref>]</td>
<td>ECCV22</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td><inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>CVPR22</td>
<td>55.85/65.09</td>
<td>32.10/60.92</td>
<td>57.36/59.54</td>
<td>57.61/58.26</td>
<td>51.39/51.44</td>
<td>49.23/62.73</td>
</tr>
<tr>
<td>PTTR [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>CVPR22</td>
<td>51.89/58.61</td>
<td>29.90/45.09</td>
<td>45.30/44.74</td>
<td>45.87/38.36</td>
<td>43.14/37.74</td>
<td>44.50/52.07</td>
</tr>
<tr>
<td>TAT [<xref ref-type="bibr" rid="ref-64">64</xref>]</td>
<td>ACCV22</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>GLT-T [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>AAAI23</td>
<td>48.52/54.29</td>
<td>31.74/56.49</td>
<td>52.74/51.43</td>
<td>57.60/52.01</td>
<td>44.55/40.69</td>
<td>44.42/54.33</td>
</tr>
<tr>
<td>CAT [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>TNNLS23</td>
<td>43.34/49.41</td>
<td>30.68/56.67</td>
<td>47.64/48.10</td>
<td>57.90/55.31</td>
<td>43.30/41.42</td>
<td>40.67/51.28</td>
</tr>
<tr>
<td>BEVTrack [<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>arXiv23</td>
<td>64.31/71.14</td>
<td>46.28/<styled-content style-type="color" style="color: red;">76.77</styled-content></td>
<td><styled-content style-type="color" style="color: blue;">66.83</styled-content>/<styled-content style-type="color" style="color: blue;">67.04</styled-content></td>
<td><styled-content style-type="color" style="color: red;">75.54</styled-content>/<styled-content style-type="color" style="color: red;">71.62</styled-content></td>
<td><styled-content style-type="color" style="color: blue;">61.09</styled-content>/56.68</td>
<td>59.71/71.19</td>
</tr>
<tr>
<td>DMT [<xref ref-type="bibr" rid="ref-52">52</xref>]</td>
<td>TITS23</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>CorpNet [<xref ref-type="bibr" rid="ref-73">73</xref>]</td>
<td>CVPR23</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>CXTrack [<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>CVPR23</td>
<td>48.92/55.61</td>
<td>31.67/56.64</td>
<td>51.40/50.93</td>
<td>60.64/54.44</td>
<td>40.11/35.83</td>
<td>44.43/54.83</td>
</tr>
<tr>
<td>SyncTrack [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>ICCV23</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>MBPTrack [<xref ref-type="bibr" rid="ref-65">65</xref>]</td>
<td>ICCV23</td>
<td>62.47/70.41</td>
<td>45.32/74.03</td>
<td>62.18/63.31</td>
<td>65.14/61.33</td>
<td>55.41/51.76</td>
<td>57.48/69.88</td>
</tr>
<tr>
<td>MMF-Track [<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>TIV23</td>
<td>50.73/58.85</td>
<td>32.80/66.25</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>52.73/52.98</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>MTM-Tracker [<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>RAL23</td>
<td>57.05/65.94</td>
<td>37.08/72.80</td>
<td>59.37/63.69</td>
<td>59.73/60.44</td>
<td>55.46/54.31</td>
<td>51.70/67.17</td>
</tr>
<tr>
<td>SCVTrack [<xref ref-type="bibr" rid="ref-61">61</xref>]</td>
<td>AAAI24</td>
<td>58.9/67.7</td>
<td>34.5/61.5</td>
<td>60.6/61.4</td>
<td>59.5/60.1</td>
<td>54.3/53.6</td>
<td>52.1/64.7</td>
</tr>
<tr>
<td>StreamTrack [<xref ref-type="bibr" rid="ref-67">67</xref>]</td>
<td>AAAI24</td>
<td>62.65/70.81</td>
<td>38.43/68.58</td>
<td>64.67/66.60</td>
<td>66.67/64.27</td>
<td>60.66/<styled-content style-type="color" style="color: blue;">59.74</styled-content></td>
<td>55.75/69.22</td>
</tr>
<tr>
<td>SeqTrack3D [<xref ref-type="bibr" rid="ref-66">66</xref>]</td>
<td>ICRA24</td>
<td>62.55/71.46</td>
<td>39.94/68.57</td>
<td>60.97/63.04</td>
<td>68.37/61.76</td>
<td>54.33/53.52</td>
<td>55.92/68.94</td>
</tr>
<tr>
<td>PTTR&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>TPAMI24</td>
<td>59.96/66.73</td>
<td>32.49/50.50</td>
<td>59.85/61.20</td>
<td>54.51/50.28</td>
<td>53.98/51.22</td>
<td>51.86/60.63</td>
</tr>
<tr>
<td>VoxelTrack [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>arXiv24</td>
<td>63.9/71.6</td>
<td>46.8/75.9</td>
<td>64.8/65.9</td>
<td>69.5/64.3</td>
<td>60.1/57.7</td>
<td>59.0/71.4</td>
</tr>
<tr>
<td>MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
<td>arXiv24</td>
<td><styled-content style-type="color" style="color: red;">66.76</styled-content>/<styled-content style-type="color" style="color: red;">73.76</styled-content></td>
<td><styled-content style-type="color" style="color: red;">47.34</styled-content>/<styled-content style-type="color" style="color: blue;">76.64</styled-content></td>
<td>66.21/66.33</td>
<td>72.73/69.80</td>
<td>60.20/58.88</td>
<td><styled-content style-type="color" style="color: red;">61.20</styled-content>/<styled-content style-type="color" style="color: red;">73.22</styled-content></td>
</tr>
<tr>
<td>SiamMo [<xref ref-type="bibr" rid="ref-54">54</xref>]</td>
<td>arXiv24</td>
<td>64.95/72.24</td>
<td>46.23/76.25</td>
<td><styled-content style-type="color" style="color: red;">68.22</styled-content>/<styled-content style-type="color" style="color: red;">68.81</styled-content></td>
<td><styled-content style-type="color" style="color: blue;">74.21</styled-content>/<styled-content style-type="color" style="color: blue;">70.63</styled-content></td>
<td><styled-content style-type="color" style="color: red;">65.63</styled-content>/<styled-content style-type="color" style="color: red;">62.07</styled-content></td>
<td><styled-content style-type="color" style="color: blue;">60.31</styled-content>/<styled-content style-type="color" style="color: blue;">72.68</styled-content></td>
</tr>
<tr>
<td>M3SOT [<xref ref-type="bibr" rid="ref-69">69</xref>]</td>
<td>AAAI24</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>MemDisst [<xref ref-type="bibr" rid="ref-60">60</xref>]</td>
<td>ECCV24</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>PillarTrack [<xref ref-type="bibr" rid="ref-63">63</xref>]</td>
<td>arXiv24</td>
<td>47.12/57.72</td>
<td>34.18/64.93</td>
<td>54.82/54.41</td>
<td>57.70/54.63</td>
<td>44.68/40.73</td>
<td>44.59/58.86</td>
</tr>
<tr>
<td>STMD-Tracker [<xref ref-type="bibr" rid="ref-68">68</xref>]</td>
<td>arXiv24</td>
<td>63.05/71.24</td>
<td><styled-content style-type="color" style="color: blue;">46.86</styled-content>/75.27</td>
<td>62.87/63.96</td>
<td>65.24/62.03</td>
<td>56.02/52.88</td>
<td>58.33/70.81</td>
</tr>
<tr>
<td>P2P [<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>IJCV25</td>
<td><styled-content style-type="color" style="color: blue;">65.15</styled-content>/<styled-content style-type="color" style="color: blue;">72.90</styled-content></td>
<td>46.43/75.08</td>
<td>64.96/65.96</td>
<td>70.46/66.86</td>
<td>59.02/56.56</td>
<td>59.84/72.13</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-2fn1" fn-type="other">
<p>Note: Each cell reports Success/Precision. The best and second-best results are highlighted in <styled-content style-type="color" style="color: red;"><bold>red</bold></styled-content> and <styled-content style-type="color" style="color: blue;"><bold>blue</bold></styled-content>, respectively. &#x201C;&#x2013;&#x201D; denotes that the result is not available.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="table" rid="table-1">Table 1</xref> summarizes SOTA 3D single-object tracking performance on the KITTI dataset, with evaluations based on Success and Precision. SiamMo [<xref ref-type="bibr" rid="ref-54">54</xref>] establishes a new SOTA with a mean Success of 72.3 and a mean Precision of 90.1, outperforming MTM-Tracker [<xref ref-type="bibr" rid="ref-50">50</xref>] (70.9/88.4) by 1.4 points in Success and 1.7 points in Precision. This performance gain stems from its spatio-temporal feature aggregation (STFA) module, which explicitly models cross-frame motion trajectories, a critical capability for mitigating partial occlusion and fast motion in urban and highway scenarios.</p>
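<p>The Mean columns of <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref> appear consistent with a frame-weighted average of the per-category scores, with the bracketed frame counts in the table headers acting as weights. This reading is inferred from the reported numbers and is not stated explicitly in the text. The sketch below checks it for the SiamMo row of <xref ref-type="table" rid="table-1">Table 1</xref>; the function name is hypothetical.</p>
<code language="python"><![CDATA[
```python
def frame_weighted_mean(scores, frames):
    """Average per-category scores, weighting each category by its frame count."""
    total_frames = sum(frames.values())
    return sum(scores[c] * frames[c] for c in frames) / total_frames


# Frame counts from the header of Table 1 (KITTI).
kitti_frames = {"Car": 6424, "Pedestrian": 6088, "Van": 1248, "Cyclist": 308}

# SiamMo per-category Success and Precision from Table 1.
siammo_success = {"Car": 76.3, "Pedestrian": 68.6, "Van": 67.9, "Cyclist": 78.5}
siammo_precision = {"Car": 88.1, "Pedestrian": 93.9, "Van": 80.5, "Cyclist": 94.8}

mean_success = round(frame_weighted_mean(siammo_success, kitti_frames), 1)
mean_precision = round(frame_weighted_mean(siammo_precision, kitti_frames), 1)
print(mean_success, mean_precision)  # prints 72.3 90.1, matching the Mean column
```
]]></code>
<p>Because Car and Pedestrian together account for roughly 89% of the KITTI frames, the mean is dominated by these two categories, which is worth keeping in mind when comparing methods whose gains come mainly from Van or Cyclist.</p>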

<p><xref ref-type="table" rid="table-2">Table 2</xref> presents performance evaluations on NuScenes, a larger-scale dataset featuring diverse weather conditions (e.g., rain, fog) and sparse long-range point clouds (&#x003E;50 m), both of which amplify generalization challenges. MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>] achieves the best mean results (61.20 in mean Success, 73.22 in mean Precision), outperforming SiamMo [<xref ref-type="bibr" rid="ref-54">54</xref>] (60.31/72.68) and P2P [<xref ref-type="bibr" rid="ref-53">53</xref>] (59.84/72.13). Its multi-view cross-modal fusion mechanism integrates RGB texture and LiDAR geometric features, compensating for sparse or noisy points, which is a critical capability for addressing the variability of NuScenes.</p>

<p>While <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref> provide exhaustive details, it is beneficial to summarize the performance ceilings of different technical routes. <xref ref-type="table" rid="table-3">Table 3</xref> presents a concise comparison of the representative SOTA methods from each category. As observed, Sequence &#x0026; Update methods (e.g., M3SOT [<xref ref-type="bibr" rid="ref-69">69</xref>], MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>]) currently achieve the highest performance on both datasets, validating the importance of temporal context in handling occlusion. Motion Modeling methods (e.g., P2P [<xref ref-type="bibr" rid="ref-53">53</xref>]) show exceptional robustness on the NuScenes dataset, narrowing the gap with sequence-based methods, whereas traditional Similarity Matching approaches generally lag behind in complex sparse scenarios.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Concise summary of SOTA performance by category.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Category</th>
<th>Representative SOTA</th>
<th>KITTI (Mean Success)</th>
<th>NuScenes (Mean Success)</th>
<th>Key Characteristics</th>
</tr>
</thead>
<tbody>
<tr>
<td><bold>Similarity Matching</bold></td>
<td>MMF-Track [<xref ref-type="bibr" rid="ref-44">44</xref>]/ VoxelTrack [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>67.4</td>
<td>59.0</td>
<td>Robust texture fusion; Fast inference</td>
</tr>
<tr>
<td><bold>Motion Modeling</bold></td>
<td>P2P [<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>71.7</td>
<td>59.8</td>
<td>Explicit kinematic constraints; Rigid-body focus</td>
</tr>
<tr>
<td><bold>Transformer</bold></td>
<td>CorpNet [<xref ref-type="bibr" rid="ref-73">73</xref>]/ PTTR&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>64.5</td>
<td>51.9</td>
<td>Long-range dependency; High computation</td>
</tr>
<tr>
<td><bold>Sequence &#x0026; Update</bold></td>
<td>M3SOT [<xref ref-type="bibr" rid="ref-69">69</xref>]/ MVCTrack [<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
<td><bold>72.3</bold></td>
<td><bold>61.2</bold></td>
<td>Temporal consistency; Best occlusion handling</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Regarding backbone impact, Transformer-based backbones (e.g., PTTR [<xref ref-type="bibr" rid="ref-11">11</xref>]) consistently outperform PointNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-20">20</xref>]-based ones by approximately 5% in Success, validating the importance of global context capture in sparse point clouds. In terms of decision mechanism effectiveness, motion-centric methods (e.g., <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>]) show superior robustness on rigid objects such as cars compared to Similarity Matching, as they explicitly utilize kinematic constraints. However, for non-rigid objects such as pedestrians, attention-based methods show better adaptability. Finally, concerning update strategies, methods utilizing Temporal Memory (e.g., MBPTrack [<xref ref-type="bibr" rid="ref-65">65</xref>]) demonstrate a significant performance boost in long-term tracking scenarios compared to single-frame baselines.</p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Visualization</title>
<p>We select BAT [<xref ref-type="bibr" rid="ref-43">43</xref>] as the baseline for visualization because it represents the foundational &#x201C;Box-Aware&#x201D; paradigm and serves as a standard open-source benchmark for validating recent advancements.</p>
<p>To mitigate the paucity of visual evaluations in existing 3D point cloud tracking literature, we integrate visualization data from <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track [<xref ref-type="bibr" rid="ref-8">8</xref>] and compare it with BAT [<xref ref-type="bibr" rid="ref-43">43</xref>], as illustrated in <xref ref-type="fig" rid="fig-5">Figs. 5</xref> and <xref ref-type="fig" rid="fig-6">6</xref>. These visual comparisons offer mechanistic insights into the representative tracking sequence, thereby facilitating a nuanced understanding of the efficacy of motion modeling and trajectory estimation in 3D object tracking.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Visualization of tracking results.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-5.tif"/>
</fig><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Trajectory analysis: long-term consistency and per-frame position error.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76652-fig-6.tif"/>
</fig>
<p>In Scene 0020 (Frame 55), the notable position error of <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track (0.077 m vs. BAT&#x2019;s 0.870 m in single-frame localization) exposes limitations in its motion modeling under fine-grained spatial constraints. The discrepancy between <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track&#x2019;s bounding box and the ground truth across multi-view projections (e.g., XY, XZ views) indicates that its motion-driven localization strategy grapples with precise spatial alignment in dense point cloud clusters. Conversely, BAT exhibits superior single-frame localization accuracy (0.870 m error) and trajectory consistency, attributed to its effective fusion of geometric features and motion priors&#x2014;underscoring the pivotal role of multi-modal information integration in precise tracking. Notably, BAT&#x2019;s consistent alignment with the ground truth in both single-frame and trajectory analyses corroborates that fine-grained feature matching, coupled with robust motion regularization, constitutes an effective approach to enhancing tracking precision and temporal stability. This design aligns with methodologies emphasizing spatiotemporal coherence and multi-view geometric consistency, collectively indicating an emerging trend of integrating precise spatial alignment with temporal continuity in 3D tracking.</p>
<p>In the trajectory analysis of Scene 0020, <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track&#x2019;s substantial mean position error (8.086 m) and trajectory divergence highlight the inadequacy of its motion-centric modeling in sustaining long-term tracking consistency under complex spatial dynamics. This shortcoming stands in contrast to BAT&#x2019;s stable trajectory (mean error 0.460 m), which benefits from adaptive motion correction and multi-frame feature aggregation&#x2014;mechanisms that bolster robustness by explicitly modeling spatiotemporal dependencies. Furthermore, BAT&#x2019;s precision in both single-frame localization and trajectory estimation aligns with a convergent technical trajectory of methodologies emphasizing the tight integration of spatial geometry and temporal motion, collectively propelling tracking models toward spatiotemporal co-optimization.</p>
<p>These visual findings not only validate the efficacy of quantitative metrics but also unveil two core trends in contemporary 3D tracking: a paradigm shift from the pursuit of single-frame precision to the assurance of long-term trajectory consistency, and a methodological evolution from motion-only or feature-only modeling to integrated spatiotemporal-feature fusion. As paradigmatic examples, BAT&#x2019;s multi-view alignment mechanism and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msup><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>-Track&#x2019;s motion modeling strategy demonstrate that precise spatial alignment and robust motion regularization are pivotal strategies for addressing long-term trajectory divergence and fine-grained localization errors in 3D tracking.</p>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Discussion</title>
<sec id="s6_1">
<label>6.1</label>
<title>Technological Evolution</title>
<p>The field has witnessed a paradigm shift from 2D-inspired Similarity Matching to 3D Motion Modeling, and recently to Sequence-based learning. This evolution reflects a growing capability to handle the intrinsic disorder and sparsity of point clouds, moving from simple texture-less matching to sophisticated dynamic state estimation.</p>
</sec>
<sec id="s6_2">
<label>6.2</label>
<title>Limitations and Open Challenges</title>
<p>3D SOT has advanced substantially, yet critical challenges still limit its robustness in real-world scenarios. Extreme point cloud sparsity remains a primary bottleneck: long-range targets (e.g., pedestrians 100 m away represented by only 20&#x2013;50 points) or sensor noise in adverse weather (a 60% density reduction in rain or fog) degrade feature stability, leading to a 30% higher drift rate in state-of-the-art methods than in dense scenarios. This sparsity exacerbates mismatches in similarity matching frameworks and propagates errors in motion modeling pipelines.</p>
<p>Long-term occlusion further disrupts tracking continuity. Motion-centric methods that rely on local inter-frame cues fail to capture non-rigid dynamics or complex trajectories, while similarity-based approaches lose discriminative features, resulting in irreversible drift. Non-rigid deformation of targets (e.g., pedestrian limb movement) also challenges rigid-body assumptions, and part-level models still lack the dense correspondences needed to capture fine-grained changes.</p>
<p>The accuracy-efficiency trade-off remains unresolved. Transformer-based architectures achieve superior performance but incur quadratic computational complexity with respect to point count, hindering real-time deployment on edge devices. Lightweight alternatives optimize speed but sacrifice precision in sparse scenes. Additionally, limited generalization across diverse scenarios (urban vs. off-road, varying sensor quality) persists, as models overfit to dataset-specific patterns rather than generalizable geometric or motion priors.</p>
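<p>As a rough illustration of the quadratic scaling of Transformer attention noted above (a back-of-the-envelope estimate for standard dense self-attention, not a measurement of any surveyed tracker; the symbols <inline-formula><mml:math><mml:mi>N</mml:mi></mml:math></inline-formula> for the number of points and <inline-formula><mml:math><mml:mi>d</mml:mi></mml:math></inline-formula> for the feature dimension are introduced here for illustration), attending over <inline-formula><mml:math><mml:mi>N</mml:mi></mml:math></inline-formula> points costs on the order of <inline-formula><mml:math><mml:mi>O</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>N</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> operations and requires an <inline-formula><mml:math><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>N</mml:mi></mml:math></inline-formula> attention map. Doubling the input from 1024 to 2048 points therefore raises the number of pairwise interactions from 1,048,576 to 4,194,304, a fourfold increase in cost for a twofold increase in point count.</p>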
</sec>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>In this survey, we have presented a systematic and comprehensive review of 3D single object tracking in point clouds. We established a unified taxonomy covering feature extraction backbones, decision mechanisms, and model update strategies, and quantitatively evaluated state-of-the-art methods across the KITTI and nuScenes benchmarks. Our analysis reveals that while significant progress has been made in handling unstructured data, the field still faces bottlenecks in extreme sparsity and long-term robustness. Drawing from these insights, we conclude by outlining key directions for the next generation of tracking systems.</p>
<p>Future research should prioritize multimodal fusion to mitigate sparsity, lightweight architectures for edge deployment, and physics-informed motion modeling. Furthermore, the development of trustworthy and explainable tracking systems is imperative. Current models often act as black boxes, limiting their adoption in safety-critical domains. Future work should integrate counterfactual reasoning to diagnose tracking failures and drift by analyzing how minimal perturbations in point clouds affect tracking outcomes. As demonstrated in recent studies on static perception [<xref ref-type="bibr" rid="ref-77">77</xref>], explainable perturbation-based analysis can reveal model sensitivities, and extending this to dynamic tracking is a promising avenue for expert-guided reliability. These advancements will accelerate 3D SOT deployment in autonomous driving, robotics, and intelligent surveillance, enabling more reliable environmental perception in dynamic real-world scenarios.</p>
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the National Natural Science Foundation of China (Nos. 62306049, 92471207 and W2421089), the General Program of Chongqing Natural Science Foundation (No. CSTB2023NSCQ-MSX0665), and the Fundamental Research Funds for the Central Universities (No. 2024CDJXY008).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization, Bo Huang and Yihao Kuang; methodology, Yihao Kuang and Hong Zhang; writing&#x2014;original draft preparation, Yihao Kuang, Jiaqi Wang and Lingyu Jin; writing&#x2014;review and editing, Yihao Kuang and Bo Huang. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<glossary content-type="abbreviations" id="glossary-1">
<title>Table of Abbreviations</title>
<def-list>
<term-head>Abbreviation</term-head>
<def-head>Full Term</def-head>
<def-item>
<term>SOT</term>
<def>
<p>Single Object Tracking</p>
</def>
</def-item>
<def-item>
<term>BEV</term>
<def>
<p>Bird&#x2019;s Eye View</p>
</def>
</def-item>
<def-item>
<term>SSM</term>
<def>
<p>State Space Model</p>
</def>
</def-item>
<def-item>
<term>ATG</term>
<def>
<p>Adaptive Template Generation</p>
</def>
</def-item>
<def-item>
<term>CFA</term>
<def>
<p>Cross-Frame Aggregation</p>
</def>
</def-item>
<def-item>
<term>STFA</term>
<def>
<p>Spatio-Temporal Feature Aggregation</p>
</def>
</def-item>
<def-item>
<term>BFE</term>
<def>
<p>Bounding Box-Aware Feature Encoding</p>
</def>
</def-item>
<def-item>
<term>IoU/CIoU</term>
<def>
<p>Intersection over Union/Center IoU</p>
</def>
</def-item>
<def-item>
<term>FPS</term>
<def>
<p>Frames Per Second</p>
</def>
</def-item>
<def-item>
<term>RPN</term>
<def>
<p>Region Proposal Network</p>
</def>
</def-item>
<def-item>
<term>MLP</term>
<def>
<p>Multi-Layer Perceptron</p>
</def>
</def-item>
<def-item>
<term>AUC</term>
<def>
<p>Area Under the Curve</p>
</def>
</def-item>
<def-item>
<term>SOTA</term>
<def>
<p>State-of-the-Art</p>
</def>
</def-item>
</def-list>
</glossary>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hinterstoisser</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lepetit</surname> <given-names>V</given-names></string-name>, <string-name><surname>Ilic</surname> <given-names>S</given-names></string-name>, <string-name><surname>Holzer</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bradski</surname> <given-names>G</given-names></string-name>, <string-name><surname>Konolige</surname> <given-names>K</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes</article-title>. In: <conf-name>Asian Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2012</year>. p. <fpage>548</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-642-37331-2_42</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Henriques</surname> <given-names>JF</given-names></string-name>, <string-name><surname>Caseiro</surname> <given-names>R</given-names></string-name>, <string-name><surname>Martins</surname> <given-names>P</given-names></string-name>, <string-name><surname>Batista</surname> <given-names>J</given-names></string-name></person-group>. <article-title>High-speed tracking with kernelized correlation filters</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2014</year>;<volume>37</volume>(<issue>3</issue>):<fpage>583</fpage>&#x2013;<lpage>96</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2014.2345390</pub-id>; <pub-id pub-id-type="pmid">26353263</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kalman</surname> <given-names>RE</given-names></string-name></person-group>. <article-title>A new approach to linear filtering and prediction problems</article-title>. <source>J Basic Eng</source>. <year>1960</year>;<volume>82</volume>(<issue>1</issue>):<fpage>35</fpage>&#x2013;<lpage>45</lpage>. doi:<pub-id pub-id-type="doi">10.1115/1.3662552</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yin</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Krahenbuhl</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Center-based 3D object detection and tracking</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>11784</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01161</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qi</surname> <given-names>CR</given-names></string-name>, <string-name><surname>Litany</surname> <given-names>O</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Guibas</surname> <given-names>LJ</given-names></string-name></person-group>. <article-title>Deep Hough voting for 3D object detection in point clouds</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>9277</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2019.00937</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Nie</surname> <given-names>J</given-names></string-name>, <string-name><surname>He</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>X</given-names></string-name></person-group>. <article-title>VoxelTrack: exploring voxel representation for 3D point cloud object tracking</article-title>. <comment>arXiv:2408.02263. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2408.02263</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Surface reconstruction from point clouds: a survey and a benchmark</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2024</year>;<volume>46</volume>(<issue>10</issue>):<fpage>6762</fpage>&#x2013;<lpage>83</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2024.3429209</pub-id>; <pub-id pub-id-type="pmid">39012756</pub-id></mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Beyond 3D siamese tracking: a motion-centric paradigm for 3D single object tracking in point clouds</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2022</year>. p. <fpage>8111</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2203.01730</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ku</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mozifian</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Harakeh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Waslander</surname> <given-names>SL</given-names></string-name></person-group>. <article-title>Joint 3D proposal generation and object detection from view aggregation</article-title>. In: <conf-name>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2018</year>. p. <fpage>1</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/IROS.2018.8594049</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lyu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Spatio-temporal contextual learning for single object tracking on point clouds</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2024</year>;<volume>35</volume>(<issue>4</issue>):<fpage>4754</fpage>&#x2013;<lpage>66</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2022.3233562</pub-id>; <pub-id pub-id-type="pmid">37018572</pub-id></mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>PTTR: relational 3D point cloud object tracking with transformer</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2022</year>. p. <fpage>8531</fpage>&#x2013;<lpage>40</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2112.02857</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Gu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>YOLOO: you only learn from others once</article-title>. <comment>arXiv:2409.00618. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2409.00618</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Hoda&#x0148;</surname> <given-names>T</given-names></string-name>, <string-name><surname>Matas</surname> <given-names>J</given-names></string-name>, <string-name><surname>Obdr&#x017E;&#x00E1;lek</surname> <given-names>&#x0160;</given-names></string-name></person-group>. <chapter-title>On evaluation of 6D object pose estimation</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Hua</surname> <given-names>G</given-names></string-name>, <string-name><surname>J&#x00E9;gou</surname> <given-names>H</given-names></string-name></person-group>, editors. <source>Computer Vision &#x2013; ECCV 2016 Workshops</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2016</year>. p. <fpage>606</fpage>&#x2013;<lpage>19</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-49409-8_52</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Geiger</surname> <given-names>A</given-names></string-name>, <string-name><surname>Lenz</surname> <given-names>P</given-names></string-name>, <string-name><surname>Urtasun</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Are we ready for autonomous driving? The KITTI vision benchmark suite</article-title>. In: <conf-name>2012 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2012</year>. p. <fpage>3354</fpage>&#x2013;<lpage>61</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2012.6248074</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Luo</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yuille</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Exploring simple 3D multi-object tracking for autonomous driving</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>10488</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01034</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Asvadi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Girao</surname> <given-names>P</given-names></string-name>, <string-name><surname>Peixoto</surname> <given-names>P</given-names></string-name>, <string-name><surname>Nunes</surname> <given-names>U</given-names></string-name></person-group>. <article-title>3D object tracking using RGB and LiDAR data</article-title>. In: <conf-name>2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>1255</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ITSC.2016.7795718</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Song</surname> <given-names>S</given-names></string-name>, <string-name><surname>Khosla</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>3D ShapeNets: a deep representation for volumetric shapes</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2015</year>. p. <fpage>1912</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2015.7298801</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Bennamoun</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Deep learning for 3D point clouds: a survey</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2020</year>;<volume>43</volume>(<issue>12</issue>):<fpage>4338</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2020.3005434</pub-id>; <pub-id pub-id-type="pmid">32750799</pub-id></mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Maturana</surname> <given-names>D</given-names></string-name>, <string-name><surname>Scherer</surname> <given-names>S</given-names></string-name></person-group>. <article-title>VoxNet: a 3D convolutional neural network for real-time object recognition</article-title>. In: <conf-name>2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2015</year>. p. <fpage>922</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/IROS.2015.7353481</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qi</surname> <given-names>CR</given-names></string-name>, <string-name><surname>Su</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mo</surname> <given-names>K</given-names></string-name>, <string-name><surname>Guibas</surname> <given-names>LJ</given-names></string-name></person-group>. <article-title>PointNet: deep learning on point sets for 3D classification and segmentation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>652</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.16</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Qi</surname> <given-names>CR</given-names></string-name>, <string-name><surname>Yi</surname> <given-names>L</given-names></string-name>, <string-name><surname>Su</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guibas</surname> <given-names>LJ</given-names></string-name></person-group>. <article-title>PointNet&#x002B;&#x002B;: deep hierarchical feature learning on point sets in a metric space</article-title>. <comment>arXiv:1706.02413. 2017</comment>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Di</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name></person-group>. <article-title>PointCNN: convolution on X-transformed points</article-title>. <comment>arXiv:1801.07791. 2018</comment>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xiang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Relation-shape convolutional neural network for point cloud analysis</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>8895</fpage>&#x2013;<lpage>904</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00910</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>CW</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>PointWeb: enhancing local neighborhood features for point cloud processing</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>5565</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00571</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>D</given-names></string-name></person-group>. <article-title>FoldingNet: point cloud auto-encoder via deep grid deformation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2018</year>. p. <fpage>206</fpage>&#x2013;<lpage>15</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00029</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>F</given-names></string-name></person-group>. <article-title>PointConv: deep convolutional networks on 3D point clouds</article-title>. In: <conf-name>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>9613</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00985</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name></person-group>. <article-title>PAConv: position adaptive convolution with dynamic kernel assembling on point clouds</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>3173</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2103.14635</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sarma</surname> <given-names>SE</given-names></string-name>, <string-name><surname>Bronstein</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Solomon</surname> <given-names>JM</given-names></string-name></person-group>. <article-title>Dynamic graph CNN for learning on point clouds</article-title>. <source>ACM Trans Graph</source>. <year>2019</year>;<volume>38</volume>(<issue>5</issue>):<fpage>1</fpage>&#x2013;<lpage>12</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3326362</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jiang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Semi-supervised learning with graph learning-convolutional networks</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>11313</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.01157</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>T</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>L</given-names></string-name>, <string-name><surname>Qiao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>SpiderCNN: deep learning on point sets with parameterized convolutional filters</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2018</year>. p. <fpage>87</fpage>&#x2013;<lpage>102</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-01237-3_6</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>W</given-names></string-name>, <string-name><surname>Rajkumar</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Point-GNN: graph neural network for 3D object detection in a point cloud</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2020</year>. p. <fpage>1711</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00178</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Rao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Point-BERT: pre-training 3D point cloud transformers with masked point modeling</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2022</year>. p. <fpage>19313</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2111.14819</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al</etal></person-group>. <chapter-title>Attention is all you need</chapter-title>. In: <source>Advances in neural information processing systems</source>. Vol. 30. <publisher-loc>Red Hook, NY, USA</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>; <year>2017</year>. p. <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>. doi:<pub-id pub-id-type="doi">10.65215/ctdc8e75</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>MH</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>JX</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>ZN</given-names></string-name>, <string-name><surname>Mu</surname> <given-names>TJ</given-names></string-name>, <string-name><surname>Martin</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>SM</given-names></string-name></person-group>. <article-title>PCT: point cloud transformer</article-title>. <source>Comput Vis Media</source>. <year>2021</year>;<volume>7</volume>(<issue>2</issue>):<fpage>187</fpage>&#x2013;<lpage>99</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s41095-021-0229-5</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>YQ</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>YX</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>PS</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Swin3D: a pretrained transformer backbone for 3D indoor scene understanding</article-title>. <source>Comput Vis Media</source>. <year>2025</year>;<volume>11</volume>(<issue>1</issue>):<fpage>83</fpage>&#x2013;<lpage>101</lpage>. doi:<pub-id pub-id-type="doi">10.26599/CVM.2025.9450383</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name>, <string-name><surname>Torr</surname> <given-names>PH</given-names></string-name>, <string-name><surname>Koltun</surname> <given-names>V</given-names></string-name></person-group>. <article-title>Point transformer</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>16259</fpage>&#x2013;<lpage>68</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01595</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Han</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>YS</given-names></string-name>, <string-name><surname>Zwicker</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Point2Sequence: learning the shape representation of 3D point clouds with an attention-based sequence to sequence network</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2019</year>. p. <fpage>8778</fpage>&#x2013;<lpage>85</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v33i01.33018778</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zou</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ye</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>PointMamba: a simple state space model for point cloud analysis</article-title>. <comment>arXiv:2402.10739. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2402.10739</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>An</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>X</given-names></string-name></person-group>. <article-title>SMamba: sparse mamba for event-based object detection</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2025</year>. p. <fpage>9229</fpage>&#x2013;<lpage>37</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Bertinetto</surname> <given-names>L</given-names></string-name>, <string-name><surname>Valmadre</surname> <given-names>J</given-names></string-name>, <string-name><surname>Henriques</surname> <given-names>JF</given-names></string-name>, <string-name><surname>Vedaldi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Torr</surname> <given-names>PH</given-names></string-name></person-group>. <article-title>Fully-convolutional siamese networks for object tracking</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2016</year>. p. <fpage>850</fpage>&#x2013;<lpage>65</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-48881-3_56</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Giancola</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zarzar</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ghanem</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Leveraging shape completion for 3D siamese tracking</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>1359</fpage>&#x2013;<lpage>68</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00145</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Qi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>F</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>P2B: point-to-box network for 3D object tracking in point clouds</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2020</year>. p. <fpage>6329</fpage>&#x2013;<lpage>38</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00636</pub-id>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Box-aware feature enhancement for single object tracking on point clouds</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>13199</fpage>&#x2013;<lpage>208</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2108.04728</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>MMF-Track: multi-modal multi-level fusion for 3D single object tracking</article-title>. <source>IEEE Trans Intell Veh</source>. <year>2023</year>;<volume>9</volume>(<issue>1</issue>):<fpage>1817</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tiv.2023.3326790</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nie</surname> <given-names>J</given-names></string-name>, <string-name><surname>He</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>GLT-T: global-local transformer voting for 3D single object tracking in point clouds</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2023</year>. p. <fpage>1957</fpage>&#x2013;<lpage>65</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v37i2.25287</pub-id>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Scherer</surname> <given-names>S</given-names></string-name></person-group>. <article-title>3D-SiamRPN: an end-to-end learning method for real-time 3D single object tracking using raw point cloud</article-title>. <source>IEEE Sens J</source>. <year>2020</year>;<volume>21</volume>(<issue>4</issue>):<fpage>4995</fpage>&#x2013;<lpage>5011</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JSEN.2020.3033034</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>B</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>High performance visual tracking with siamese region proposal network</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2018</year>. p. <fpage>8971</fpage>&#x2013;<lpage>80</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00935</pub-id>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hui</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>3D siamese voxel-to-BEV tracker for sparse point clouds</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2021</year>;<volume>34</volume>:<fpage>28714</fpage>&#x2013;<lpage>27</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2111.04426</pub-id>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>M</given-names></string-name>, <string-name><surname>Seong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Graph-based point tracker for 3D object tracking in point clouds</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2022</year>. p. <fpage>2053</fpage>&#x2013;<lpage>61</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v36i2.20101</pub-id>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Motion-to-matching: a mixed paradigm for 3D single object tracking</article-title>. <source>IEEE Robot Autom Lett</source>. <year>2023</year>;<volume>9</volume>(<issue>2</issue>):<fpage>1468</fpage>&#x2013;<lpage>75</lpage>. doi:<pub-id pub-id-type="doi">10.1109/LRA.2023.3347143</pub-id>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zha</surname> <given-names>ZJ</given-names></string-name></person-group>. <article-title>BEVTrack: a simple and strong baseline for 3D single object tracking in bird&#x2019;s-eye view</article-title>. In: <conf-name>Proceedings of the 31st ACM International Conference on Multimedia</conf-name>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>ACM</publisher-name>; <year>2023</year>. p. <fpage>1819</fpage>&#x2013;<lpage>28</lpage>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xia</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Chan</surname> <given-names>AB</given-names></string-name>, <string-name><surname>Stilla</surname> <given-names>U</given-names></string-name></person-group>. <article-title>A lightweight and detector-free 3D single object tracker on point clouds</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2023</year>;<volume>24</volume>(<issue>5</issue>):<fpage>5543</fpage>&#x2013;<lpage>54</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2203.04232</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chae</surname> <given-names>DK</given-names></string-name>, <string-name><surname>He</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>P2P: part-to-part motion cues guide a strong tracking framework for LiDAR point clouds</article-title>. <source>Int J Comput Vis</source>. <year>2025</year>;<volume>133</volume>(<issue>8</issue>):<fpage>5326</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-025-02430-6</pub-id>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Dong</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>SiamMo: siamese motion-centric 3D object tracking</article-title>. <comment>arXiv:2408.01688. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2408.01688</pub-id>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Exploring point-BEV fusion for 3D point cloud object tracking with transformer</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2024</year>;<volume>46</volume>(<issue>9</issue>):<fpage>5921</fpage>&#x2013;<lpage>35</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2024.3373693</pub-id>; <pub-id pub-id-type="pmid">38442046</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>TX</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>YC</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>YK</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>SH</given-names></string-name></person-group>. <article-title>CXTrack: improving 3D point cloud tracking with contextual information</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2023</year>. p. <fpage>1084</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00111</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Synchronize feature extracting and matching: a single branch framework for 3D object tracking</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2023</year>. p. <fpage>9953</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.1109/iccv51070.2023.00913</pub-id>.</mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Su</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guibas</surname> <given-names>LJ</given-names></string-name></person-group>. <article-title>A point set generation network for 3D object reconstruction from a single image</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>605</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.264</pub-id>.</mixed-citation></ref>
<ref id="ref-59"><label>[59]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>CJ</given-names></string-name></person-group>. <article-title>MVCTrack: boosting 3D point cloud tracking via multimodal-guided virtual cues</article-title>. <comment>arXiv:2412.02734. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2412.02734</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>[60]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chan</surname> <given-names>AB</given-names></string-name></person-group>. <article-title>Boosting 3D single object tracking with 2D matching distillation and 3D pre-training</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2024</year>. p. <fpage>270</fpage>&#x2013;<lpage>88</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-73254-6_16</pub-id>.</mixed-citation></ref>
<ref id="ref-61"><label>[61]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pei</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Robust 3D tracking with quality-aware shape completion</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2024</year>. p. <fpage>7160</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i7.28544</pub-id>.</mixed-citation></ref>
<ref id="ref-62"><label>[62]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Mirza</surname> <given-names>M</given-names></string-name>, <string-name><surname>Osindero</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Conditional generative adversarial nets</article-title>. <comment>arXiv:1411.1784. 2014</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.1411.1784</pub-id>.</mixed-citation></ref>
<ref id="ref-63"><label>[63]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>PillarTrack: redesigning pillar-based transformer network for single object tracking on point clouds</article-title>. <comment>arXiv:2404.07495. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.2404.07495</pub-id>.</mixed-citation></ref>
<ref id="ref-64"><label>[64]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Temporal-aware siamese tracker: integrate temporal context for 3D object tracking</article-title>. In: <conf-name>Proceedings of the Asian Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>399</fpage>&#x2013;<lpage>414</lpage>.</mixed-citation></ref>
<ref id="ref-65"><label>[65]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>TX</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>YC</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>YK</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>SH</given-names></string-name></person-group>. <article-title>MBPTrack: improving 3D point cloud tracking with memory networks and box priors</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2023</year>. p. <fpage>9911</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV51070.2023.00909</pub-id>.</mixed-citation></ref>
<ref id="ref-66"><label>[66]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>SeqTrack3D: exploring sequence information for robust 3D point cloud tracking</article-title>. In: <conf-name>2024 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2024</year>. p. <fpage>6959</fpage>&#x2013;<lpage>65</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICRA57147.2024.10611238</pub-id>.</mixed-citation></ref>
<ref id="ref-67"><label>[67]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Luo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>L</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Modeling continuous motion for 3D point cloud object tracking</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2024</year>. p. <fpage>4026</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2303.07605</pub-id>.</mixed-citation></ref>
<ref id="ref-68"><label>[68]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xi</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Spatio-temporal bi-directional cross-frame memory for distractor filtering point cloud single object tracking</article-title>. <comment>arXiv:2403.15831. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2403.15831</pub-id>.</mixed-citation></ref>
<ref id="ref-69"><label>[69]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>M</given-names></string-name>, <string-name><surname>Miao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>M3SOT: multi-frame, multi-field, multi-space 3D single object tracking</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2024</year>. p. <fpage>3630</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i4.28152</pub-id>.</mixed-citation></ref>
<ref id="ref-70"><label>[70]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>K</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>DeepPCT: single object tracking in dynamic point cloud sequences</article-title>. <source>IEEE Trans Instrum Meas</source>. <year>2022</year>;<volume>72</volume>:<fpage>1</fpage>&#x2013;<lpage>12</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIM.2022.3232092</pub-id>.</mixed-citation></ref>
<ref id="ref-71"><label>[71]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name></person-group>. <article-title>3D object tracking with transformer</article-title>. <comment>arXiv:2110.14921. 2021</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2110.14921</pub-id>.</mixed-citation></ref>
<ref id="ref-72"><label>[72]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hui</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>3D siamese transformer network for single object tracking on point clouds</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>293</fpage>&#x2013;<lpage>310</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2207.11995</pub-id>.</mixed-citation></ref>
<ref id="ref-73"><label>[73]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Correlation pyramid network for 3D single object tracking</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2023</year>. p. <fpage>3216</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2305.09195</pub-id>.</mixed-citation></ref>
<ref id="ref-74"><label>[74]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>PTT: point-track-transformer module for 3D single object tracking in point clouds</article-title>. In: <conf-name>2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>1310</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2108.06455</pub-id>.</mixed-citation></ref>
<ref id="ref-75"><label>[75]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Online object tracking: a benchmark</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2013</year>. p. <fpage>2411</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2013.312</pub-id>.</mixed-citation></ref>
<ref id="ref-76"><label>[76]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kristan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Matas</surname> <given-names>J</given-names></string-name>, <string-name><surname>Leonardis</surname> <given-names>A</given-names></string-name>, <string-name><surname>Voj&#x00ED;&#x0159;</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pflugfelder</surname> <given-names>R</given-names></string-name>, <string-name><surname>Fernandez</surname> <given-names>G</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A novel performance evaluation methodology for single-target trackers</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2016</year>;<volume>38</volume>(<issue>11</issue>):<fpage>2137</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TPAMI.2016.2516982</pub-id>; <pub-id pub-id-type="pmid">26766217</pub-id>.</mixed-citation></ref>
<ref id="ref-77"><label>[77]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Holzinger</surname> <given-names>A</given-names></string-name>, <string-name><surname>Luka&#x010D;</surname> <given-names>N</given-names></string-name>, <string-name><surname>Rozajac</surname> <given-names>D</given-names></string-name>, <string-name><surname>Johnston</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kocic</surname> <given-names>V</given-names></string-name>, <string-name><surname>Hoerl</surname> <given-names>B</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Enhancing trust in automated 3D point cloud data interpretation through explainable counterfactuals</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>119</volume>(<issue>4</issue>):<fpage>103032</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2025.103032</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>