<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">80595</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2026.080595</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection</article-title>
<alt-title alt-title-type="left-running-head">DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection</alt-title>
<alt-title alt-title-type="right-running-head">DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Jiao</surname><given-names>Tianzhe</given-names></name></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Yuming</given-names></name></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Feng</surname><given-names>Xiaoyue</given-names></name></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Guo</surname><given-names>Chaopeng</given-names></name></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Song</surname><given-names>Jie</given-names></name><email>songjie@mail.neu.edu.cn</email></contrib>
<aff id="aff-1"><institution>Software College, Northeastern University</institution>, <addr-line>Shenyang</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jie Song. Email: <email>songjie@mail.neu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>27</day><month>4</month><year>2026</year>
</pub-date>
<volume>147</volume>
<issue>1</issue>
<elocation-id>1</elocation-id>
<history>
<date date-type="received">
<day>12</day>
<month>02</month>
<year>2026</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>03</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors. Published by Tech Science Press.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>The Authors</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_80595.pdf"></self-uri>
<abstract>
<p>Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in the cross-modal feature space and noisy teacher signals. To address this issue, we propose a novel distribution-aware cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher&#x2019;s Bird&#x2019;s-Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior to guide the student features toward high-density and geometrically stable regions in the teacher&#x2019;s BEV feature space. This ensures feature alignment in BEV space by constraining the student model&#x2019;s feature distribution to match that of the LiDAR teacher model within foreground regions. Next, we further introduce response-level distillation to directly transfer the teacher&#x2019;s prediction behavior to the student detection head, providing direct output-space supervision that complements feature distillation and effectively reduces modality-induced ambiguity, leading to more accurate and stable classification confidence and bounding-box regression. Furthermore, we perform temporal modeling on the distilled cross-modal features to produce fused BEV representations that capture more comprehensive scene context. Finally, we utilize the fused BEV features to generate 3D detection results. Through experiments, we validate the effectiveness and superiority of DA-T3D on the nuScenes dataset, achieving 46.7% mAP and 58.1% NDS.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>3D object detection</kwd>
<kwd>Bird&#x2019;s-Eye-View perception</kwd>
<kwd>cross-modal knowledge distillation</kwd>
<kwd>Dirichlet process Gaussian mixture model</kwd>
<kwd>temporal modeling</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62302086</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>3D object detection based on multi-view cameras is a fundamental yet challenging task in autonomous driving [<xref ref-type="bibr" rid="ref-1">1</xref>]. In practice, detection accuracy directly affects a vehicle&#x2019;s ability to perceive its surroundings and make safe driving decisions. However, compared with LiDAR-based methods, camera-only methods often suffer from ambiguous depth estimation and are more sensitive to illumination variations and occlusions, which typically result in degraded 3D localization accuracy and limited robustness. To narrow the performance gap with LiDAR-based methods, researchers have increasingly explored cross-modal knowledge distillation in recent years. Specifically, cross-modal distillation transfers geometric priors from complementary modalities such as LiDAR to a camera-based student, providing reliable 3D structural cues to improve 3D detection performance [<xref ref-type="bibr" rid="ref-2">2</xref>]. However, the inherent data heterogeneity between LiDAR point clouds and camera images poses challenges for effective cross-modal distillation.</p>
<p>To alleviate the distillation challenges caused by the modality gap between LiDAR and cameras, existing methods typically map data from both modalities into a unified feature space to facilitate feature imitation [<xref ref-type="bibr" rid="ref-3">3</xref>]. Some studies project LiDAR points onto the image plane and perform distillation in 2D space [<xref ref-type="bibr" rid="ref-4">4</xref>]. However, such cross-modal transformations often lead to the loss of intrinsic features of the original data, which limits the student model&#x2019;s ability to learn effective information from the teacher. Consequently, another mainstream method maps both modalities into a unified BEV space [<xref ref-type="bibr" rid="ref-5">5</xref>], enabling the student model to align features with the teacher more directly, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. These works commonly adopt point-wise aligned distillation, which allows fine-grained matching between BEV features from the two modalities. Nevertheless, background regions in BEV space often contain substantial task-irrelevant noise, which can divert the distillation process toward redundant background features and reduce the efficiency of learning key foreground features. To address this issue, Chen et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] proposed a foreground-aware distillation method that has been widely adopted. By focusing knowledge transfer on foreground target regions in the scene, it enhances the model&#x2019;s ability to extract and transfer important features.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Cross-modal knowledge distillation frameworks.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-1.tif"/>
</fig>
<p>Despite the promising progress of existing cross-modal distillation methods, the domain gap across modalities persists due to differences in imaging mechanisms and spatial resolution. In this context, adopting a point-wise aligned distillation scheme that enforces exact consistency between the BEV features of the two modalities may lead to noise amplification and overly restrictive constraints, thereby degrading the model&#x2019;s detection performance. Moreover, distillation typically depends on high-quality supervisory signals from the teacher model. However, the teacher&#x2019;s features may themselves contain noise and bias, for example, due to false positives, missed detections, or feature jitter. Such noise can be directly transferred to the student during distillation, leading to unstable supervision and reduced distillation effectiveness. Therefore, cross-modal knowledge distillation faces two core challenges: (1) due to inherent modality heterogeneity, simple point-to-point distillation is suboptimal, and (2) the LiDAR teacher&#x2019;s features can be noisy, so naive imitation may introduce erroneous supervision.</p>
<p>In this work, we propose a novel distribution-aware cross-modal distillation framework that addresses both challenges through distribution-level knowledge transfer. Specifically, our method first models class-conditional feature distributions of the LiDAR teacher&#x2019;s BEV features. Then, using a distribution-consistency constraint, we encourage the student features to fall into the teacher&#x2019;s high-density and geometrically stable regions, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1c</xref>. By aligning features at the distribution level, this method narrows the BEV representation gap between the two modalities. Meanwhile, the modeling process naturally suppresses the small number of outlier and noisy teacher features: distribution-level distillation pulls the student toward aggregated mode centers rather than individual noisy instances, thereby mitigating the adverse effects of teacher noise. In addition, to reduce interference from factors such as target occlusion and motion blur, we further apply lightweight temporal modeling to the distilled BEV features, improving training stability. The main contributions of this paper are as follows:<list list-type="simple">
<list-item>
<label>1.</label>
<p>We propose a novel distribution-aware cross-modal distillation framework (DA-T3D) for 3D object detection, which enables distribution-level knowledge transfer from a LiDAR teacher to a camera-based student. In addition, we introduce response-level distillation to convey task-specific decision knowledge, further improving detection performance.</p></list-item>
<list-item>
<label>2.</label>
<p>We propose a lightweight temporal fusion module that fuses features from two consecutive frames and introduces a gating mechanism to adaptively balance the contributions of the current and historical frames.</p></list-item>
<list-item>
<label>3.</label>
<p>Through extensive experiments and ablation studies on the nuScenes benchmark, our framework demonstrates outstanding performance in 3D object detection. Our best model achieves 46.7% mAP and 58.1% NDS on nuScenes.</p></list-item>
</list></p>
<p>The remainder of this paper is organized as follows: <xref ref-type="sec" rid="s2">Section 2</xref> briefly reviews the related work. <xref ref-type="sec" rid="s3">Section 3</xref> introduces our proposed solutions in detail. Experimental settings and results, along with comparisons to baseline methods, are presented in <xref ref-type="sec" rid="s4">Section 4</xref> to validate the effectiveness of our approach. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> presents the conclusion of this paper, summarizing the key contributions and discussing potential future directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Multi-View 3D Object Detection</title>
<p>Multi-view 3D object detection aims to leverage surround-view camera images to align and fuse multi-view 2D features into a unified 3D space or bird&#x2019;s-eye-view (BEV) representation, thereby enabling 3D object localization and attribute regression. Existing methods mainly follow two paradigms: (1) explicitly constructing a dense BEV representation and then performing detection; and (2) adopting query-based or sparse 3D representations, where 3D queries directly aggregate information from multi-view features to regress 3D bounding boxes [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<p>For explicit BEV construction, early studies achieved view transformation and feature fusion by predicting pixel-wise depth distributions (e.g., LSS [<xref ref-type="bibr" rid="ref-8">8</xref>]). Subsequent works have improved this pipeline along several directions, including depth estimation quality and temporal fusion. For example, BEVDepth introduces depth supervision [<xref ref-type="bibr" rid="ref-9">9</xref>], BEVFormer generates BEV features with spatiotemporal attention [<xref ref-type="bibr" rid="ref-10">10</xref>], and GeoBEV enhances geometric details via more efficient BEV sampling and structure-aware depth supervision [<xref ref-type="bibr" rid="ref-11">11</xref>]. In contrast, to avoid the computational overhead of dense BEV, query-based methods use 3D queries to interact with multi-view features. DETR3D samples features by projecting 3D reference points onto 2D views [<xref ref-type="bibr" rid="ref-12">12</xref>]. PETR and its variants strengthen spatial alignment with 3D positional embeddings [<xref ref-type="bibr" rid="ref-13">13</xref>], and Sparse4D aggregates multi-view and temporal information using 4D keypoints [<xref ref-type="bibr" rid="ref-14">14</xref>]. These methods have continually evolved to balance efficiency and accuracy, collectively advancing vision-only 3D detection. However, multi-view models depend heavily on the quality of depth estimation, lack robustness to complex conditions such as illumination changes and adverse weather, and typically require large amounts of accurately annotated data for supervised learning [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Multi-Modal 3D Object Detection</title>
<p>Multi-modal 3D object detection aims to fuse semantic and geometric information from sensors such as cameras, LiDAR, and radar to improve perception performance in complex scenarios. Existing methods can be categorized by fusion stage as early fusion, feature-level fusion, and late fusion. Mainstream directions include BEV-based unified representations, sparse query&#x2013;based fusion, and unified 3D representations, enabling better cross-modal complementarity [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
<p>Specifically, early fusion injects image semantics directly into point clouds or voxels, as in PointPainting [<xref ref-type="bibr" rid="ref-16">16</xref>] and MVX-Net [<xref ref-type="bibr" rid="ref-17">17</xref>]. However, it is sensitive to calibration errors and point cloud sparsity. Subsequent works such as PPF-Net improve robustness via region-level semantic aggregation [<xref ref-type="bibr" rid="ref-18">18</xref>]. Feature-level fusion maps multi-modal features into a shared BEV space for interaction, with BEVFusion providing a lightweight fusion framework [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. Late fusion performs cross-modal fusion after generating candidate boxes, as in MV3D [<xref ref-type="bibr" rid="ref-21">21</xref>] and CLOCs [<xref ref-type="bibr" rid="ref-22">22</xref>], but the degree of cross-modal interaction is limited. In addition, to improve efficiency and long-range performance, MV2DFusion adopts a sparse query&#x2013;based fusion scheme, using object queries as carriers for cross-modal interaction [<xref ref-type="bibr" rid="ref-23">23</xref>]. To address sensor disparities, unified 3D representation methods such as FGU3R convert images into pseudo point clouds to enable fine-grained fusion [<xref ref-type="bibr" rid="ref-24">24</xref>]. Although multi-modal fusion methods can effectively mitigate inherent limitations of unimodal methods in depth estimation and robustness under adverse weather conditions [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>], they face challenges in deployment cost and computational overhead introduced by multiple sensors.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Cross-Modal Knowledge Distillation for 3D Object Detection</title>
<p>Cross-modal knowledge distillation (CMKD) for 3D object detection aims to use a stronger, information-rich modality (e.g., LiDAR or multimodal fusion) during training to guide a weaker-modality detector (e.g., camera-only or radar-only). In this way, inference can rely solely on low-cost sensors, striking a balance between deployment efficiency and accuracy. Existing studies mainly focus on key issues such as modality representation gaps, spatial alignment, and noise in teacher-generated pseudo labels.</p>
<p>Early works such as MonoDistill [<xref ref-type="bibr" rid="ref-4">4</xref>] distill knowledge by projecting LiDAR features onto the image plane, improving spatial reasoning for monocular 3D detection. BEVDistill [<xref ref-type="bibr" rid="ref-6">6</xref>] and DistillBEV [<xref ref-type="bibr" rid="ref-25">25</xref>] further align image features with LiDAR teacher predictions in BEV space to enhance camera-based BEV detection. UniDistill [<xref ref-type="bibr" rid="ref-26">26</xref>] proposes a generic BEV-oriented CMKD framework that transfers knowledge at multiple levels, including features, predictions, and relations. To alleviate the high cost of 3D annotations, MonoLiG [<xref ref-type="bibr" rid="ref-27">27</xref>] and SCKD [<xref ref-type="bibr" rid="ref-28">28</xref>] combine CMKD with semi-supervised learning, using teacher-generated pseudo labels to train student models and suppressing noisy negative transfer via uncertainty weighting, feature distillation, and related techniques, thus moving CMKD from fully supervised to a semi-supervised training paradigm. In our method, the student is attracted toward dominant modes rather than individual noisy instances. This robustness mechanism is difficult to obtain from moment matching alone, which treats all samples implicitly through aggregated statistics, and it is also less explicit in adversarial alignment, where unstable optimization may itself introduce additional training noise [<xref ref-type="bibr" rid="ref-29">29</xref>]. The effectiveness of cross-modal distillation depends heavily on the teacher model&#x2019;s representational capacity and the accuracy of cross-modal spatial alignment. Calibration errors or large modality discrepancies can easily lead to feature misalignment and negative transfer. To this end, we propose a distribution-level cross-modal distillation method to effectively address the above challenges.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Method</title>
<p>In this section, we propose an innovative distribution-aware cross-modal distillation framework that transfers geometric knowledge from a LiDAR-based teacher model to a multi-view camera student model, improving camera-only 3D object detection. Unlike mainstream point-to-point feature regression for BEV distillation, we model the teacher features with a probabilistic distribution and regularize the student features by enforcing distribution-level consistency. This method alleviates distillation instability caused by cross-modal feature distribution mismatches and noisy teacher signals.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Overall Architecture</title>
<p>As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, we first model the teacher&#x2019;s BEV features within each foreground object region as a probabilistic distribution, and encourage the student features to fall into its high-density regions. This strategy couples the supervision strength with the statistical uncertainty of the teacher features, automatically reweighting different feature dimensions. We impose stronger supervision on more stable feature directions, while appropriately relaxing the constraints on directions that are more variable. In this way, the student progressively aligns with the teacher&#x2019;s BEV feature distribution in an overall statistical sense, effectively narrowing the cross-modality feature gap in BEV space. Moreover, distribution-level distillation tends to pull the student toward the aggregated centers of dominant modes rather than individual noisy instances, thus mitigating the adverse impact of teacher noise without modifying the student architecture. Subsequently, we further introduce response distillation to refine output-level supervision and improve distillation quality. Notably, although distillation methods are effective at extracting and transferring knowledge, they cannot eliminate information loss at the physical level. To address this limitation, we incorporate temporal modeling to compensate for missing observations in the current frame by fusing information from historical frames.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>A cross-modal knowledge distillation framework integrating LiDAR and camera modalities for enhanced BEV object detection.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-2.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Distribution-Aware Cross-Modal Distillation Framework</title>
<p>Previous BEV feature distillation methods [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>] typically use a foreground mask to select target-relevant regions on the BEV plane and perform point-wise alignment between the student and teacher features at these locations. This concentrates the distillation on key spatial positions and reduces interference from background noise. The distillation loss <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> is defined as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mi>H</mml:mi></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mi>W</mml:mi></mml:munderover><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BE;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>H</italic> and <italic>W</italic> are the height and width of the BEV feature map, respectively. <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo></mml:math></inline-formula> is the L2 norm. 
<inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the features at location (<inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>i</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>j</mml:mi></mml:math></inline-formula>) from the teacher and student models, respectively. The foreground mask <italic>M</italic> is generated from the ground-truth heatmap in the BEV space, and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula> denotes the sum of all non-zero elements in the mask <italic>M</italic>. The adaptation module <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>&#x03BE;</mml:mi></mml:math></inline-formula> uses a convolutional layer to match the dimensionality of the student&#x2019;s features to that of the teacher.</p>
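<p>As an illustration, the masked point-wise loss in Eq. (1) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: BEV maps are stored as nested lists in H &#x00D7; W &#x00D7; C order, the hypothetical helper <monospace>make_pointwise_adapter</monospace> stands in for the adaptation module as a bias-free per-cell linear map, and the function names are ours.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math


def make_pointwise_adapter(weight):
    """Return a per-cell linear map (hypothetical stand-in for the adapter xi).

    weight has one row per teacher channel and one column per student channel.
    A bias-free 1x1 convolution applies exactly this map to every BEV cell.
    """
    def adapt(feature):
        return [sum(w * x for w, x in zip(row, feature)) for row in weight]
    return adapt


def masked_feature_loss(teacher, student, mask, adapt):
    """Mask-weighted L2 distance between teacher and adapted student features.

    teacher, student: H x W grids of per-cell feature vectors.
    mask: H x W grid of foreground weights M_ij (0 marks background).
    """
    # N_p: sum of the non-zero mask entries, as in the text.
    n_p = sum(m for row in mask for m in row if m != 0)
    if n_p == 0:
        return 0.0  # no foreground cells, nothing to distill
    total = 0.0
    for i, mask_row in enumerate(mask):
        for j, m in enumerate(mask_row):
            if m == 0:
                continue  # background cells do not contribute
            adapted = adapt(student[i][j])
            squared = sum((t - s) ** 2 for t, s in zip(teacher[i][j], adapted))
            total += m * math.sqrt(squared)
    return total / n_p


# Toy 2 x 2 BEV map: teacher has 2 channels, student has 1 channel.
teacher = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]]]
student = [[[1.0], [9.0]], [[9.0], [0.0]]]
mask = [[1, 0], [0, 1]]
adapter = make_pointwise_adapter([[1.0], [0.0]])
loss = masked_feature_loss(teacher, student, mask, adapter)
```
]]></code>
<p>In this toy case only the two foreground cells contribute: the first matches the teacher exactly and the second is at distance 5, so the loss is 5/2. The large student values in the background cells are ignored, which is the effect the foreground mask is meant to have.</p>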
<p>Although existing methods project both feature maps onto the BEV plane to alleviate cross-view discrepancies, a domain gap remains due to differences in imaging mechanisms and spatial resolution. Moreover, teacher features often contain noise and bias. Directly forcing the student to mimic the teacher&#x2019;s feature maps can weaken the distillation effectiveness. To address this, we employ a Dirichlet Process Gaussian Mixture Model (DPGMM) to model the distribution of the teacher&#x2019;s BEV features, approximating it as a mixture of Gaussian components. DPGMM can adaptively infer the effective number of active components for each class from the data, thereby avoiding per-class manual tuning and providing a more flexible prior for distribution-level distillation. Each component is parameterized by a mean and a covariance matrix, which describe the feature center and its variation across directions. This shifts teacher supervision from point-wise distillation to distribution-level distillation. We then introduce a distribution-consistency constraint to encourage the student features to match the teacher&#x2019;s mixture distribution in a probabilistic manner. Compared with purely point-wise regression, our method provides stronger and more structure-aware supervision. It avoids noise amplification and overly restrictive constraints caused by point-to-point alignment, leading to more robust BEV feature transfer.</p>
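<p>To make the idea of distribution-level supervision concrete, the following sketch scores a feature vector against a Gaussian mixture whose parameters are assumed to be already fitted. Several simplifications are ours and should be read as assumptions: covariances are taken to be diagonal, the negative log-likelihood is used as one plausible consistency measure, and the DPGMM fitting step itself is not shown. The function names are hypothetical.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (simplifying assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def mixture_nll(x, weights, means, variances):
    """Negative log-likelihood of x under a Gaussian mixture.

    Uses the log-sum-exp trick so that small component densities do not underflow.
    """
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return -(peak + math.log(sum(math.exp(value - peak) for value in logs)))


# Two equally weighted modes with unit variance (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near_mode = mixture_nll([0.0, 0.0], weights, means, variances)
between_modes = mixture_nll([2.0, 2.0], weights, means, variances)

# One mode that is stable along the first axis and variable along the second.
stable_axis = mixture_nll([1.0, 0.0], [1.0], [[0.0, 0.0]], [[0.1, 10.0]])
variable_axis = mixture_nll([0.0, 1.0], [1.0], [[0.0, 0.0]], [[0.1, 10.0]])
```
]]></code>
<p>A feature at a mode center scores lower than one lying between modes, so minimizing such a score pulls features toward high-density regions. The second pair shows the reweighting effect of the covariance: the same unit deviation is penalized more along the low-variance axis than along the high-variance one.</p>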
<p><bold>Teacher model.</bold> The teacher model adopts CenterPoint, a LiDAR-based 3D object detector that performs detection in the BEV space. Given an input LiDAR point cloud, it first quantizes the 3D space into regular bins (voxels or pillars) and encodes points within each bin into learned features. A standard LiDAR-based backbone network (e.g., VoxelNet [<xref ref-type="bibr" rid="ref-31">31</xref>] or PointPillars [<xref ref-type="bibr" rid="ref-32">32</xref>]) then produces a BEV feature map <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mrow><mml:mtext>bev</mml:mtext></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. The CenterPoint detection head predicts object centers and regresses 3D box attributes from this BEV feature map. Notably, the teacher model is used only during training to provide supervision for BEV feature distillation.</p>
<p><bold>Student model.</bold> The student model is based on BEVDepth [<xref ref-type="bibr" rid="ref-9">9</xref>], a camera-only BEV detector that explicitly lifts multi-view image features into the BEV space using depth-aware projection. It first extracts image features with an image backbone and predicts per-pixel depth distributions using a depth network. The features are then lifted to 3D space and projected onto a predefined BEV grid through a lift-splat-shoot operation, followed by a 2D BEV backbone for further encoding, producing the student BEV feature map <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mrow><mml:mtext>bev</mml:mtext></mml:mrow><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. It has the same spatial size as <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mrow><mml:mtext>bev</mml:mtext></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula>. For distillation, we align the channel dimension by applying a 1 <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1 convolution.</p>
<p><bold>Distribution-aware feature distillation (DAFD).</bold> For each ground-truth 3D bounding box <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>b</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> with a class label <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>c</mml:mi></mml:math></inline-formula>, we project <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>b</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> onto the BEV plane and extract a foreground region <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> from the teacher model. Then, we apply average pooling to aggregate the features and obtain <italic>D</italic>-dimensional feature vectors:<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>D</mml:mi></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>K</italic> is the number of foreground objects. 
<inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the feature vector of the <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>k</mml:mi></mml:math></inline-formula>-th object from the teacher model.</p>
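<p>As an illustration of the pooling step behind <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>, the following minimal sketch averages the BEV feature vectors over the cells covered by one projected box. It is not the authors' implementation: the function name <monospace>pool_region</monospace>, the nested-list layout, and the binary mask used to represent the foreground region are assumptions made for this example, and the pooled dimension is assumed to equal the number of channels.</p>
<code language="python"><![CDATA[
```python
def pool_region(bev, mask):
    """Average the per-cell BEV vectors over one foreground region.

    bev:  nested list of shape [C][H][W], a toy stand-in for a BEV feature map
    mask: nested list of shape [H][W], True for cells inside the projected box
    Both the layout and the mask representation are illustrative assumptions.
    """
    cells = [(y, x)
             for y, row in enumerate(mask)
             for x, inside in enumerate(row) if inside]
    if not cells:
        raise ValueError("foreground region is empty")
    # One averaged value per channel gives a C-dimensional object vector.
    return [sum(channel[y][x] for y, x in cells) / len(cells) for channel in bev]


bev = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [8.0, 4.0]],  # channel 1
]
mask = [[False, True], [True, False]]
f_k = pool_region(bev, mask)
print(f_k)
```
]]></code>
<p>Applying the same routine to every ground-truth box would yield the set of per-object teacher vectors described above.</p>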
<p>Because the feature distribution is highly class-dependent, mixing different semantic categories would lead to ambiguous high-density regions that provide misleading supervision for distillation. Therefore, we model each class <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>c</mml:mi></mml:math></inline-formula> separately. For a given class <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>c</mml:mi></mml:math></inline-formula>, we collect the corresponding teacher features:<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">D</mml:mtext></mml:mrow><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>D</mml:mi></mml:msup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> denotes the number of teacher features belonging to class <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>c</mml:mi></mml:math></inline-formula>. 
Due to variations in viewpoint, distance, and occlusion, features from the same category follow a multimodal distribution. To explicitly capture these intra-class modes, we model the teacher features of each class <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>c</mml:mi></mml:math></inline-formula> with a DPGMM. In this way, different appearance and geometry patterns are separated into distinct sub-modes:<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>D</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the mixture weights of Gaussian components for class <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>c</mml:mi></mml:math></inline-formula>, generated from a Dirichlet Process, satisfying <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>. Each component is parameterized by a mean <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and a covariance matrix <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
<inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:math></inline-formula> is the Gaussian distribution.</p>
<p>For each class <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>c</mml:mi></mml:math></inline-formula>, we independently perform Collapsed Variational Inference (CVI) to infer the posterior over latent assignments <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> indicates that feature <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> is generated from the <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>m</mml:mi></mml:math></inline-formula>-th Gaussian component. 
We approximate the posterior as:<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:mi>q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>with categorical factors:<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-35"><mml:math 
id="mml-ieqn-35"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the responsibility of the <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>m</mml:mi></mml:math></inline-formula>-th Gaussian component for feature <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula>. The collapsed variational updates for assignment responsibilities are:<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x221D;</mml:mo><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo 
stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x221D;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>and the normalized responsibilities are
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02133;</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02133;</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>r</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>new</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-38"><mml:math 
id="mml-ieqn-38"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the expected number of features assigned to the <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>m</mml:mi></mml:math></inline-formula>-th component excluding the current sample <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>k</mml:mi></mml:math></inline-formula>. <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is the Dirichlet process concentration parameter, controlling the trade-off between creating a new component and reusing existing ones. <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mtext>new</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> denotes the assignment responsibility that the feature <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> in class <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>c</mml:mi></mml:math></inline-formula> is assigned to a potential new component. 
<inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mrow><mml:mi>&#x02133;</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> is the set of currently instantiated components in class <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>c</mml:mi></mml:math></inline-formula>. <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the posterior predictive of the feature under component <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>m</mml:mi></mml:math></inline-formula>, given the posterior hyperparameters <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> computed from all other samples (again excluding <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>k</mml:mi></mml:math></inline-formula>). 
<inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the prior predictive under the prior hyperparameters <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msubsup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> for a new component.</p>
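<p>The normalization in <xref ref-type="disp-formula" rid="eqn-7">Eqs. (7)</xref> to <xref ref-type="disp-formula" rid="eqn-10">(10)</xref> can be sketched for a single feature as follows. The sketch assumes that the predictive densities have already been evaluated and are passed in as plain numbers; how they are computed depends on the chosen prior, which is not reproduced here. The function and argument names are hypothetical.</p>
<code language="python"><![CDATA[
```python
def cvi_responsibilities(counts_excl_k, pred_existing, pred_new, gamma):
    """Normalized responsibilities of one feature for one class.

    counts_excl_k: expected counts n_{-k,c,m} of the instantiated components
    pred_existing: posterior predictive densities of the feature, one per component
    pred_new:      prior predictive density of the feature for a new component
    gamma:         Dirichlet process concentration parameter
    All inputs are assumed to be precomputed non-negative floats.
    """
    # Eq. (7): unnormalized weight of each instantiated component.
    r_tilde = [n * p for n, p in zip(counts_excl_k, pred_existing)]
    # Eq. (8): unnormalized weight of opening a new component.
    r_tilde_new = gamma * pred_new
    total = sum(r_tilde) + r_tilde_new
    if total <= 0.0:
        raise ValueError("all unnormalized responsibilities are zero")
    # Eqs. (9) and (10): divide by the shared normalizer.
    return [r / total for r in r_tilde], r_tilde_new / total


r, r_new = cvi_responsibilities([3.0, 1.0], [0.2, 0.4], 0.5, gamma=0.8)
print(r, r_new)
```
]]></code>
<p>In this toy call, a larger concentration parameter would shift mass toward the new component, which matches the role described for it in the text.</p>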
<p>Using the collapsed sufficient statistics aggregated over all samples, we obtain the posterior hyperparameters <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C0;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, which characterize the class-wise multimodal teacher feature distribution. 
For each teacher feature <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula>, we define its dominant mode as<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02133;</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In rare cases, a mixture component is supported by only a few teacher features, which makes its density estimate unreliable. Enforcing distribution-aware distillation on such poorly supported components may introduce noisy supervision. Therefore, we apply a tiny-component filter. For each class <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>c</mml:mi></mml:math></inline-formula> and component <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>m</mml:mi></mml:math></inline-formula>, we compute the expected number of assigned samples <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. If <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, where the subscript min denotes a minimum-support threshold, we mark component <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>m</mml:mi></mml:math></inline-formula> as tiny and exclude it from distribution-aware distillation. For samples whose dominant mode <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msubsup><mml:mi>m</mml:mi><mml:mi>k</mml:mi><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula> is a tiny component, we fall back to the basic feature regression loss in <xref ref-type="disp-formula" rid="eqn-14">Eq. (14)</xref>.</p>
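<p>The dominant-mode selection and the tiny-component filter can be sketched together for one class. This is an illustrative example rather than the authors' code: it assumes that the expected support of a component is the sum of the responsibilities assigned to it, and the function name <monospace>dominant_modes</monospace> and the threshold value are hypothetical.</p>
<code language="python"><![CDATA[
```python
def dominant_modes(resp, n_min):
    """Pick the dominant mode of each feature and flag tiny components.

    resp:  resp[k][m] is the responsibility of component m for feature k
    n_min: minimum expected support for a component to be trusted
    Returns (modes, fallback, support); fallback[k] is True when the dominant
    mode of feature k is tiny, so the plain regression loss should be used.
    """
    num_components = len(resp[0])
    # Assumed definition of the expected support: sum of responsibilities.
    support = [sum(row[m] for row in resp) for m in range(num_components)]
    tiny = [n < n_min for n in support]
    # Dominant mode: the component with the largest responsibility.
    modes = [max(range(num_components), key=lambda m: row[m]) for row in resp]
    fallback = [tiny[m] for m in modes]
    return modes, fallback, support


resp = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
modes, fallback, support = dominant_modes(resp, n_min=1.5)
print(modes, fallback)
```
]]></code>
<p>In this toy input the second component has an expected support of about 1.2, so the last feature, whose dominant mode is that component, is routed to the fallback loss.</p>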
<p>Next, we extract student BEV features <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup></mml:math></inline-formula> from the corresponding foreground regions <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msubsup><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mi>k</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> of the student model, forming the set <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>D</mml:mi></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. 
For each teacher object <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> with class <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>c</mml:mi></mml:math></inline-formula> and dominant mode <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:math></inline-formula>, we regard the Gaussian distribution <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> as the target distribution for the feature <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup></mml:math></inline-formula>. 
The mode-aware loss is defined as follows:<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:msup><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo 
stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mi>m</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo>&#x22C6;</mml:mo></mml:msubsup></mml:mrow></mml:msub><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msubsup></mml:math></inline-formula> provides the primary mode-level distillation signal, imposing stronger supervision along critical directions with small teacher variance while adaptively relaxing the constraints along high-variance, noise-dominated directions.</p>
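<p>To make the variance-dependent weighting of <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref> concrete, the sketch below evaluates the loss under the simplifying assumption of a diagonal covariance, so that the inverse and the determinant reduce to per-dimension operations. The diagonal restriction and the function name <monospace>mode_loss_diag</monospace> are assumptions made for this example only.</p>
<code language="python"><![CDATA[
```python
import math


def mode_loss_diag(f_s, mu, var):
    """Mode-aware loss for a diagonal covariance (an assumption of this sketch).

    f_s: student feature vector
    mu:  mean of the dominant teacher component
    var: per-dimension variances of that component (all strictly positive)
    """
    # Squared Mahalanobis distance with a diagonal covariance.
    maha = sum((x - m) ** 2 / v for x, m, v in zip(f_s, mu, var))
    # Log-determinant of a diagonal matrix is the sum of the log variances.
    log_det = sum(math.log(v) for v in var)
    return 0.5 * maha + 0.5 * log_det


mu = [0.0, 0.0]
var = [0.01, 4.0]  # first axis is low-variance, second is high-variance
loss_tight = mode_loss_diag([0.5, 0.0], mu, var)  # offset along the low-variance axis
loss_loose = mode_loss_diag([0.0, 0.5], mu, var)  # same offset along the high-variance axis
print(loss_tight, loss_loose)
```
]]></code>
<p>The same offset costs far more along the low-variance axis than along the high-variance one, which is the behavior described above. Because the log-determinant term does not depend on the student feature, it shifts the loss value but not its gradient with respect to the student.</p>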
<p>In addition, we introduce a mixture-level regularization term to further align class-wise feature distributions across different modes:<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02133;</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03C0;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>&#x03BC;</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mo 
stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
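<p>As a minimal sketch of Eq. (13), the mixture term can be evaluated with a log-sum-exp over the components of the class mixture. The snippet assumes diagonal covariances for brevity, and the names <monospace>diag_gaussian_logpdf</monospace>, <monospace>mixture_nll</monospace>, <monospace>weights</monospace>, <monospace>means</monospace>, and <monospace>variances</monospace> are hypothetical; it illustrates the computation and is not the implementation used in our experiments.</p>
<code language="python">
```python
import math


def diag_gaussian_logpdf(f, mean, var):
    """Log-density of a Gaussian with diagonal covariance, evaluated at f."""
    d = len(f)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(f, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def mixture_nll(f, weights, means, variances):
    """Negative log-likelihood of f under a Gaussian mixture (Eq. 13).

    Assumption: every component has a diagonal covariance.
    """
    log_terms = [
        math.log(w) + diag_gaussian_logpdf(f, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_terms)  # shift by the largest term for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))
```
</code>
<p>Working in the log domain keeps the term finite even when the student feature lies far from every mode, which is likely early in training.</p>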
<p>To stabilize early training and ensure robustness, we include a standard pair-wise feature loss:<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>k</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>k</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:msubsup><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msubsup></mml:math></inline-formula> preserves instance-level details, ensuring that the student does not ignore the specific teacher representation for the current sample. The weight of this loss is gradually decayed to avoid interfering with the probabilistic distillation loss.</p>
<p>Thus, the final BEV feature distillation loss is defined as follows:<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>K</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-72"><mml:math 
id="mml-ieqn-72"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> are scalar weights. With the above distribution-level constraints, the student progressively aligns with the LiDAR BEV feature distribution in a global statistical sense, effectively narrowing the feature gap between the two modalities. The pseudocode of the proposed distribution-aware feature distillation is presented in Algorithm 1.</p>
<fig id="fig-6">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-6.tif"/>
</fig>
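<p>The pair-wise term of Eq. (14) and the aggregation of Eq. (15) can be sketched as follows. The linear decay of the pair-loss weight is an assumption made for illustration (the text only states that the weight is gradually decayed), and all function and argument names are hypothetical.</p>
<code language="python">
```python
def pair_loss(f_student, f_teacher):
    """Squared L2 distance between student and teacher features (Eq. 14)."""
    return sum((s - t) ** 2 for s, t in zip(f_student, f_teacher))


def decayed_pair_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly interpolate the pair-loss weight from start to end.

    Assumption: the linear shape and the default values are illustrative.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def feature_distillation_loss(mode_losses, mix_losses, pair_losses,
                              lam_mode, lam_mix, lam_pair):
    """Average the weighted per-instance terms over K instances (Eq. 15)."""
    k = len(mode_losses)
    total = sum(
        lam_mode * l_mode + lam_mix * l_mix + lam_pair * l_pair
        for l_mode, l_mix, l_pair in zip(mode_losses, mix_losses, pair_losses)
    )
    return total / k
```
</code>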
<p><bold>Response-Level Distillation.</bold> To transfer knowledge from the teacher&#x2019;s detection head to the student&#x2019;s detection head, which shares the same architecture, we introduce a response-level loss that directly encourages the student head&#x2019;s outputs to match the teacher&#x2019;s responses. We also apply ground-truth-guided head distillation to prevent background-dominated, uninformative locations from propagating noise.</p>
<p>For the classification branch, we distill the teacher&#x2019;s soft responses in foreground regions and define the classification distillation term using a Gaussian focal loss, following [<xref ref-type="bibr" rid="ref-30">30</xref>]:<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>GFocal</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:mi>M</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the classification heatmap outputs of the student and teacher models, respectively, and <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> 
denotes element-wise multiplication. The foreground mask <italic>M</italic> is generated from the ground-truth Gaussian heatmap.</p>
<p>For the regression branch, following the training scheme of CenterPoint, we compute the regression distillation term using a <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> loss only at the center locations of positive samples:<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the regression vectors predicted by the student and teacher models at the corresponding center locations, and <inline-formula id="ieqn-107"><mml:math 
id="mml-ieqn-107"><mml:mrow><mml:mi>&#x1D4B2;</mml:mi></mml:mrow></mml:math></inline-formula> denotes the weight matrix. Finally, we obtain the response-level distillation loss:<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>cls</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
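<p>A rough sketch of the two response-level terms is given below for flattened heatmaps. It uses the CornerNet-style form of the Gaussian focal loss with exponents 2 and 4; these exponents, the treatment of cells whose target equals 1 as positives, and the sum and mean reductions are assumptions, and the function names are hypothetical.</p>
<code language="python">
```python
import math


def masked_target(teacher_heatmap, mask):
    """Element-wise product of the teacher heatmap and the foreground mask."""
    return [h * m for h, m in zip(teacher_heatmap, mask)]


def gaussian_focal_loss(pred, target, alpha=2.0, gamma=4.0, eps=1e-12):
    """CornerNet-style Gaussian focal loss, summed over all cells.

    Assumption: alpha, gamma and the sum reduction are illustrative choices.
    """
    loss = 0.0
    for p, y in zip(pred, target):
        if y == 1.0:  # peak cells are treated as positives
            loss += -math.log(p + eps) * (1.0 - p) ** alpha
        else:  # other cells are soft negatives, down-weighted near peaks
            loss += -math.log(1.0 - p + eps) * p ** alpha * (1.0 - y) ** gamma
    return loss


def l1_at_centers(reg_student, reg_teacher, center_indices):
    """Mean absolute difference of regression vectors at positive centers."""
    total, count = 0.0, 0
    for i in center_indices:
        for s, t in zip(reg_student[i], reg_teacher[i]):
            total += abs(s - t)
            count += 1
    return total / max(count, 1)
```
</code>
<p>Under these assumptions, the response-level loss of Eq. (18) is the sum of <monospace>gaussian_focal_loss</monospace> applied to the masked teacher heatmap and <monospace>l1_at_centers</monospace> applied to the regression outputs.</p>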
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Overall Loss</title>
<p>To summarize, we improve the camera-based student detector by distilling knowledge from a LiDAR teacher at two complementary levels. First, we perform distribution-aware feature distillation, which aligns the student&#x2019;s BEV representations with the teacher via distribution-consistency constraints. Second, we apply response-level distillation on the detection head to further transfer the teacher&#x2019;s prediction behavior, providing direct output-level guidance. These distillation objectives are jointly optimized with the student&#x2019;s original training losses [<xref ref-type="bibr" rid="ref-9">9</xref>], including the standard 3D detection loss <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>det</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> and the depth supervision loss <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>depth</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>. 
The overall loss is defined as:<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>total</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>det</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>depth</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> balance the contributions of feature-level and response-level distillation, respectively. 
During training, we assign a larger <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> in the early stage to facilitate feature-level knowledge transfer, and then gradually decrease <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> while increasing <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, thereby shifting the optimization focus to <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> to refine the final predictions.</p>
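<p>The weight schedule described above could be realized, for example, with a cosine interpolation between start and end values. The cosine shape, the default values, and the function names below are assumptions chosen for illustration; the text only specifies that the feature weight decreases while the response weight increases.</p>
<code language="python">
```python
import math


def distill_weights(epoch, total_epochs, feat_start=1.0, feat_end=0.1,
                    resp_start=0.1, resp_end=1.0):
    """Return (lam_feat, lam_resp) for the given epoch.

    Assumption: cosine interpolation and the default endpoints are
    hypothetical; only the monotone trends follow the text.
    """
    frac = min(max(epoch / total_epochs, 0.0), 1.0)
    blend = 0.5 * (1.0 - math.cos(math.pi * frac))  # 0 at start, 1 at end
    lam_feat = feat_start + (feat_end - feat_start) * blend
    lam_resp = resp_start + (resp_end - resp_start) * blend
    return lam_feat, lam_resp


def total_loss(l_det, l_depth, l_feat, l_resp, lam_feat, lam_resp):
    """Overall training objective of Eq. (19)."""
    return l_det + l_depth + lam_feat * l_feat + lam_resp * l_resp
```
</code>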
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Temporal Multi-View 3D Object Detection</title>
<p>While several existing methods achieve competitive 3D perception using a single image frame, relying solely on single-frame cues inevitably leads to performance bottlenecks. First, a single frame provides only static geometric and appearance information, which can result in unstable motion estimation. Second, objects that are occluded or only partially observed in one frame are more likely to be missed or localized inaccurately, hindering reliable detection. Incorporating temporal context improves the completeness and robustness of the representation. To this end, we introduce a lightweight, plug-and-play two-frame temporal fusion module that leverages distilled BEV features from the previous frame as historical compensation and injects cross-frame information into the current-frame representation through explicit alignment and adaptive fusion, thereby improving detection stability.</p>
<p>We take the current-frame BEV feature <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> as the primary representation and use the previous-frame feature <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to provide cross-frame information compensation. As illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, we first map the historical feature into the current coordinate system to eliminate the effect of ego-motion. We then selectively incorporate historical information through motion-aware gating and suppress dynamic and inconsistent regions. Finally, we achieve stable fusion in a residual manner.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Architecture of the two-frame temporal fusion module.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-3.tif"/>
</fig>
<p>Specifically, we compute the relative transformation matrix <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> from the ego poses of two consecutive frames. Based on this transformation, we construct a sampling function <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> that maps a BEV grid location <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>x</mml:mi></mml:math></inline-formula> in the current frame to the corresponding feature coordinates in the previous frame. We then align the previous-frame BEV feature to the current coordinate system via spatial sampling:<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>Warp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula 
id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>Warp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes bilinear interpolation on the BEV grid [<xref ref-type="bibr" rid="ref-33">33</xref>]. To compensate for local misalignment caused by discretized interpolation and dynamic objects, we introduce a learnable refinement on top of rigid alignment and predict a small offset increment <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for each BEV location:<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>Conv</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo 
stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> denotes concatenation along the channel dimension. <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> is used to constrain the maximum displacement magnitude, ensuring that the refinement only compensates for local errors without introducing unstable deformations.</p>
<p>Unlike a cascaded two-stage warp (first obtaining <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and then resampling on <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>), we directly add the rigid grid and the residual increment to obtain a joint sampling grid, and perform only a single interpolated sampling on the original <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to obtain the final aligned historical features:<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>Warp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi 
mathvariant="normal">&#x0394;</mml:mi><mml:mi>u</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>This design ensures that the entire alignment process performs only one interpolation, numerically avoiding the extra smoothing and amplification of systematic bias introduced by a second resampling.</p>
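<p>The single-interpolation alignment of Eqs. (20)&#x2013;(22) can be sketched on a single-channel BEV map as follows. The sketch assumes that coordinates are expressed in grid cells, that the rigid transform is a planar rotation about the grid origin plus a translation, and that cells outside the map contribute zero; all function names are hypothetical.</p>
<code language="python">
```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample feat[row][col] at (x, y); outside cells count as zero."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    value = 0.0
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if xi in range(w) and yi in range(h):
                value += wx * wy * feat[yi][xi]
    return value


def rigid_grid(h, w, yaw, tx, ty):
    """Map each current-frame cell to previous-frame coordinates, G(x).

    Assumption: rotation about the grid origin, translation in cells.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return [[(c * x - s * y + tx, s * x + c * y + ty) for x in range(w)]
            for y in range(h)]


def bounded_offset(raw, f_max):
    """Squash a raw offset prediction into [-f_max, f_max] as in Eq. (21)."""
    return f_max * math.tanh(raw)


def warp_once(prev_feat, grid, offsets):
    """Sample prev_feat a single time on the joint grid G(x) + delta_u(x)."""
    out = []
    for grid_row, off_row in zip(grid, offsets):
        out.append([bilinear_sample(prev_feat, gx + du, gy + dv)
                    for (gx, gy), (du, dv) in zip(grid_row, off_row)])
    return out
```
</code>
<p>Because the offsets are added to the rigid grid before sampling, the original previous-frame map is interpolated only once.</p>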
<p>Next, we introduce a pixel-wise gating weight <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:msub><mml:mrow><mml:mi>&#x1D4A2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> to estimate the contribution of history based on the current features, the aligned historical features, and motion priors (relative pose increment and time interval):<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x1D4A2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>Conv</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo 
stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is a low-dimensional encoding of the relative translation and yaw, and <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is a time-interval encoding. <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the sigmoid function. 
Then, we construct an inconsistency map from the aligned cross-frame differences and explicitly suppress the gating:<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:msup><mml:mi>&#x1D4A2;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>Conv</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>B</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Finally, we aggregate information from the previous frame in a residual manner:<disp-formula id="ueqn-25"><mml:math id="mml-ueqn-25" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>fused</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:msup><mml:mi>&#x1D4A2;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>B</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution for channel alignment, and <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes element-wise multiplication.</p>
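<p>For a single BEV cell, the gating of Eqs. (23) and (24) and the residual update can be sketched as below. The two convolution outputs are passed in as precomputed logits, and <monospace>phi</monospace> stands in for the channel-alignment convolution; these simplifications and the function names are assumptions made for illustration.</p>
<code language="python">
```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse_cell(b_t, b_prev_aligned, gate_logit, inconsistency_logit, phi):
    """Fuse one BEV cell given its channel vectors.

    gate_logit and inconsistency_logit are hypothetical stand-ins for the
    convolution outputs inside Eqs. (23) and (24).
    """
    gate = sigmoid(gate_logit)                            # Eq. (23)
    gate = gate * (1.0 - sigmoid(inconsistency_logit))    # Eq. (24)
    projected = phi(b_prev_aligned)                       # channel alignment
    return [c + gate * p for c, p in zip(b_t, projected)]  # residual fusion
```
</code>
<p>When the inconsistency logit is large, the gate approaches zero and the fused feature reduces to the current-frame feature, so unreliable history cannot degrade the current representation.</p>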
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<p>In this section, we first present the evaluation setup, including the dataset, evaluation metrics, and implementation details. We then conduct a series of ablation studies and related analyses to investigate the role and contribution of each component in our method. Finally, we compare our method with current state-of-the-art methods on a widely used benchmark dataset.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset and Metrics</title>
<p>We evaluate our method on the nuScenes dataset, which covers diverse driving scenarios and sensor configurations.</p>
<p><bold>nuScenes Dataset</bold> contains 1000 scenes (700 train, 150 val, 150 test) captured with 6 cameras and a 32-beam LiDAR at 20 Hz/10 Hz. Annotations include 1.4M 3D bounding boxes for 10 classes: <italic>car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, barrier, traffic cone</italic>. We use the official metrics: nuScenes Detection Score (NDS), mean Average Precision (mAP), and 5 True Positive (TP) metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE), and Average Attribute Error (AAE). The NDS is calculated as follows:<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>NDS</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>10</mml:mn></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mn>5</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>mAP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mtext>mTP</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mi mathvariant="double-struck">T</mml:mi><mml:mi mathvariant="double-struck">P</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>mTP</mml:mtext><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where mTP ranges over the five mean TP metrics listed above.</p>
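<p>As a worked example of Eq. (25), the score can be computed from the mAP and the five mean TP errors as sketched below. The function name <monospace>nds</monospace> is hypothetical, and the numbers in the usage note are made up for illustration; they are not results from this paper.</p>
<code language="python">
```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors (Eq. 25).

    tp_errors holds the ATE, ASE, AOE, AVE and AAE values, in any order.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP errors: ATE, ASE, AOE, AVE, AAE")
    # Each error is clipped at 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0
```
</code>
<p>For instance, with a hypothetical mAP of 0.4 and TP errors of 0.6, 0.3, 0.5, 1.2, and 0.2, the TP scores are 0.4, 0.7, 0.5, 0, and 0.8, so the NDS is (2.0 + 2.4)/10 = 0.44. An error above 1, such as the 1.2 here, is clipped and contributes nothing.</p>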
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Implementation Details</title>
<p>Our framework is implemented using the MMDetection3D toolkit and trained on 4 NVIDIA GeForce RTX 4090 GPUs. We employ the AdamW optimizer with a cosine-scheduled learning rate of <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and a batch size of 8. Models are trained for 20 epochs on nuScenes using the Class-Balanced Group Sampling (CBGS) strategy. Data augmentations follow [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-34">34</xref>], including random flipping, scaling, rotation, and noise injection. We use ResNet-50/101 pre-trained on ImageNet-1K as image backbones. For LiDAR data during training, we adopt a pre-trained CenterPoint model. CenterPoint was selected because it is a representative and widely used 3D detector with strong performance, providing a reliable baseline for evaluating DA-T3D and facilitating fair comparison with other models. For ablation studies, we train models for 24 epochs. When comparing with state-of-the-art methods, we extend training to 60 epochs for convergence. During inference, we process 2 frames and apply motion compensation using ego-vehicle pose information. For temporal data augmentation, we randomly skip 1 frame in the training sequence.</p>
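<p>The ego-motion compensation step can be illustrated with a simplified planar example. The sketch below assumes that each ego pose is available as a hypothetical (x, y, yaw) tuple in a shared global frame, and it maps a single BEV point from the previous ego frame into the current one. The function name <monospace>prev_to_curr</monospace> is illustrative; the actual implementation operates on BEV feature maps and may differ in detail.</p>
<code language="python">
```python
import math


def prev_to_curr(point, prev_pose, curr_pose):
    """Map a BEV point from the previous ego frame into the current ego frame.

    Poses are hypothetical (x, y, yaw) tuples of the ego vehicle in a shared
    global frame, with yaw in radians. This is an assumed 2D simplification.
    """
    px, py = point
    x0, y0, yaw0 = prev_pose
    x1, y1, yaw1 = curr_pose
    # Previous ego frame to global frame: rotate by yaw0, then translate.
    gx = x0 + math.cos(yaw0) * px - math.sin(yaw0) * py
    gy = y0 + math.sin(yaw0) * px + math.cos(yaw0) * py
    # Global frame to current ego frame: inverse rigid transform.
    dx, dy = gx - x1, gy - y1
    cx = math.cos(yaw1) * dx + math.sin(yaw1) * dy
    cy = -math.sin(yaw1) * dx + math.cos(yaw1) * dy
    return cx, cy


# A static object 10 m ahead of the previous ego position. The ego vehicle
# then drives 2 m straight ahead, so the object should lie 8 m ahead.
print(prev_to_curr((10.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)))
```
</code>
<p>Applying such a transform before fusion places static structures from the previous frame at the same BEV locations as in the current frame, so that the remaining differences mainly reflect object motion.</p>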
<p>In our DPGMM-based feature modeling, we set the tiny-component filter threshold to <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 5 and the Dirichlet Process concentration parameter to <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> &#x003D; 1.0. For the DPGMM base prior, (<inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>) is specified as a simple data-adaptive weak prior. For each class, we compute the mean and diagonal covariance of the teacher ROI-aggregated features over the training set, use them as <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:msub><mml:mi mathvariant="normal">&#x03A3;</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and add a small diagonal jitter for numerical stability. In our implementation, the DPGMM is fitted offline, and the resulting mixture parameters remain fixed throughout distillation training. Moreover, the DPGMM is used only during training. At inference time, neither the teacher model nor the DPGMM fitting process is involved, and thus our method introduces no additional computational cost compared with the student detector. 
We set the initial loss weights to <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 1.0 and <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.1, and linearly adjust them during training by gradually decreasing <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> while increasing <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>resp</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, so that training emphasizes feature distillation in the early stage and shifts the focus to response distillation in the later stage. We fix <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 1.0 and <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.1, while <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> is decayed with a cosine schedule from 1.0 to 0.</p>
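<p>The weight schedules described above can be sketched as follows. The end values of the two linear schedules (<monospace>feat_end</monospace>, <monospace>resp_end</monospace>), the per-step granularity, and all function names are illustrative assumptions, since only the initial weights and the direction of change are specified. The fixed weights and the cosine decay of <italic>&#x03BB;</italic><sub>pair</sub> from 1.0 to 0 follow the settings stated above.</p>
<code language="python">
```python
import math


def linear(start, end, progress):
    """Linearly interpolate from start to end as progress goes from 0 to 1."""
    return start + (end - start) * progress


def cosine_decay(start, end, progress):
    """Cosine-shaped decay from start (progress 0) to end (progress 1)."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def loss_weights(step, total_steps, feat_end=0.1, resp_end=1.0):
    """Distillation loss weights at a given training step.

    feat_end and resp_end are assumed end values: the text only states the
    initial weights (1.0 and 0.1) and the direction in which they change.
    """
    progress = min(1.0, max(0.0, step / total_steps))
    return {
        "feat": linear(1.0, feat_end, progress),   # decreases over training
        "resp": linear(0.1, resp_end, progress),   # increases over training
        "mode": 1.0,                               # fixed
        "mix": 0.1,                                # fixed
        "pair": cosine_decay(1.0, 0.0, progress),  # cosine decay to 0
    }


for step in (0, 50, 100):
    print(step, loss_weights(step, 100))
```
</code>
<p>With this kind of schedule, the point-wise term dominates early training and fades out, while the fixed mode-level and mixture-level terms remain active throughout.</p>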
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison with Other Models</title>
<p>We first report the main comparison results on the nuScenes validation set under the standard evaluation protocol. For a fair comparison, we group methods by backbone and input setting (image resolution and the number of frames), and summarize the overall 3D detection performance using the official metrics mAP and NDS. <xref ref-type="table" rid="table-1">Table 1</xref> compares our method with representative camera-based 3D detectors, and <xref ref-type="table" rid="table-2">Table 2</xref> further benchmarks different cross-modal distillation strategies under comparable student/teacher settings.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison of methods on the nuScenes val set for 3D object detection.</title>
</caption>
 
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Backbone</th>
<th>Image Size</th>
<th>Frames</th>
<th>mAP<inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>NDS<inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>mATE<inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>mASE<inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>mAOE<inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>mAVE<inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
<th>mAAE<inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mo stretchy="false">&#x2193;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>BEVDet [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>1</td>
<td>0.298</td>
<td>0.379</td>
<td>0.725</td>
<td>0.279</td>
<td>0.589</td>
<td>0.860</td>
<td>0.245</td>
</tr>
<tr>
<td>BEVDet4D [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>2</td>
<td>0.322</td>
<td>0.457</td>
<td>0.703</td>
<td>0.278</td>
<td>0.495</td>
<td>0.354</td>
<td>0.206</td>
</tr>
<tr>
<td>PETRv2 [<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>2</td>
<td>0.349</td>
<td>0.456</td>
<td>0.700</td>
<td>0.275</td>
<td>0.580</td>
<td>0.437</td>
<td>0.187</td>
</tr>
<tr>
<td>BEVDepth [<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>2</td>
<td>0.351</td>
<td>0.475</td>
<td>0.639</td>
<td>0.267</td>
<td>0.479</td>
<td>0.428</td>
<td>0.198</td>
</tr>
<tr>
<td>BEVStereo [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>2</td>
<td>0.372</td>
<td>0.500</td>
<td>0.598</td>
<td>0.270</td>
<td>0.438</td>
<td>0.367</td>
<td>0.190</td>
</tr>
<tr>
<td>BEVFormerv2 [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>ResNet50</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.423</td>
<td>0.529</td>
<td>0.618</td>
<td>0.273</td>
<td>0.413</td>
<td>0.333</td>
<td>0.188</td>
</tr>
<tr>
<td>SOLOFusion [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-161"><mml:math id="mml-ieqn-161"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>16 &#x002B; 1</td>
<td>0.427</td>
<td>0.534</td>
<td>0.567</td>
<td>0.274</td>
<td>0.511</td>
<td>0.252</td>
<td>0.181</td>
</tr>
<tr>
<td>BEVPoolv2 [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>8 &#x002B; 1</td>
<td>0.406</td>
<td>0.526</td>
<td>0.572</td>
<td>0.275</td>
<td>0.463</td>
<td>0.275</td>
<td>0.188</td>
</tr>
<tr>
<td>DA-T3D</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>2</td>
<td>0.421</td>
<td>0.543</td>
<td>0.532</td>
<td>0.223</td>
<td>0.398</td>
<td>0.347</td>
<td>0.175</td>
</tr>
<tr>
<td>DETR3D [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>ResNet101-DCN</td>
<td>900 <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>1</td>
<td>0.349</td>
<td>0.434</td>
<td>0.716</td>
<td>0.268</td>
<td>0.379</td>
<td>0.842</td>
<td>0.200</td>
</tr>
<tr>
<td>Focal-PETR [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>ResNet101-DCN</td>
<td>512 <inline-formula id="ieqn-165"><mml:math id="mml-ieqn-165"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>1</td>
<td>0.390</td>
<td>0.461</td>
<td>0.678</td>
<td>0.263</td>
<td>0.395</td>
<td>0.804</td>
<td>0.202</td>
</tr>
<tr>
<td>PETR [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>ResNet101-DCN</td>
<td>512 <inline-formula id="ieqn-166"><mml:math id="mml-ieqn-166"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>1</td>
<td>0.366</td>
<td>0.441</td>
<td>0.717</td>
<td>0.267</td>
<td>0.412</td>
<td>0.834</td>
<td>0.190</td>
</tr>
<tr>
<td>BEVFormer [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>ResNet101-DCN</td>
<td>900 <inline-formula id="ieqn-167"><mml:math id="mml-ieqn-167"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>4</td>
<td>0.416</td>
<td>0.517</td>
<td>0.673</td>
<td>0.274</td>
<td>0.372</td>
<td>0.394</td>
<td>0.198</td>
</tr>
<tr>
<td>PolarDETR [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>ResNet101-DCN</td>
<td>900 <inline-formula id="ieqn-168"><mml:math id="mml-ieqn-168"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>2</td>
<td>0.383</td>
<td>0.488</td>
<td>0.707</td>
<td>0.269</td>
<td>0.344</td>
<td>0.518</td>
<td>0.196</td>
</tr>
<tr>
<td>Sparse4D [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>ResNet101-DCN</td>
<td>900 <inline-formula id="ieqn-169"><mml:math id="mml-ieqn-169"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>4</td>
<td>0.436</td>
<td>0.541</td>
<td>0.633</td>
<td>0.279</td>
<td>0.363</td>
<td>0.317</td>
<td>0.177</td>
</tr>
<tr>
<td>CenterNet [<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>DLA</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.306</td>
<td>0.328</td>
<td>0.716</td>
<td>0.264</td>
<td>0.609</td>
<td>1.426</td>
<td>0.658</td>
</tr>
<tr>
<td>FCOS3D [<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>ResNet101</td>
<td>900 <inline-formula id="ieqn-170"><mml:math id="mml-ieqn-170"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>&#x2013;</td>
<td>0.343</td>
<td>0.415</td>
<td>0.725</td>
<td>0.263</td>
<td>0.422</td>
<td>1.292</td>
<td>0.153</td>
</tr>
<tr>
<td>PGD [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>ResNet101</td>
<td>900 <inline-formula id="ieqn-171"><mml:math id="mml-ieqn-171"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>&#x2013;</td>
<td>0.369</td>
<td>0.428</td>
<td>0.683</td>
<td>0.260</td>
<td>0.439</td>
<td>1.268</td>
<td>0.185</td>
</tr>
<tr>
<td>BEVDepth</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>2</td>
<td>0.412</td>
<td>0.535</td>
<td>0.565</td>
<td>0.266</td>
<td>0.358</td>
<td>0.331</td>
<td>0.190</td>
</tr>
<tr>
<td>SOLOFusion</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>16 &#x002B; 1</td>
<td>0.483</td>
<td>0.582</td>
<td>0.503</td>
<td>0.264</td>
<td>0.381</td>
<td>0.246</td>
<td>0.207</td>
</tr>
<tr>
<td>BEVFormer-En [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.418</td>
<td>0.529</td>
<td>0.631</td>
<td>0.268</td>
<td>0.328</td>
<td>0.373</td>
<td>0.194</td>
</tr>
<tr>
<td>BEVDiffuser [<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.430</td>
<td>0.537</td>
<td>0.638</td>
<td>0.274</td>
<td>0.333</td>
<td>0.355</td>
<td>0.179</td>
</tr>
<tr>
<td>DA-T3D</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>2</td>
<td>0.467</td>
<td>0.581</td>
<td>0.501</td>
<td>0.239</td>
<td>0.315</td>
<td>0.295</td>
<td>0.178</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison with other cross-modal knowledge distillation methods on the nuScenes val set.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Method</th>
<th>Backbone</th>
<th>Image Size</th>
<th>Student Model</th>
<th>Teacher Modality</th>
<th>mAP<inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>NDS<inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>MemDistill [<xref ref-type="bibr" rid="ref-48">48</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.425</td>
<td>0.531</td>
</tr>
<tr>
<td>LabelDistill [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.419</td>
<td>0.528</td>
</tr>
<tr>
<td>X3KD [<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.390</td>
<td>0.505</td>
</tr>
<tr>
<td>Set2Set [<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>&#x2013;</td>
<td>LiDAR</td>
<td>0.375</td>
<td>0.479</td>
</tr>
<tr>
<td>MonoDistill [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>MonoDLE</td>
<td>LiDAR</td>
<td>0.390</td>
<td>0.495</td>
</tr>
<tr>
<td>BEVDistill [<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>ResNet50</td>
<td>900 <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>BEVFormer</td>
<td>LiDAR</td>
<td>0.407</td>
<td>0.515</td>
</tr>
<tr>
<td>DA-T3D</td>
<td>ResNet50</td>
<td>256 <inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.421</td>
<td>0.543</td>
</tr>
<tr>
<td>UVTR [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>ResNet101</td>
<td>900 <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1600</td>
<td>&#x2013;</td>
<td>LiDAR</td>
<td>0.392</td>
<td>0.488</td>
</tr>
<tr>
<td>DistillBEV [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.450</td>
<td>0.547</td>
</tr>
<tr>
<td>PromptDet [<xref ref-type="bibr" rid="ref-52">52</xref>]</td>
<td>ResNet101</td>
<td>256 <inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.433</td>
<td>0.569</td>
</tr>
<tr>
<td>TiG-BEV [<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-187"><mml:math id="mml-ieqn-187"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.440</td>
<td>0.544</td>
</tr>
<tr>
<td>SimDistill [<xref ref-type="bibr" rid="ref-54">54</xref>]</td>
<td>SwinT</td>
<td>256 <inline-formula id="ieqn-188"><mml:math id="mml-ieqn-188"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704</td>
<td>BEVFusion-C</td>
<td>LiDAR &#x0026; Camera</td>
<td>0.404</td>
<td>0.453</td>
</tr>
<tr>
<td>DA-T3D</td>
<td>ResNet101</td>
<td>512 <inline-formula id="ieqn-189"><mml:math id="mml-ieqn-189"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408</td>
<td>BEVDepth</td>
<td>LiDAR</td>
<td>0.467</td>
<td>0.581</td>
</tr>
</tbody>
</table>
</table-wrap>
 
<p><xref ref-type="table" rid="table-1">Table 1</xref> compares our method with representative camera-based 3D object detection approaches on the nuScenes validation set, evaluated by mAP, NDS, and five TP error metrics. Overall, our approach achieves strong performance under two common settings: with ResNet50 at 256 <inline-formula id="ieqn-190"><mml:math id="mml-ieqn-190"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704 using 2 frames, we obtain 0.421 mAP and 0.543 NDS; with ResNet101 at 512 <inline-formula id="ieqn-191"><mml:math id="mml-ieqn-191"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408 using 2 frames, we reach 0.467 mAP and 0.581 NDS. Beyond the headline scores, our method attains the lowest mATE and mAOE among the methods compared in each backbone group, indicating improved localization and orientation estimation rather than a metric-specific trade-off.</p>

<p>Under the ResNet50 configuration, most competing methods use the same input resolution and typically 2 frames. Our method attains the highest NDS in this group (0.543) together with the lowest geometry-related errors (mATE: 0.532, mASE: 0.223, mAOE: 0.398). These gains are consistent with our design motivation: instead of enforcing strict point-wise matching, we perform distribution-aware cross-modal distillation that guides the student toward high-density and geometrically stable regions of the teacher feature space, which helps mitigate modality mismatch and suppress noisy teacher outliers. In addition, response-level distillation transfers reliable decision behavior to the student, contributing to the overall improvements across TP metrics.</p>
<p>Some methods, such as SOLOFusion (16 &#x002B; 1 frames) and BEVPoolv2 (8 &#x002B; 1 frames), achieve strong performance by aggregating many historical frames. In contrast, our model uses only 2 frames yet achieves competitive or better NDS and notably lower translation error, suggesting that lightweight temporal fusion with alignment and selective information injection can effectively compensate for occlusions and missing observations without relying on long sequences.</p>
<p>The ResNet101 results further confirm the scalability of our framework. With 2 frames, we achieve 0.467 mAP and 0.581 NDS, together with lower errors on all five TP metrics than the 2-frame BEVDepth baseline (0.412 mAP, 0.535 NDS). These consistent gains support our conclusion that distribution-aware distillation provides robust geometric supervision under modality gaps, while the lightweight temporal design improves robustness in dynamic and occluded scenarios at low temporal overhead.</p>
<p><xref ref-type="table" rid="table-2">Table 2</xref> compares our approach with representative cross-modal knowledge distillation and multi-modal baselines on the nuScenes validation set. Existing distillation methods typically reduce the camera&#x2013;LiDAR gap via foreground feature imitation, label and response distillation, or multi-stage alignment. However, their gains can be affected by modality-induced distribution mismatch and noisy teacher signals (e.g., missed or false detections and feature jitter), which may limit robustness. In contrast, our method performs distribution-aware cross-modal distillation and combines it with a lightweight temporal design, aiming to transfer more stable geometric knowledge while keeping the student model efficient.</p>

<p>Under the ResNet50, 256 <inline-formula id="ieqn-192"><mml:math id="mml-ieqn-192"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 704 setting (with BEVDepth as the student), our method achieves 0.421 mAP and 0.543 NDS, yielding the best NDS among the listed ResNet50 methods. Although MemDistill obtains a slightly higher mAP in this group (0.425 vs. 0.421), the higher NDS indicates better overall detection quality once the TP error terms are taken into account. This supports the view that distribution-aware alignment can relax strict point-wise matching under cross-modal heterogeneity and suppress outlier teacher supervision, resulting in more reliable knowledge transfer.</p>
<p>With a stronger backbone and higher resolution (ResNet101, 512 <inline-formula id="ieqn-193"><mml:math id="mml-ieqn-193"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1408), our method achieves 0.467 mAP and 0.581 NDS, the best results among the ResNet101 distillation methods listed. Compared with prior distillation approaches under similar student/teacher modality settings, this suggests that explicitly addressing distribution mismatch and teacher noise is important, and that simply introducing additional modalities or branches does not guarantee consistent improvements. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> visualizes the detection results of our method and other distillation methods. Overall, the results indicate that our framework scales favorably across backbones and resolutions, providing stable gains for camera-based 3D detection.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Visualization results of different cross-modal knowledge distillation methods.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-4.tif"/>
</fig>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Study</title>
<p>To better understand where the improvements come from, we conduct controlled ablations on the nuScenes validation set by adding one component at a time. We use BEVDepth as the camera-only baseline student, adopt CenterPoint as the teacher for distillation, and employ a lightweight 2-frame temporal modeling strategy. The results in <xref ref-type="table" rid="table-3">Table 3</xref> quantify the incremental contribution of feature distillation, response distillation, and temporal modeling.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Ablation study of each component.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Setting</th>
<th>Baseline</th>
<th>Feature Distill.</th>
<th>Response Distill.</th>
<th>Temporal (2-Frame)</th>
<th>mAP<inline-formula id="ieqn-194"><mml:math id="mml-ieqn-194"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>NDS<inline-formula id="ieqn-195"><mml:math id="mml-ieqn-195"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><inline-formula id="ieqn-196"><mml:math id="mml-ieqn-196"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td></td>
<td></td>
<td>0.412</td>
<td>0.535</td>
</tr>
<tr>
<td>2</td>
<td><inline-formula id="ieqn-197"><mml:math id="mml-ieqn-197"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-198"><mml:math id="mml-ieqn-198"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td></td>
<td>0.443</td>
<td>0.565</td>
</tr>
<tr>
<td>3</td>
<td><inline-formula id="ieqn-199"><mml:math id="mml-ieqn-199"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-200"><mml:math id="mml-ieqn-200"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-201"><mml:math id="mml-ieqn-201"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td>0.451</td>
<td>0.572</td>
</tr>
<tr>
<td>4</td>
<td><inline-formula id="ieqn-202"><mml:math id="mml-ieqn-202"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-203"><mml:math id="mml-ieqn-203"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-204"><mml:math id="mml-ieqn-204"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-205"><mml:math id="mml-ieqn-205"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td>0.467</td>
<td>0.581</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-3">Table 3</xref> reports the component ablation on the nuScenes validation set. In Setting 1, the baseline achieves 0.412 mAP and 0.535 NDS. In Setting 2, adding feature distillation improves performance to 0.443 mAP and 0.565 NDS, indicating that intermediate BEV representation guidance from the LiDAR teacher helps narrow the modality gap and provides more reliable geometric cues for the camera model, thereby improving spatial feature quality and overall 3D detection performance. In Setting 3, further enabling response distillation yields 0.451 mAP and 0.572 NDS. This suggests that output-level supervision complements feature-level alignment by refining the student&#x2019;s prediction distribution, which improves the final detection outputs. Finally, in Setting 4, incorporating 2-frame temporal modeling achieves the best performance of 0.467 mAP and 0.581 NDS. Temporal fusion aggregates complementary observations across consecutive frames, mitigating single-frame noise and partial occlusions and producing more stable BEV features and more consistent localization, which is reflected in both mAP and NDS. To illustrate the difference between the baseline and the full model, we visualize their inference results in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
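<p>The incremental contribution of each component can be read from <xref ref-type="table" rid="table-3">Table 3</xref> by differencing consecutive rows. The short sketch below performs this arithmetic; the row labels and the helper name <monospace>incremental_gains</monospace> are illustrative, and the numbers are copied from the table.</p>
<code language="python">
```python
# (setting, mAP, NDS) rows copied from the component ablation table.
rows = [
    ("baseline", 0.412, 0.535),
    ("+ feature distillation", 0.443, 0.565),
    ("+ response distillation", 0.451, 0.572),
    ("+ 2-frame temporal", 0.467, 0.581),
]


def incremental_gains(table):
    """Return (name, delta mAP, delta NDS) of each row relative to the previous row."""
    gains = []
    for prev, curr in zip(table, table[1:]):
        d_map = round(curr[1] - prev[1], 3)
        d_nds = round(curr[2] - prev[2], 3)
        gains.append((curr[0], d_map, d_nds))
    return gains


for name, d_map, d_nds in incremental_gains(rows):
    print(f"{name}: mAP {d_map:+.3f}, NDS {d_nds:+.3f}")
```
</code>
<p>With these numbers, feature distillation accounts for the largest single step (+0.031 mAP, +0.030 NDS), followed by temporal modeling (+0.016 mAP, +0.009 NDS) and response distillation (+0.008 mAP, +0.007 NDS).</p>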
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Visualization results of the baseline and our method.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_80595-fig-5.tif"/>
</fig>
<p>To further investigate the contribution of the proposed distribution-aware feature distillation objective, we additionally perform a loss-level ablation study, as reported in <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Ablation study of each loss.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Setting</th>
<th><inline-formula id="ieqn-206"><mml:math id="mml-ieqn-206"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-207"><mml:math id="mml-ieqn-207"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-208"><mml:math id="mml-ieqn-208"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th>mAP<inline-formula id="ieqn-209"><mml:math id="mml-ieqn-209"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>NDS<inline-formula id="ieqn-210"><mml:math id="mml-ieqn-210"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><inline-formula id="ieqn-211"><mml:math id="mml-ieqn-211"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td></td>
<td>0.448</td>
<td>0.563</td>
</tr>
<tr>
<td>2</td>
<td><inline-formula id="ieqn-212"><mml:math id="mml-ieqn-212"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-213"><mml:math id="mml-ieqn-213"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td>0.464</td>
<td>0.579</td>
</tr>
<tr>
<td>3</td>
<td><inline-formula id="ieqn-214"><mml:math id="mml-ieqn-214"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-215"><mml:math id="mml-ieqn-215"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-216"><mml:math id="mml-ieqn-216"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td>0.467</td>
<td>0.581</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-4">Table 4</xref> further analyzes the effect of each loss term in the proposed distribution-aware feature distillation. With only the pair-wise loss <inline-formula id="ieqn-217"><mml:math id="mml-ieqn-217"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, the model achieves 0.448 mAP and 0.563 NDS, indicating that instance-level feature regression provides a stable optimization basis for cross-modal alignment. Adding the mode-aware loss <inline-formula id="ieqn-218"><mml:math id="mml-ieqn-218"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> raises performance to 0.464 mAP and 0.579 NDS. This notable improvement suggests that additionally constraining the student features toward the dominant high-density mode of the teacher distribution is more effective than relying on point-wise matching alone, as it better captures the intrinsic structure of teacher features while reducing the influence of noisy samples. Adding the mixture-level regularization <inline-formula id="ieqn-219"><mml:math id="mml-ieqn-219"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> brings a further improvement to 0.467 mAP and 0.581 NDS. Although this gain is smaller, it suggests that mixture-level distribution alignment provides complementary supervision by encouraging global consistency across different intra-class modes. 
Overall, the best performance is achieved by jointly using <inline-formula id="ieqn-220"><mml:math id="mml-ieqn-220"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>pair</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-221"><mml:math id="mml-ieqn-221"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mode</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-222"><mml:math id="mml-ieqn-222"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mtext>mix</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, validating the effectiveness and complementarity of the three loss terms.</p>

<p>We further analyze the design of the temporal modeling module by ablating its key components, including the learnable refinement and the motion-aware gating mechanism, as shown in <xref ref-type="table" rid="table-5">Table 5</xref>.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Ablation study of each component in temporal modeling.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Setting</th>
<th>Baseline</th>
<th>Learnable Refinement</th>
<th>Motion-Aware Gating</th>
<th>mAP<inline-formula id="ieqn-223"><mml:math id="mml-ieqn-223"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
<th>NDS<inline-formula id="ieqn-224"><mml:math id="mml-ieqn-224"><mml:mo stretchy="false">&#x2191;</mml:mo></mml:math></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><inline-formula id="ieqn-225"><mml:math id="mml-ieqn-225"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td></td>
<td>0.460</td>
<td>0.576</td>
</tr>
<tr>
<td>2</td>
<td><inline-formula id="ieqn-226"><mml:math id="mml-ieqn-226"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-227"><mml:math id="mml-ieqn-227"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td></td>
<td>0.464</td>
<td>0.579</td>
</tr>
<tr>
<td>3</td>
<td><inline-formula id="ieqn-228"><mml:math id="mml-ieqn-228"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-229"><mml:math id="mml-ieqn-229"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-230"><mml:math id="mml-ieqn-230"><mml:mi>&#x2713;</mml:mi></mml:math></inline-formula></td>
<td>0.467</td>
<td>0.581</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-5">Table 5</xref> presents an ablation study of each component in the temporal modeling module. Here, the baseline denotes the basic two-frame fusion design with ego-motion compensation only, without the learnable refinement offset or the motion-aware gating mechanism. Under this setting, the model achieves 0.460 mAP and 0.576 NDS, showing that simple temporal aggregation already provides useful historical context. Introducing the learnable refinement improves the performance to 0.464 mAP and 0.579 NDS (+0.004 mAP, +0.003 NDS). This gain indicates that compensating for local misalignment beyond rigid ego-motion warping is beneficial, since discretization errors and dynamic scene variations cannot be fully handled by geometric transformation alone. Further incorporating the motion-aware gating mechanism brings the performance to 0.467 mAP and 0.581 NDS (+0.003 mAP, +0.002 NDS). This result suggests that adaptively controlling the contribution of historical features helps suppress inconsistent or noisy temporal information, especially in regions affected by object motion or partial occlusion.</p>

</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>This paper presents a novel distribution-aware cross-modal distillation framework that transfers geometric priors from a LiDAR-based teacher to a camera-only student for temporal 3D object detection in the BEV space. To address distillation instability caused by modality heterogeneity and noisy teacher features, we propose distribution-aware BEV feature distillation, which explicitly models the class-conditional BEV feature distributions of the teacher using a DPGMM and constrains student features to match the teacher&#x2019;s distribution in a probabilistic manner. We also introduce response-level distillation to transfer task-specific decision behavior at the detection head, improving output calibration and localization refinement. Furthermore, we design a lightweight two-frame temporal fusion module with ego-motion compensation, residual alignment refinement, and motion-aware gating to robustly aggregate complementary observations from consecutive frames. Although the framework achieves promising results, it may be less effective in adverse environments (e.g., low light, rain, or fog) where both camera and LiDAR signals degrade, making the teacher&#x2019;s predictions unreliable and the student&#x2019;s inputs severely corrupted. In such cases, distillation may propagate erroneous supervision and reduce overall performance. In future work, we will systematically investigate robustness under severe sensor degradation. We will also address the uncertainty caused by long-tail categories and complex motion patterns, and explore more adaptive mixture distribution modeling and uncertainty characterization methods.</p>
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the National Natural Science Foundation of China (Grant No. 62302086).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: conceptualization, Tianzhe Jiao and Jie Song; methodology, Tianzhe Jiao and Yuming Chen; software, Xiaoyue Feng; validation, Yuming Chen, Tianzhe Jiao and Chaopeng Guo; formal analysis, Tianzhe Jiao; investigation, Yuming Chen; resources, Tianzhe Jiao; data curation, Xiaoyue Feng; writing&#x2014;original draft preparation, Tianzhe Jiao; writing&#x2014;review and editing, Jie Song; visualization, Yuming Chen; supervision, Jie Song; project administration, Chaopeng Guo; funding acquisition, Chaopeng Guo. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available from the corresponding author upon reasonable request. The original datasets used in the study are openly available in public repositories: nuScenes at <ext-link ext-link-type="uri" xlink:href="https://www.nuscenes.org/">https://www.nuscenes.org/</ext-link> and KITTI at <ext-link ext-link-type="uri" xlink:href="http://www.cvlibs.net/datasets/kitti/eval_object.php">http://www.cvlibs.net/datasets/kitti/eval_object.php</ext-link>.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qing</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Efficient and robust multi-camera 3D object detection in Bird-Eye-View</article-title>. <source>Image Vis Comput</source>. <year>2025</year>;<volume>154</volume>:<fpage>105428</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.imavis.2025.105428</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Multi-view depth estimation based on multi-feature aggregation for 3D reconstruction</article-title>. <source>Comput Graph</source>. <year>2024</year>;<volume>122</volume>(<issue>4</issue>):<fpage>103954</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cag.2024.103954</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhuo</surname> <given-names>L</given-names></string-name></person-group>. <article-title>BEV-CMHF: a cross-modality hybrid fusion framework for BEV 3D object detection with feature interaction and temporal fusion</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2026</year>. <comment>Early access</comment>. doi:<pub-id pub-id-type="doi">10.1109/TITS.2026.3651793</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chong</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yue</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>MonoDistill: learning spatial features for monocular 3D object detection</article-title>. In: <conf-name>The Tenth International Conference on Learning Representations, ICLR 2022; 2022 Apr 25&#x2013;29; Virtual Event</conf-name>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Unifying voxel-based representation with transformer for 3D object detection</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>18442</fpage>&#x2013;<lpage>55</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>F</given-names></string-name></person-group>. <article-title>BEVDistill: cross-modal BEV distillation for multi-view 3D object detection</article-title>. In: <conf-name>The Eleventh International Conference on Learning Representations</conf-name>; <year>2023 [cited 2026 Jan 1]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=-2zfgNS917">https://openreview.net/forum?id&#x003D;-2zfgNS917</ext-link>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Du</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>DMPG-BEV: diffusion model-based point clouds features generation for efficient camera-based BEV perception</article-title>. <source>IEEE Sens J</source>. <year>2025</year>;<volume>25</volume>(<issue>15</issue>):<fpage>28905</fpage>&#x2013;<lpage>18</lpage>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Philion</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fidler</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2020: 16th European Conference; 2020 Aug 23&#x2013;28</conf-name>; <publisher-loc>Glasgow, UK</publisher-loc>. p. <fpage>194</fpage>&#x2013;<lpage>210</lpage>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ge</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVDepth: acquisition of reliable depth for multi-view 3D object detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2023</year>;<volume>37</volume>(<issue>2</issue>):<fpage>1477</fpage>&#x2013;<lpage>85</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Sima</surname> <given-names>C</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVFormer: learning bird&#x2019;s-eye-view representation from multi-camera images via spatiotemporal transformers</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>1</fpage>&#x2013;<lpage>18</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>GeoBEV: learning geometric BEV representation for multi-view 3D object detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2025</year>;<volume>39</volume>(<issue>9</issue>):<fpage>9960</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guizilini</surname> <given-names>VC</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Solomon</surname> <given-names>J</given-names></string-name></person-group>. <article-title>DETR3D: 3D object detection from multi-view images via 3D-to-2D queries</article-title>. <source>Proc Mach Learn Res</source>. <year>2022</year>;<volume>164</volume>:<fpage>180</fpage>&#x2013;<lpage>91</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>F</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>PETRv2: a unified framework for 3D perception from multi-camera images</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 1&#x2013;6</conf-name>; <publisher-loc>Paris, France</publisher-loc>. p. <fpage>3262</fpage>&#x2013;<lpage>72</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pei</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Su</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Sparse4D: multi-view 3D object detection with sparse spatial-temporal fusion</article-title>. <comment>arXiv:2211.10581. 2022</comment>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ling</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>DDCFusion: dynamic depth compensation fusion for camera&#x2013;radar 3-D object detection</article-title>. <source>IEEE Sens J</source>. <year>2026</year>;<volume>26</volume>(<issue>3</issue>):<fpage>4561</fpage>&#x2013;<lpage>74</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Vora</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lang</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Helou</surname> <given-names>B</given-names></string-name>, <string-name><surname>Beijbom</surname> <given-names>O</given-names></string-name></person-group>. <article-title>PointPainting: sequential fusion for 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 13&#x2013;19</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>4604</fpage>&#x2013;<lpage>12</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sindagi</surname> <given-names>VA</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tuzel</surname> <given-names>O</given-names></string-name></person-group>. <article-title>MVX-Net: multimodal VoxelNet for 3D object detection</article-title>. In: <conf-name>Proceedings of the 2019 International Conference on Robotics and Automation (ICRA); 2019 May 20&#x2013;24</conf-name>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>7276</fpage>&#x2013;<lpage>82</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Li</surname> <given-names>C</given-names></string-name></person-group>. <article-title>PPF-Net: efficient multimodal 3D object detection with pillar-point fusion</article-title>. <source>Electronics</source>. <year>2025</year>;<volume>14</volume>(<issue>4</issue>):<fpage>685</fpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVFusion: a simple and robust LiDAR-camera fusion framework</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>10421</fpage>&#x2013;<lpage>34</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Amini</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Rus</surname> <given-names>DL</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVFusion: multi-task multi-sensor fusion with unified bird&#x2019;s-eye view representation</article-title>. In: <conf-name>Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA); 2023 May 29&#x2013;Jun 2</conf-name>; <publisher-loc>London, UK</publisher-loc>. p. <fpage>2774</fpage>&#x2013;<lpage>81</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Multi-view 3D object detection network for autonomous driving</article-title>. In: <conf-name>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017; 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>6526</fpage>&#x2013;<lpage>34</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Morris</surname> <given-names>D</given-names></string-name>, <string-name><surname>Radha</surname> <given-names>H</given-names></string-name></person-group>. <article-title>CLOCs: camera-LiDAR object candidates fusion for 3D object detection</article-title>. In: <conf-name>Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 2020 Oct 24&#x2013;2021 Jan 24</conf-name>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>10386</fpage>&#x2013;<lpage>93</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name></person-group>. <article-title>MV2DFusion: leveraging modality-specific object semantics for multi-modal 3D detection</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2026</year>;<volume>48</volume>(<issue>1</issue>):<fpage>609</fpage>&#x2013;<lpage>23</lpage>; <pub-id pub-id-type="pmid">40938719</pub-id></mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Ou</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>FGU3R: fine-grained fusion via unified 3D representation for multimodal 3D object detection</article-title>. In: <conf-name>Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025; 2025 Apr 6&#x2013;11</conf-name>; <publisher-loc>Hyderabad, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>DistillBEV: boosting multi-camera 3D object detection with cross-modal knowledge distillation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 1&#x2013;6</conf-name>; <publisher-loc>Paris, France</publisher-loc>. p. <fpage>8637</fpage>&#x2013;<lpage>46</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>C</given-names></string-name></person-group>. <article-title>UniDistill: a universal cross-modality knowledge distillation framework for 3D object detection in bird&#x2019;s-eye view</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023 Jun 17&#x2013;24</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>5116</fpage>&#x2013;<lpage>25</lpage>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hekimoglu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schmidt</surname> <given-names>M</given-names></string-name>, <string-name><surname>Marcos-Ramiro</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Monocular 3D object detection with LiDAR guided semi supervised active learning</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024; 2024 Jan 3&#x2013;8</conf-name>; <publisher-loc>Waikoloa, HI, USA</publisher-loc>. p. <fpage>2335</fpage>&#x2013;<lpage>44</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Xiang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Dang</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>SCKD: semi-supervised cross-modality knowledge distillation for 4D radar object detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2025</year>;<volume>39</volume>(<issue>9</issue>):<fpage>8933</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>YQ</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>ML</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>SB</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>YC</given-names></string-name></person-group>. <article-title>One-stage object detection knowledge distillation via adversarial learning</article-title>. <source>Appl Intell</source>. <year>2022</year>;<volume>52</volume>(<issue>4</issue>):<fpage>4582</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10489-021-02634-6</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hwang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Jeong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kum</surname> <given-names>D</given-names></string-name></person-group>. <article-title>LabelDistill: label-guided cross-modal knowledge distillation for camera-based 3D object detection</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2024</year>. p. <fpage>19</fpage>&#x2013;<lpage>37</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tuzel</surname> <given-names>O</given-names></string-name></person-group>. <article-title>VoxelNet: end-to-end learning for point cloud based 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>4490</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lang</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Vora</surname> <given-names>S</given-names></string-name>, <string-name><surname>Caesar</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Beijbom</surname> <given-names>O</given-names></string-name></person-group>. <article-title>PointPillars: fast encoders for object detection from point clouds</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019; 2019 Jun 16&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>12697</fpage>&#x2013;<lpage>705</lpage>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>G</given-names></string-name></person-group>. <article-title>BEVDet4D: exploit temporal cues in multi-camera 3D object detection</article-title>. <comment>arXiv:2203.17054. 2022</comment>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Du</surname> <given-names>D</given-names></string-name></person-group>. <article-title>BEVDet: high-performance multi-camera 3D object detection in bird-eye-view</article-title>. <comment>arXiv:2112.11790. 2021</comment>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ge</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>BEVStereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2023</year>;<volume>37</volume>(<issue>2</issue>):<fpage>1486</fpage>&#x2013;<lpage>94</lpage>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>H</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVFormer v2: adapting modern image backbones to bird&#x2019;s-eye-view recognition via perspective supervision</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023; 2023 Jun 17&#x2013;24</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>17830</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Keutzer</surname> <given-names>K</given-names></string-name>, <string-name><surname>Kitani</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Tomizuka</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Time will tell: new outlooks and a baseline for temporal multi-view 3D object detection</article-title>. In: <conf-name>Proceedings of the Eleventh International Conference on Learning Representations, ICLR 2023; 2023 May 1&#x2013;5</conf-name>; <publisher-loc>Kigali, Rwanda</publisher-loc>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>G</given-names></string-name></person-group>. <article-title>BEVPoolv2: a cutting-edge implementation of BEVDet toward deployment</article-title>. <comment>arXiv:2211.17111. 2022</comment>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Focal-PETR: embracing foreground for efficient multi-camera 3D object detection</article-title>. <source>IEEE Trans Intell Veh</source>. <year>2024</year>;<volume>9</volume>(<issue>1</issue>):<fpage>1481</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <chapter-title>PETR: position embedding transformation for multi-view 3D object detection</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Avidan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Brostow</surname> <given-names>GJ</given-names></string-name>, <string-name><surname>Ciss&#x00E9;</surname> <given-names>M</given-names></string-name>, <string-name><surname>Farinella</surname> <given-names>GM</given-names></string-name>, <string-name><surname>Hassner</surname> <given-names>T</given-names></string-name></person-group>, editors. <source>Computer vision&#x2014;ECCV 2022</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>531</fpage>&#x2013;<lpage>48</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-19812-0_31</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>E</given-names></string-name>, <string-name><surname>Sima</surname> <given-names>C</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>BEVFormer: learning bird&#x2019;s-eye-view representation from LiDAR-camera via spatiotemporal transformers</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2025</year>;<volume>47</volume>(<issue>3</issue>):<fpage>2020</fpage>&#x2013;<lpage>36</lpage>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Polar parametrization for vision-based surround-view 3D detection</article-title>. <comment>arXiv:2206.10965. 2022</comment>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yin</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Kr&#x00E4;henb&#x00FC;hl</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Center-based 3D object detection and tracking</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021; 2021 Jun 19&#x2013;25; Virtual</conf-name>. p. <fpage>11784</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr46437.2021.01161</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>D</given-names></string-name></person-group>. <article-title>FCOS3D: fully convolutional one-stage monocular 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021; 2021 Oct 11&#x2013;17</conf-name>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>913</fpage>&#x2013;<lpage>22</lpage>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Probabilistic and geometric depth: detecting objects in perspective</article-title>. <source>Proc Mach Learn Res</source>. <year>2022</year>;<volume>164</volume>:<fpage>1475</fpage>&#x2013;<lpage>85</lpage>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nachkov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Paudel</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Danelljan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Van Gool</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Diffusion-based particle-DETR for BEV perception</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025; 2025 Feb 26&#x2013;Mar 6</conf-name>; <publisher-loc>Tucson, AZ, USA</publisher-loc>. p. <fpage>2725</fpage>&#x2013;<lpage>35</lpage>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ye</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yaman</surname> <given-names>B</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>F</given-names></string-name>, <string-name><surname>Mallik</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>L</given-names></string-name></person-group>. <article-title>BEVDiffuser: plug-and-play diffusion model for BEV denoising with ground-truth guidance</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025; 2025 Jun 11&#x2013;15</conf-name>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>1495</fpage>&#x2013;<lpage>504</lpage>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kwon</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yoon</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Son</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kwak</surname> <given-names>S</given-names></string-name></person-group>. <article-title>MemDistill: distilling LiDAR knowledge into memory for camera-only 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2025 Oct 19&#x2013;23</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>6828</fpage>&#x2013;<lpage>38</lpage>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Klingner</surname> <given-names>M</given-names></string-name>, <string-name><surname>Borse</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kumar</surname> <given-names>VR</given-names></string-name>, <string-name><surname>Rezaei</surname> <given-names>B</given-names></string-name>, <string-name><surname>Narayanan</surname> <given-names>V</given-names></string-name>, <string-name><surname>Yogamani</surname> <given-names>SK</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>X3KD: knowledge distillation across modalities, tasks and stages for multi-camera 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023; 2023 Jun 17&#x2013;24</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>13343</fpage>&#x2013;<lpage>53</lpage>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Solomon</surname> <given-names>JM</given-names></string-name></person-group>. <article-title>Object DGCNN: 3D object detection using dynamic graphs</article-title>. In: <conf-name>Proceedings of the Neural Information Processing Systems 2021, NeurIPS 2021; 2021 Dec 6&#x2013;14; Virtual</conf-name>. p. <fpage>20745</fpage>&#x2013;<lpage>58</lpage>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Exploring object-centric temporal modeling for efficient multi-view 3D object detection</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023 Oct 1&#x2013;6</conf-name>; <publisher-loc>Paris, France</publisher-loc>. p. <fpage>3621</fpage>&#x2013;<lpage>31</lpage>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>K</given-names></string-name>, <string-name><surname>Ling</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>PromptDet: a lightweight 3D object detection framework with LiDAR prompts</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2025</year>;<volume>39</volume>(<issue>3</issue>):<fpage>3266</fpage>&#x2013;<lpage>74</lpage>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>B</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>TiG-BEV: multi-view BEV 3D object detection via target inner-geometry learning</article-title>. <comment>arXiv:2212.13979. 2022</comment>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name></person-group>. <article-title>SimDistill: simulated multi-modal distillation for BEV 3D object detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2024</year>;<volume>38</volume>(<issue>7</issue>):<fpage>7460</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i7.28577</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>