<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">73330</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.073330</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Can Domain Knowledge Make Deep Models Smarter? Expert-Guided PointPillar (EG-PointPillar) for Enhanced 3D Object Detection</article-title>
<alt-title alt-title-type="left-running-head">Can Domain Knowledge Make Deep Models Smarter? Expert-Guided PointPillar (EG-PointPillar) for Enhanced 3D Object Detection</alt-title>
<alt-title alt-title-type="right-running-head">Can Domain Knowledge Make Deep Models Smarter? Expert-Guided PointPillar (EG-PointPillar) for Enhanced 3D Object Detection</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Ahn</surname><given-names>Chiwan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Kim</surname><given-names>Daehee</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>daeheekim@sch.ac.kr</email></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Park</surname><given-names>Seongkeun</given-names></name><xref ref-type="aff" rid="aff-3">3</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>skpark@tukorea.ac.kr</email></contrib>
<aff id="aff-1"><label>1</label><institution>Convergence Security for Automobile, Soonchunhyang University, Asan-si</institution>, <addr-line>31538, Chungcheongnam-do</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Future Convergence Technology, Soonchunhyang University, Asan-si</institution>, <addr-line>31538, Chungcheongnam-do</addr-line>, <country>Republic of Korea</country></aff>
<aff id="aff-3"><label>3</label><institution>Department of Mechanical Design Engineering, Tech University of Korea, Siheung-si</institution>, <addr-line>15073, Gyeonggi-do</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Authors: Daehee Kim. Email: <email>daeheekim@sch.ac.kr</email>; Seongkeun Park. Email: <email>skpark@tukorea.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>84</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>09</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>12</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_73330.pdf"></self-uri>
<abstract>
<p>This paper proposes a deep learning-based 3D LiDAR perception framework designed for applications such as autonomous robots and vehicles. To address the high dependency on large-scale annotated data&#x2014;an inherent limitation of deep learning models&#x2014;this study introduces a hybrid perception architecture that incorporates expert-driven LiDAR processing techniques into the deep neural network. Traditional 3D LiDAR processing methods typically remove ground planes and apply distance- or density-based clustering for object detection. In this work, such expert knowledge is encoded as feature-level inputs and fused with the deep network, thereby mitigating the data dependency issue of conventional learning-based approaches. Specifically, the proposed method combines two expert algorithms&#x2014;Patchwork&#x002B;&#x002B; for ground segmentation and DBSCAN for clustering&#x2014;with a PointPillars-based LiDAR detection network. We design four hybrid versions of the network depending on the stage and method of integrating expert features into the feature map of the deep model. Among these, Version 4 incorporates a modified neck structure in PointPillars and introduces a new Cluster 2D Pseudo-Map Branch that utilizes cluster-level pseudo-images generated from Patchwork&#x002B;&#x002B; and DBSCAN. This version achieved a &#x002B;3.88% improvement in mean Average Precision (mAP) compared to the baseline PointPillars. The results demonstrate that embedding expert-based perception logic into deep neural architectures can effectively enhance performance and reduce dependency on extensive training datasets, offering a promising direction for robust 3D LiDAR object detection in real-world scenarios.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>LiDAR</kwd>
<kwd>PointPillar</kwd>
<kwd>expert knowledge</kwd>
<kwd>autonomous driving</kwd>
<kwd>deep learning</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Basic Science Research Program through the National Research Foundation of Korea (NRF)</funding-source>
<award-id>RS-2023-00245084</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE)</funding-source>
<award-id>RS-2024-00415938</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Accurate perception of surrounding objects is a fundamental requirement for high-level autonomous driving systems [<xref ref-type="bibr" rid="ref-1">1</xref>]. Autonomous vehicles are typically equipped with multiple sensors, such as LiDAR, RADAR, and cameras, and various perception algorithms have been proposed to detect and classify surrounding objects in both 2D and 3D spaces [<xref ref-type="bibr" rid="ref-2">2</xref>&#x2013;<xref ref-type="bibr" rid="ref-4">4</xref>]. Among these, approaches utilizing 3D point cloud data acquired from LiDAR sensors have gained significant attention. These methods include traditional clustering-based techniques [<xref ref-type="bibr" rid="ref-5">5</xref>] as well as deep learning-based models such as VoxelNet [<xref ref-type="bibr" rid="ref-6">6</xref>] and PointPillars [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<p>Traditional LiDAR-based perception algorithms typically detect objects by clustering spatially adjacent points. A widely used algorithm in this category is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [<xref ref-type="bibr" rid="ref-5">5</xref>], which identifies clusters in a point cloud based on density. DBSCAN groups points into a cluster if a minimum number of neighboring points fall within a predefined radius. This method is robust to noise and outliers and does not require prior knowledge of the number or shape of clusters. However, it has limitations: it cannot classify each cluster into semantic object categories, nor can it leverage internal distributional features of the clustered points. Furthermore, optimal performance depends on careful tuning of multiple parameters, which often requires the intervention of domain experts.</p>
<p>Additionally, traditional clustering-based methods struggle in complex scenarios&#x2014;such as when objects are partially occluded by others or when the sensor&#x2019;s viewpoint varies&#x2014;resulting in reduced adaptability, as illustrated in the red box in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Example of occluded object clustering results. (<bold>A</bold>) Camera data; (<bold>B</bold>) Object clustering using LiDAR data</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-1.tif"/>
</fig>
<p>To overcome these limitations, deep learning-based approaches have recently emerged as a dominant paradigm for 3D object detection. One notable method is VoxelNet, which employs 3D Convolutional Neural Networks (3D CNNs) to process LiDAR point clouds by converting voxel features into 2D representations for object detection. VoxelNet achieved a mean Average Precision (mAP) of 49.05% on the KITTI 3D Object Detection benchmark. Subsequent models further improved performance by replacing standard 3D convolutions with sparse convolutions and adopting multi-scale feature extraction strategies, achieving an mAP of 56.69% [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<p>PointPillars, another widely used approach, voxelizes the 3D point cloud into vertical columns (pillars) and generates pseudo-images for efficient 2D convolutional processing. This method demonstrated an improved mAP of 59.20% on the KITTI dataset. However, despite its effectiveness, PointPillars still exhibits limitations in detecting distant or sparsely represented objects with few LiDAR points. Furthermore, like most deep learning models, it is highly dependent on its training dataset, making it vulnerable to data bias and to uncertainty in scenarios not represented during training [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p><xref ref-type="fig" rid="fig-2">Fig. 2</xref> illustrates a common labeling issue in training data, where a vehicle is present in the LiDAR point cloud but is missing a corresponding 3D bounding box annotation. Such omissions can cause the model to learn incorrect associations, ultimately degrading detection performance.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Examples of training data labeling quality issues</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-2.tif"/>
</fig>
<p>To address the limitations of data dependency while leveraging the strengths of deep learning in adapting to diverse scenarios, we propose a hybrid framework called the Expert-Guided PointPillars (EG-PointPillars). The proposed method follows the standard pipeline of LiDAR-based deep learning perception architectures but incorporates expert-driven, clustering-based LiDAR processing techniques into the deep learning model itself. This hybridization aims to enhance robustness in situations where training data is insufficient, incomplete, or biased.</p>
<p>In particular, this study seeks to reduce reliance on large annotated datasets by combining classical clustering algorithms with deep neural networks. While various combinations of expert-based and deep learning-based methods are possible, this work focuses on integrating two representative expert algorithms&#x2014;Patchwork&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-10">10</xref>] for ground segmentation and DBSCAN [<xref ref-type="bibr" rid="ref-5">5</xref>] for point cloud clustering&#x2014;into the widely used PointPillars model. Through this integration, we demonstrate the effectiveness of the proposed hybrid approach in enhancing LiDAR-based object detection performance under challenging and unseen conditions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>In this section, we briefly review studies relevant to the proposed algorithm. <xref ref-type="sec" rid="s2_1">Section 2.1</xref> examines deep learning&#x2013;based recognition algorithms, and <xref ref-type="sec" rid="s2_2">Section 2.2</xref> discusses empirical approaches to LiDAR perception.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Deep Learning Based LiDAR Algorithm</title>
<sec id="s2_1_1">
<label>2.1.1</label>
<title>PointPillars</title>
<p>PointPillars, introduced in 2019, is a deep learning algorithm that performs 3D object detection by converting 3D LiDAR point cloud data into a pseudo-image representation, allowing the use of efficient 2D convolutions. The architecture and operational flow of PointPillars [<xref ref-type="bibr" rid="ref-7">7</xref>] are shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The structure of PointPillars</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-3.tif"/>
</fig>
<p>1. Pillar Feature Net
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>The input 3D point cloud is voxelized into vertical columns called pillars.</p></list-item>
<list-item><label>&#x2003;&#x2013;</label><p>A simplified PointNet is applied to each pillar to extract point-wise features.</p></list-item>
<list-item><label>&#x2003;&#x2013;</label><p>These features are aggregated and transformed into a pseudo-image with dimensions (H, W, C), where H and W correspond to spatial resolution and C denotes the feature channel dimension.</p></list-item>
</list></p>
<p>2. Backbone
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>This stage extracts spatial features from the pseudo-image.</p></list-item>
<list-item><label>&#x2003;&#x2013;</label><p>A series of 2D convolutional layers are applied to obtain multi-scale feature maps.</p></list-item>
<list-item><label>&#x2003;&#x2013;</label><p>These multi-scale features are then upsampled using transposed 2D convolutions.</p></list-item>
<list-item><label>&#x2003;&#x2013;</label><p>The upsampled features are concatenated along the channel dimension to form a unified feature representation.</p></list-item>
</list></p>
<p>3. Detection Head
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>A Single Shot Detector (SSD) is employed to perform 3D object detection from the aggregated feature map.</p></list-item>
</list></p>
<p>In summary, the key innovation of PointPillars lies in its Pillar Feature Net, which enables efficient processing of 3D LiDAR point clouds using standard 2D convolutional neural networks by transforming sparse 3D data into dense 2D pseudo-images.</p>
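<p>The pillarization step above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation&#x2014;PointPillars applies a learned simplified PointNet per pillar before scattering features back&#x2014;and the detection range, cell size, and the two hand-picked channels (point count and maximum height) are illustrative assumptions:</p>

```python
import numpy as np

def pillarize(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Bin an (N, 3) point cloud into vertical pillars and build a dense
    (C, H, W) pseudo-image; here C = 2 (point count, max height)."""
    W = int((x_range[1] - x_range[0]) / cell)
    H = int((y_range[1] - y_range[0]) / cell)
    xs = ((points[:, 0] - x_range[0]) / cell).astype(int)
    ys = ((points[:, 1] - y_range[0]) / cell).astype(int)
    ok = (xs >= 0) & (xs < W) & (ys >= 0) & (ys < H)  # drop out-of-range points
    pseudo = np.zeros((2, H, W), dtype=np.float32)
    for x, y, z in zip(xs[ok], ys[ok], points[ok, 2]):
        pseudo[0, y, x] += 1.0                     # points per pillar
        pseudo[1, y, x] = max(pseudo[1, y, x], z)  # tallest point in the pillar
    return pseudo
```

<p>The dense (C, H, W) output is what allows a standard 2D CNN backbone to process originally sparse 3D data.</p>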
</sec>
<sec id="s2_1_2">
<label>2.1.2</label>
<title>ExistenceMap-PointPillars</title>
<p>ExistenceMap-PointPillars [<xref ref-type="bibr" rid="ref-11">11</xref>], proposed in 2023, is a deep learning-based 3D object detection framework that builds upon PointPillars by incorporating additional image data. In this model, both 3D LiDAR point cloud data and 360&#x00B0; surround-view images captured around the vehicle are utilized to enhance object detection performance. The detailed architecture of ExistenceMap-PointPillars is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The structure of ExistenceMap-PointPillars. Reprinted with permission from Reference [<xref ref-type="bibr" rid="ref-11">11</xref>]. &#x00A9; 2023 by Hariya et al. Licensee MDPI, Basel, Switzerland</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-4.tif"/>
</fig>
<p>The overall architecture of ExistenceMap-PointPillars and the operation of its core component, the ExistenceMap Module, are described as follows.
<list list-type="simple">
<list-item><label>1.</label><p>YOLOv7 [<xref ref-type="bibr" rid="ref-12">12</xref>] is applied to the input image data to detect and classify objects using 2D bounding boxes and semantic labels.</p></list-item>
<list-item><label>2.</label><p>The detected 2D object information is projected into a Bird&#x2019;s Eye View (BEV) coordinate system using a transformation algorithm. Each object&#x2019;s presence is represented as a probabilistic elliptical region, forming a pseudo-2D map that indicates likely object locations.</p></list-item>
<list-item><label>3.</label><p>LiDAR-based object tracking data is also used to reinforce the pseudo-2D map with further candidate object regions.</p></list-item>
<list-item><label>4.</label><p>This pseudo-2D map is then processed by a dedicated Pseudo Map Feature Net to extract semantic features.</p></list-item>
<list-item><label>5.</label><p>The resulting feature map is concatenated along the channel axis with the pseudo-image generated by the Pillar Feature Net of PointPillars.</p></list-item>
<list-item><label>6.</label><p>The combined feature map is passed through the backbone network to perform final 3D object detection.</p></list-item>
</list></p>
<p>Through this fusion strategy, ExistenceMap-PointPillars achieved a &#x002B;4.19% increase in mean Average Precision (mAP) over the baseline PointPillars on its custom dataset, and also reduced false positives. One notable limitation, however, is the increased computational cost of the additional deep-network object detection pass on the image data.</p>
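<p>The probabilistic elliptical regions of step 2 can be illustrated with a short sketch. This is not the authors' projection code; the Gaussian falloff, grid size, and cell size used here are assumptions:</p>

```python
import numpy as np

def existence_map(centers, sigmas, grid=(100, 100), cell=0.5):
    """Rasterize object candidates as probabilistic elliptical regions on a
    BEV grid; each cell keeps the maximum probability over all candidates.
    centers: list of (x, y) positions in meters; sigmas: per-axis spreads."""
    H, W = grid
    ys, xs = np.mgrid[0:H, 0:W]
    bev = np.zeros(grid, dtype=np.float32)
    for (cx, cy), (sx, sy) in zip(centers, sigmas):
        # squared Mahalanobis-style distance defines the ellipse shape
        d = ((xs * cell - cx) / sx) ** 2 + ((ys * cell - cy) / sy) ** 2
        bev = np.maximum(bev, np.exp(-0.5 * d))
    return bev
```

<p>The resulting map peaks at 1.0 at each object center and decays elliptically, which is the kind of soft prior the Pseudo Map Feature Net then consumes.</p>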
</sec>
<sec id="s2_1_3">
<label>2.1.3</label>
<title>TM3DOD</title>
<p>TM3DOD, shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, is a model designed to effectively leverage temporal information between consecutive LiDAR frames to enhance the detection performance of dynamic objects [<xref ref-type="bibr" rid="ref-13">13</xref>]. Conventional single-frame approaches, such as PointPillars, do not account for inter-frame object motion, limiting their accuracy in predicting the positions of moving objects. To address this limitation, TM3DOD introduces the Temporal Voxel Encoder (TVE) and Motion-Aware Feature Aggregation Network (MFANet) modules, which improve detection performance through attention-based fusion of motion information across consecutive BEV features.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Overall architecture of the proposed TM3DOD. Reprinted with permission from Reference [<xref ref-type="bibr" rid="ref-13">13</xref>]. &#x00A9; 2024 by Park et al. Licensee MDPI, Basel, Switzerland</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-5.tif"/>
</fig>
<p>The core components of TM3DOD are as follows:</p>
<p>1. Temporal Voxel Encoder (TVE)
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>Learns the relationships among point sets accumulated along the temporal axis within the same spatial voxel, thereby generating temporal voxel features.</p></list-item>
</list></p>
<p>2. Motion-Aware Feature Aggregation Network (MFANet)
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>Extracts motion features from consecutive BEV feature maps and performs attention-based weighted fusion to produce motion-aware BEV features.</p></list-item>
</list></p>
<p>3. Detection Head
<list list-type="simple">
<list-item><label>&#x2003;&#x2013;</label><p>An anchor-free detector that simultaneously predicts center heatmaps and 3D bounding boxes from the BEV features.</p></list-item>
</list></p>
<p>TM3DOD enhances dynamic object detection performance by directly utilizing temporal information. In addition, it effectively mitigates false detections that frequently occur in single-frame-based models. However, this approach increases computational complexity due to the frame alignment process, and its ability to represent motion features is limited in low-speed scanning environments where temporal resolution is insufficient.</p>
</sec>
<sec id="s2_1_4">
<label>2.1.4</label>
<title>L4DR</title>
<p>L4DR [<xref ref-type="bibr" rid="ref-14">14</xref>] is a state-of-the-art multi-sensor model that fuses LiDAR and 4D radar data to achieve robust 3D object detection under various weather conditions. While LiDAR provides high spatial resolution, its signals are significantly attenuated in adverse environments such as rain, snow, and fog. In contrast, radar offers lower spatial resolution but provides accurate velocity and range information. L4DR effectively combines the complementary characteristics of these two sensors, ensuring high detection performance even in harsh weather conditions. The architecture of L4DR is shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Overall architecture of the proposed L4DR. Reprinted with permission from Reference [<xref ref-type="bibr" rid="ref-14">14</xref>]. 2025, Huang et al.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-6.tif"/>
</fig>
<p>The core components of L4DR are as follows:
<list list-type="simple">
<list-item><label>1.</label><p>Foreground-Aware Denoising (FAD)
<list list-type="simple">
<list-item><label>-</label><p>A module that removes noise from Radar and LiDAR data prior to fusion, thereby enhancing the overall fusion quality.</p></list-item>
</list></p></list-item>
<list-item><label>2.</label><p>Multi-Scale Gated Fusion (MSGF)
<list list-type="simple">
<list-item><label>-</label><p>Selectively fuses features at different scales through a gated network, enabling adaptive integration across multiple feature resolutions.</p></list-item>
</list></p></list-item>
<list-item><label>3.</label><p>Inter-Modal &#x0026; Intra-Modal Backbone (IM<sup>2</sup>)
<list list-type="simple">
<list-item><label>-</label><p>A backbone architecture designed to extract features not only across different modalities (LiDAR and 4D Radar) but also within the same modality in parallel, improving both inter- and intra-modal feature representation.</p></list-item>
</list></p></list-item>
</list></p>
<p>L4DR leverages the velocity information from 4D radar to maintain high recall and precision even under adverse weather conditions, resulting in improved detection reliability compared to LiDAR-only systems. However, it is sensitive to Radar&#x2013;LiDAR calibration errors, and the overall system complexity increases due to the processing of multi-sensor inputs.</p>
<p>The key strengths and limitations of each model are summarized in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison of representative LiDAR-based 3D object detection models</title>
</caption>
<table>
<colgroup>
<col align="center" width="16mm"/>
<col align="center" width="24mm"/>
<col align="center" width="35mm"/>
<col align="center" width="35mm"/>
<col align="center" width="35mm"/> </colgroup>
<thead>
<tr>
<th></th>
<th>Input</th>
<th>Core structure</th>
<th>Advantages</th>
<th>Limitations</th>
</tr>
</thead>
<tbody>
<tr>
<td>PointPillars</td>
<td>Single-Frame LiDAR</td>
<td>Projects point clouds into vertical pillars &#x2192; processed by a 2D CNN backbone</td>
<td>High inference speed (&#x003E;20 Hz), simple architecture, efficient BEV representation</td>
<td>Lacks temporal information due to single-frame input, vulnerable under adverse weather</td>
</tr>
<tr>
<td>ExistenceMap-PointPillars</td>
<td>LiDAR &#x002B; Camera</td>
<td>PointPillars framework combined with an object existence map</td>
<td>Improved detection accuracy through multi-sensor fusion, reduced false positives</td>
<td>Requires precise LiDAR-camera calibration and additional sensor synchronization, increases computational load and slightly reduces real-time performance</td>
</tr>
<tr>
<td>TM3DOD</td>
<td>Sequential LiDAR Frames</td>
<td>Temporal voxel encoder (TVE) &#x002B; motion-aware feature aggregation network (MFANet) with attention-based temporal fusion</td>
<td>Encodes voxel-level temporal correlations, enhances dynamic-object detection and temporal consistency</td>
<td>Computational cost increases due to multi-frame alignment; temporal feature representation becomes limited under low-frame-rate LiDAR or slow-scanning conditions</td>
</tr>
<tr>
<td>L4DR</td>
<td>LiDAR &#x002B; 4D Radar</td>
<td>Inter-/intra-modal backbone (IM<sup>2</sup>) &#x002B; foreground-aware denoising (FAD) &#x002B; multi-scale gated fusion (MSGF)</td>
<td>Robust in adverse weather, effective noise suppression, complementary fusion of LiDAR and radar features</td>
<td>Requires precise radar&#x2013;LiDAR calibration and data synchronization, system complexity increases due to multi-sensor fusion operations</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Empirical Approaches for LiDAR Perception</title>
<sec id="s2_2_1">
<label>2.2.1</label>
<title>Patchwork&#x002B;&#x002B;</title>
<p>To effectively apply clustering algorithms to LiDAR-based 3D point cloud data for object detection, it is essential to first remove points corresponding to the ground surface. Patchwork&#x002B;&#x002B; is a ground segmentation algorithm specifically designed for 3D LiDAR point clouds and was proposed by Lee et al. [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p><xref ref-type="fig" rid="fig-7">Fig. 7</xref> illustrates an example in which Patchwork&#x002B;&#x002B; is applied to a single frame from the KITTI 3D Object Detection dataset, successfully filtering out ground points. <xref ref-type="fig" rid="fig-8">Fig. 8</xref> presents the overall architecture of the Patchwork&#x002B;&#x002B; algorithm.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Patchwork&#x002B;&#x002B; application example</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-7.tif"/>
</fig><fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Structure of Patchwork&#x002B;&#x002B;. Reprinted with permission from Reference [<xref ref-type="bibr" rid="ref-10">10</xref>]. &#x00A9; 2022, IEEE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-8.tif"/>
</fig>
<p>Through the above process, Patchwork&#x002B;&#x002B; effectively separates ground and non-ground points in a 3D LiDAR point cloud. Unlike traditional ground segmentation methods that rely on fixed sensor height thresholds, Patchwork&#x002B;&#x002B; is particularly well-suited for complex terrains, such as steep slopes or environments with multiple ground height levels, where conventional methods often fail. It offers both high segmentation accuracy and fast processing speed, making it suitable for real-time applications.</p>
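<p>Patchwork&#x002B;&#x002B; itself performs region-wise, adaptive ground fitting. As a minimal point of comparison with the same input/output contract (points in, non-ground mask out), a naive single-plane RANSAC ground filter&#x2014;the kind of fixed-threshold baseline Patchwork&#x002B;&#x002B; improves upon&#x2014;can be sketched as:</p>

```python
import numpy as np

def ransac_ground(points, n_iter=200, threshold=0.2, seed=0):
    """Naive single-plane RANSAC ground filter over an (N, 3) point cloud.
    Returns a boolean mask that is True for non-ground points. A single
    global plane fails on slopes and multi-level terrain, which is exactly
    the case Patchwork++ is designed to handle."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample
            continue
        normal /= norm
        dist = np.abs((points - sample[0]) @ normal)  # point-to-plane distance
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return ~best_inliers  # True = non-ground (kept for clustering)
```
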
</sec>
<sec id="s2_2_2">
<label>2.2.2</label>
<title>DBSCAN</title>
<p>DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that utilizes two parameters: <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> (the neighborhood radius) and the minimum number of points required to form a cluster [<xref ref-type="bibr" rid="ref-5">5</xref>]. When applying DBSCAN to LiDAR-based 3D point cloud data, it is essential to first remove ground points to ensure meaningful clustering of above-ground objects. Therefore, in this study, DBSCAN is applied to point clouds in which ground points have been filtered out using Patchwork&#x002B;&#x002B;.</p>
<p><xref ref-type="fig" rid="fig-9">Fig. 9</xref> shows the result of applying DBSCAN to a single frame from the KITTI 3D Object Detection dataset after ground removal via Patchwork&#x002B;&#x002B;. Each identified cluster is visualized in a different color, while noise points&#x2014;those not assigned to any cluster&#x2014;are displayed in black.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Example of DBSCAN</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-9.tif"/>
</fig>
<p>While DBSCAN offers several advantages&#x2014;such as robustness to noise and the ability to identify clusters without requiring the number of clusters to be predefined&#x2014;it also has notable limitations. Specifically, it does not assign semantic class labels to clusters, nor does it capture detailed structural characteristics within each cluster, such as the spatial distribution of points. Moreover, DBSCAN tends to perform poorly in complex scenarios where objects are partially occluded by others or when the viewpoint of the sensor changes significantly, leading to limited generalization capability in real-world applications.</p>
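<p>As a concrete reference for the two-parameter definition above, a minimal (unoptimized) DBSCAN can be written directly in NumPy; a production implementation would use a spatial index such as a k-d tree instead of the full pairwise distance matrix:</p>

```python
import numpy as np

def dbscan(points, eps=0.8, min_pts=4):
    """Minimal DBSCAN over an (N, D) array: returns labels[i] = cluster id
    (0, 1, ...) or -1 for noise. eps is the neighborhood radius; min_pts is
    the minimum neighborhood size (including the point itself) for a core point."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                       # skip non-core or already-expanded points
        visited[i] = True
        labels[i] = cluster
        queue = [i]                        # expand the cluster from core point i
        while queue:
            j = queue.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster    # border or core point joins the cluster
                if not visited[k] and len(neighbors[k]) >= min_pts:
                    visited[k] = True
                    queue.append(k)        # only core points are expanded further
        cluster += 1
    return labels
```

<p>Note that the function returns only geometric cluster ids, illustrating the limitation discussed above: no semantic class is attached to any cluster.</p>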
</sec>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Method</title>
<p>In this paper, we propose EG-PointPillars, a hybrid deep learning framework that integrates expert-driven ground segmentation and clustering algorithms into the internal pipeline of a deep neural network. By encoding this expert knowledge at the feature level, the framework mitigates the data dependency and limited generalization of conventional LiDAR-based deep learning models.</p>
<p>Although various clustering and deep learning techniques exist, this work specifically employs Patchwork&#x002B;&#x002B; and DBSCAN as representative expert-based LiDAR preprocessing algorithms and adopts PointPillars as the baseline 3D object detection model. DBSCAN and PointPillars are among the most well-established methods in their respective domains; however, the proposed framework is flexible and can be extended to other combinations of clustering and deep learning algorithms in future work. In this study, we aim to demonstrate that integrating these representative approaches can effectively improve the performance and robustness of existing deep learning models.</p>
<p>The overall pipeline of the proposed method is structured as follows:
<list list-type="simple">
<list-item><label>1.</label><p>Clustering-Based Preprocessing
<list list-type="simple">
<list-item><label>-</label><p>Ground points are first removed from the LiDAR 3D point cloud using the Patchwork&#x002B;&#x002B; algorithm [<xref ref-type="bibr" rid="ref-10">10</xref>], followed by DBSCAN clustering on the remaining non-ground points to generate object candidates. These expert-based modules serve as preprocessing steps, providing structured inputs for the proposed hybrid learning model.</p></list-item>
</list></p></list-item>
<list-item><label>2.</label><p>Cluster Map Feature Representation and BEV Transformation
<list list-type="simple">
<list-item><label>-</label><p>A novel cluster map feature representation is developed to convert DBSCAN outputs into a BEV-formatted feature map. This transformation incorporates the geometric size and spatial distribution of each cluster, enabling more effective fusion with the deep feature space of the backbone network.</p></list-item>
</list></p></list-item>
<list-item><label>3.</label><p>PointPillars Architecture Enhancement
<list list-type="simple">
<list-item><label>-</label><p>The backbone and neck components of the original PointPillars model are redesigned to accommodate the new feature representation. The modified FPN structure enhances multi-scale feature aggregation, improving detection robustness for small or occluded objects.</p></list-item>
</list></p></list-item>
<list-item><label>4.</label><p>Fusion Strategy Design
<list list-type="simple">
<list-item><label>-</label><p>Multiple fusion methods are explored to integrate the expert-derived cluster features with the deep features. This strategy ensures smooth and effective information flow from the clustering-based preprocessing to the detection head, aligning both expert priors and learned representations.</p></list-item>
</list></p></list-item>
<list-item><label>5.</label><p>Development of Variants and Evaluation
<list list-type="simple">
<list-item><label>-</label><p>Four variants of the proposed Expert&#x2013;Neural Network Collaborative LiDAR Perception model are developed and evaluated. These variants are designed to validate the contribution of each proposed component and demonstrate the overall performance improvement achieved through expert-guided hybridization.</p></list-item>
</list></p></list-item>
</list></p>
<sec id="s3_1">
<label>3.1</label>
<title>Basic Structure of Proposed Algorithm</title>
<sec id="s3_1_1">
<label>3.1.1</label>
<title>3D Object Detection Pipeline</title>
<p>All four versions of the proposed method share a common 3D object detection pipeline, which is based on a modified neck structure of the original PointPillars framework. The improved neck module of our proposed algorithm adopts a Feature Pyramid Network (FPN) [<xref ref-type="bibr" rid="ref-15">15</xref>] design to effectively aggregate multi-scale features. Specifically, the input feature maps&#x2014;of sizes (8C, H/8, W/8), (4C, H/4, W/4), and (2C, H/2, W/2)&#x2014;are fused into a single high-resolution feature representation.</p>
<p><xref ref-type="fig" rid="fig-10">Fig. 10</xref> illustrates the revised neck structure built upon PointPillars, as applied in our proposed architecture. The pseudo code of the proposed neck structure is given in Algorithm 1. <xref ref-type="fig" rid="fig-11">Fig. 11</xref> visualizes the change in feature map sizes across each stage of the network, highlighting the multi-scale aggregation process enabled by the FPN structure.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Improved neck structure based on PointPillars</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-10.tif"/>
</fig><fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Feature map size according to Backbone-neck stage</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-11.tif"/>
</fig>
<fig id="fig-21">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-21.tif"/>
</fig>
<p>The modified neck structure receives three feature maps (<inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>) from the backbone, corresponding to different spatial resolutions. The fusion process, inspired by the Feature Pyramid Network (FPN), can be mathematically expressed as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>H</mml:mi><mml:mn>8</mml:mn></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mn>8</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>H</mml:mi><mml:mn>4</mml:mn></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mn>4</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>H</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mi>F</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>I</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>F</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2295;</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>F</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>I</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>F</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2295;</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>L</mml:mi><mml:mi>U</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>B</mml:mi><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>This modified neck structure replaces the transposed 2D convolutions used in the original PointPillars with bilinear interpolation, thereby reducing the number of learnable parameters and computational cost. Despite this simplification, the design still enables effective fusion of multi-scale features, maintaining high detection performance while improving overall efficiency.</p>
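<p>To make the fusion in Eqs. (1)-(8) concrete, the following minimal NumPy sketch traces the shape flow of the modified neck. The toy channel count, spatial sizes, and random weights are illustrative assumptions; a real implementation would use learned convolution kernels, and the final 3x3 Conv-BN-ReLU of Eq. (8) is omitted for brevity.</p>

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution as channel mixing: w (Cout, Cin), x (Cin, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x_bilinear(x):
    """2x bilinear upsampling of x (C, H, W) -> (C, 2H, 2W), align-corners style."""
    C, H, W = x.shape
    ys = np.linspace(0.0, H - 1, 2 * H)
    xs = np.linspace(0.0, W - 1, 2 * W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1); wy = ys - y0
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1); wx = xs - x0
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy)[None, :, None] + bot * wy[None, :, None]

# Toy sizes (assumptions): C = 4 channels on a 32x32 BEV grid.
C, H, W = 4, 32, 32
rng = np.random.default_rng(0)
F1 = rng.standard_normal((8 * C, H // 8, W // 8))   # deepest backbone output
F2 = rng.standard_normal((4 * C, H // 4, W // 4))
F3 = rng.standard_normal((2 * C, H // 2, W // 2))

# Eqs. (1)-(3): project every scale to 2C channels with 1x1 convolutions.
F1p = conv1x1(F1, rng.standard_normal((2 * C, 8 * C)))
F2p = conv1x1(F2, rng.standard_normal((2 * C, 4 * C)))
F3p = conv1x1(F3, rng.standard_normal((2 * C, 2 * C)))

# Eqs. (4)-(7): upsample with bilinear interpolation and add, coarse to fine.
F_fused1 = upsample2x_bilinear(F1p) + F2p            # (2C, H/4, W/4)
F_fused2 = upsample2x_bilinear(F_fused1) + F3p       # (2C, H/2, W/2)
print(F_fused2.shape)   # -> (8, 16, 16)
```
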
</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Cluster 2D Pseudo-Map Branch</title>
<p>The Cluster 2D Pseudo-Map Branch proposed in this section is designed to generate a cluster pseudo-image based on the clustering results obtained from DBSCAN. This branch takes the clustered LiDAR 3D point cloud as input and performs heuristic classification of each cluster into one of three object classes: vehicle, two-wheeler, or pedestrian. Based on this classification, a 2D pseudo-map is generated, which is then processed by the Cluster Map Feature Net to produce the final cluster pseudo-image.</p>
<p>To heuristically classify each cluster, a class assignment criterion is required. In this study, the criterion is based on the diagonal length of the ground truth 3D bounding box projected onto the XY-plane, as illustrated in <xref ref-type="fig" rid="fig-12">Fig. 12</xref>. <xref ref-type="fig" rid="fig-13">Fig. 13</xref> shows a histogram of the diagonal length distributions for all objects in the KITTI 3D Object Detection dataset. In the histogram, pedestrians are marked in orange, two-wheelers in green, and vehicles in blue. As shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>, the three classes can be effectively separated based on the diagonal length of their 3D bounding boxes in the XY-plane, providing a reliable heuristic for cluster classification.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>The diagonal length of the <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>X</mml:mi><mml:mi>Y</mml:mi></mml:math></inline-formula> plane of the 3D bounding box used as a classification criterion</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-12.tif"/>
</fig><fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>Histogram distribution of <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:math></inline-formula> plane diagonal length of each class</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-13.tif"/>
</fig>
<p>However, the size of a LiDAR point cloud cluster does not exactly match the dimensions of its corresponding 3D bounding box. Therefore, the diagonal length thresholds used for classification must be appropriately adjusted to better fit the characteristics of the clustered data. In this study, we define heuristic classification rules for clusters based on empirical thresholds, as summarized in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Criteria of class classification</title>
</caption>
<table>
<colgroup>
<col align="center" width="28mm"/>
<col align="center" width="33mm"/>
<col align="center" width="39mm"/> </colgroup>
<thead>
<tr>
<th>Class</th>
<th>Min. (m)</th>
<th>Max. (m)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Car</td>
<td>3.00</td>
<td>6.40</td>
</tr>
<tr>
<td>Cyclist</td>
<td>1.72</td>
<td>2.99</td>
</tr>
<tr>
<td>Pedestrian</td>
<td>0.40</td>
<td>1.71</td>
</tr>
</tbody>
</table>
</table-wrap>
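<p>As an illustration, the Table 2 criteria can be applied to a cluster's axis-aligned XY extent in a few lines of NumPy; the function name and the decision to label out-of-range diagonals as noise are assumptions of this sketch.</p>

```python
import numpy as np

def classify_cluster(pts_xy):
    """Assign a heuristic class from the XY-plane diagonal of a cluster (Table 2 thresholds)."""
    extent = pts_xy.max(axis=0) - pts_xy.min(axis=0)   # axis-aligned width and length
    diag = float(np.hypot(extent[0], extent[1]))
    if 3.00 <= diag <= 6.40:
        return "car"
    if 1.72 <= diag <= 2.99:
        return "cyclist"
    if 0.40 <= diag <= 1.71:
        return "pedestrian"
    return "noise"                                     # outside all class ranges
```

For example, a cluster spanning roughly 0.5 m by 0.5 m has a diagonal of about 0.71 m and is labeled a pedestrian.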
<p><xref ref-type="fig" rid="fig-14">Fig. 14</xref> presents a visualization of the cluster classification results based on the proposed criteria. In this figure, clusters classified as vehicles are shown in blue, two-wheelers in green, pedestrians in red, and noise in black.</p>
<fig id="fig-14">
<label>Figure 14</label>
<caption>
<title>Results of heuristically classifying classes for each cluster</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-14.tif"/>
</fig>
<p>In the next stage, a cluster 2D pseudo-map is generated using the previously classified cluster information. Each cluster is first enclosed by a fitted ellipse, and its color is assigned based on the inferred class: blue for vehicles, green for two-wheelers, and red for pedestrians.</p>
<p>The resulting pseudo-map is rendered in the Bird&#x2019;s Eye View (BEV) coordinate space, with the same spatial dimensions as the pseudo-image used in the original PointPillars framework. The map consists of three channels in the BGR color format to encode the class-specific appearance of each cluster.</p>
<p><xref ref-type="fig" rid="fig-15">Fig. 15</xref> shows an example of a generated cluster 2D pseudo-map, illustrating the spatial distribution and semantic coloring of clustered objects.</p>
<fig id="fig-15">
<label>Figure 15</label>
<caption>
<title>Example of cluster 2D pseudo-map</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-15.tif"/>
</fig>
<p>The generated cluster 2D pseudo-map is then processed by the Cluster Map Feature Net to produce a corresponding cluster pseudo-image. As shown in <xref ref-type="fig" rid="fig-16">Fig. 16</xref>, the Cluster Map Feature Net consists of a series of 2D convolutional layers, each followed by Batch Normalization and ReLU activation functions.</p>
<fig id="fig-16">
<label>Figure 16</label>
<caption>
<title>Cluster map feature net</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-16.tif"/>
</fig>
<p>The resulting cluster pseudo-image has a shape of H &#x00D7; W &#x00D7; 64, which is identical to the pseudo-image produced by the PointPillars&#x2019; Pillar Feature Net, allowing for seamless integration within the overall detection pipeline.</p>
<p>The process by which the Cluster Map Feature Net transforms the input cluster pseudo-image <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>F</mml:mi></mml:math></inline-formula> can be mathematically formulated as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>L</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>B</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:mi>F</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
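<p>Eq. (9) can be sketched in NumPy as follows; the per-channel batch statistics and random kernels are stand-ins for the learned BatchNorm parameters and convolution weights of the actual Cluster Map Feature Net.</p>

```python
import numpy as np

def conv3x3_same(x, w):
    """3x3 convolution with 'same' zero padding: x (Cin, H, W), w (Cout, Cin, 3, 3)."""
    Cin, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def bn_relu(x, eps=1e-5):
    """Per-channel normalization followed by ReLU (gamma = 1, beta = 0 for brevity)."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return np.maximum((x - mu) / np.sqrt(var + eps), 0.0)

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 16, 16))                 # BGR cluster 2D pseudo-map (toy size)
Fp = bn_relu(conv3x3_same(F, rng.standard_normal((64, 3, 3, 3))))
print(Fp.shape)   # -> (64, 16, 16)
```
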
<p>The overall processing pipeline of the Cluster 2D Pseudo-Map Branch can be summarized as below pseudo code in Algorithm 2:</p>
<fig id="fig-22">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-22.tif"/>
</fig>
</sec>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Expert Guided PointPillars&#x2014;EG-PointPillars</title>
<p>In this subsection, we present four different versions of EG-PointPillars, each integrating the previously described algorithms into the original PointPillars framework in different ways. The details of how each algorithm is incorporated are provided below.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>EG-PointPillars Version 1</title>
<p>EG-PointPillars ver. 1 integrates two key components into the baseline PointPillars framework: (1) a preprocessing step using the Patchwork&#x002B;&#x002B; algorithm and (2) an enhanced 3D object detection pipeline with a modified neck structure.</p>
<p>The 3D object detection procedure for ver. 1 is as follows:
<list list-type="simple">
<list-item><label>1.</label><p>The input LiDAR 3D point cloud is processed with Patchwork&#x002B;&#x002B; to separate ground and non-ground points.</p></list-item>
<list-item><label>2.</label><p>The resulting non-ground point cloud is passed to the Pillar Feature Net for feature extraction.</p></list-item>
<list-item><label>3.</label><p>The extracted features are forwarded through the improved detection pipeline with the modified neck structure to perform 3D object detection.</p></list-item>
</list></p>
<p><xref ref-type="fig" rid="fig-17">Fig. 17</xref> illustrates the overall architecture of EG-PointPillars ver. 1.</p>
<fig id="fig-17">
<label>Figure 17</label>
<caption>
<title>Overall structure of EG-PointPillars ver. 1</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-17.tif"/>
</fig>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>EG-PointPillars Version 2</title>
<p>EG-PointPillars ver. 2 extends ver. 1 by incorporating an additional clustering step using DBSCAN, resulting in a total of three integrated components: (1) Patchwork&#x002B;&#x002B; for ground removal, (2) an enhanced 3D object detection pipeline with a modified neck structure, and (3) DBSCAN-based clustering.</p>
<p>The 3D object detection procedure for ver. 2 is as follows:
<list list-type="simple">
<list-item><label>1.</label><p>The input LiDAR 3D point cloud is first processed with Patchwork&#x002B;&#x002B; to separate ground and non-ground points.</p></list-item>
<list-item><label>2.</label><p>DBSCAN is applied to the non-ground point cloud to generate clusters.</p></list-item>
<list-item><label>3.</label><p>Each point in the clustered output is assigned a cluster ID as an additional feature:
<list list-type="simple">
<list-item><label>-</label><p>Points belonging to a valid cluster are assigned positive integer IDs starting from 1.</p></list-item>
<list-item><label>-</label><p>Noise points are assigned a cluster <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>.</p></list-item>
</list></p></list-item>
<list-item><label>4.</label><p>As a result, each point in the processed non-ground point cloud is represented as a 5D feature vector:
<disp-formula id="ueqn-11"><mml:math id="mml-ueqn-11" display="block"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi></mml:math></inline-formula> denote the 3D spatial coordinates, <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>r</mml:mi></mml:math></inline-formula> is the reflectance intensity, and <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>I</mml:mi><mml:mi>D</mml:mi></mml:math></inline-formula> is the cluster identifier.</p></list-item>
<list-item><label>5.</label><p>The preprocessed non-ground point cloud with appended cluster IDs is fed into the Pillar Feature Net.</p></list-item>
</list></p>
<p>Finally, the extracted features are processed through the modified 3D object detection pipeline using the enhanced neck structure.</p>
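<p>Steps 3-4 above amount to remapping DBSCAN labels (where -1 conventionally denotes noise) to the 1-based IDs used here and appending them as a fifth feature. A minimal sketch, assuming cluster labels have already been produced by a DBSCAN implementation:</p>

```python
import numpy as np

def append_cluster_ids(points, labels):
    """points: (N, 4) array of (x, y, z, r); labels: DBSCAN labels with -1 = noise.
    Returns an (N, 5) array of (x, y, z, r, ID) with noise mapped to ID = 0."""
    ids = np.where(labels < 0, 0, labels + 1).astype(points.dtype)
    return np.hstack([points, ids[:, None]])

pts = np.zeros((4, 4))
labels = np.array([0, 1, -1, 0])          # two clusters plus one noise point
out = append_cluster_ids(pts, labels)
print(out[:, 4])   # -> [1. 2. 0. 1.]
```
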
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>EG-PointPillars Version 3</title>
<p>EG-PointPillars ver. 3 incorporates the same three key components as ver. 2&#x2014;Patchwork&#x002B;&#x002B;, DBSCAN, and a modified 3D object detection pipeline. However, the main distinction lies in the input to the Pillar Feature Net: unlike ver. 2, ver. 3 includes both ground and non-ground points.</p>
<p>The 3D object detection procedure for ver. 3 is as follows:
<list list-type="simple">
<list-item><label>1.</label><p>The input LiDAR 3D point cloud is first processed with Patchwork&#x002B;&#x002B; to separate ground and non-ground points, and both sets are retained for subsequent processing.</p></list-item>
<list-item><label>2.</label><p>DBSCAN is applied only to the non-ground point cloud to perform clustering.</p></list-item>
<list-item><label>3.</label><p>Clustered points are assigned a cluster ID as an additional feature:
<list list-type="simple">
<list-item><label>-</label><p>Points belonging to a valid cluster are assigned positive integer IDs starting from 1.</p></list-item>
<list-item><label>-</label><p>Noise points are assigned a cluster <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>.</p></list-item>
</list></p></list-item>
<list-item><label>4.</label><p>All ground points are also assigned a cluster <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, identical to noise points.</p></list-item>
<list-item><label>5.</label><p>The ground and non-ground point clouds, now augmented with cluster IDs, are merged into a single point cloud.</p></list-item>
<list-item><label>6.</label><p>Each point in the merged cloud is represented as a 5D feature vector:
<disp-formula id="ueqn-131"><mml:math id="mml-ueqn-131" display="block"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>z</mml:mi></mml:math></inline-formula> denote the 3D spatial coordinates, <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>r</mml:mi></mml:math></inline-formula> is the reflectance intensity, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>I</mml:mi><mml:mi>D</mml:mi></mml:math></inline-formula> is the cluster identifier.</p></list-item>
<list-item><label>7.</label><p>The merged point cloud is then passed into the Pillar Feature Net.</p></list-item>
<list-item><label>8.</label><p>The resulting features are processed through the enhanced 3D detection pipeline using the modified neck structure.</p></list-item>
</list></p>
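<p>The ground-point handling of steps 4-5 reduces to padding the ground points with a zero (noise) ID and stacking the two point sets; a short sketch with an illustrative function name:</p>

```python
import numpy as np

def merge_ground(nonground_5d, ground_4d):
    """Give ground points the noise ID (0) and merge with clustered non-ground points."""
    zeros = np.zeros((len(ground_4d), 1), dtype=ground_4d.dtype)
    ground_5d = np.hstack([ground_4d, zeros])        # (x, y, z, r, ID = 0)
    return np.vstack([nonground_5d, ground_5d])

merged = merge_ground(np.ones((3, 5)), np.ones((2, 4)))
print(merged.shape)   # -> (5, 5)
```
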
<p><xref ref-type="fig" rid="fig-18">Fig. 18</xref> illustrates the overall architecture shared by EG-PointPillars ver. 2 and 3.</p>
<fig id="fig-18">
<label>Figure 18</label>
<caption>
<title>Overall structure of EG-PointPillars ver. 2 &#x0026; 3</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-18.tif"/>
</fig>
</sec>
<sec id="s3_2_4">
<label>3.2.4</label>
<title>EG-PointPillars Version 4</title>
<p>EG-PointPillars ver. 4 incorporates all four components proposed in this study: (1) ground removal using Patchwork&#x002B;&#x002B;, (2) clustering using DBSCAN, (3) an enhanced 3D object detection pipeline with a modified neck, and (4) the cluster 2D pseudo-map branch. This version represents the most comprehensive integration of expert-based and deep learning-based techniques.</p>
<p>The 3D object detection procedure for ver. 4 is as follows:
<list list-type="simple">
<list-item><label>1.</label><p>The input LiDAR 3D point cloud is processed using Patchwork&#x002B;&#x002B; to separate ground and non-ground points. The non-ground points are passed to both DBSCAN and the Pillar Feature Net.</p></list-item>
<list-item><label>2.</label><p>DBSCAN is applied to the non-ground point cloud to generate clusters.</p></list-item>
<list-item><label>3.</label><p>Each cluster is heuristically classified into one of three classes&#x2014;vehicle, two-wheeler, or pedestrian&#x2014;based on the diagonal length of its bounding box in the XY-plane.</p></list-item>
<list-item><label>4.</label><p>Using the location and class of each cluster, a cluster 2D pseudo-map is generated in BEV format and input into the Cluster Map Feature Net.</p></list-item>
<list-item><label>5.</label><p>The Cluster Map Feature Net produces a cluster pseudo-image, which is then concatenated along the channel axis with the pseudo-image output from the Pillar Feature Net.</p></list-item>
<list-item><label>6.</label><p>The combined pseudo-image is forwarded to the backbone of the 3D object detection pipeline to perform final detection.</p></list-item>
</list></p>
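<p>The channel-wise fusion in step 5 is a plain concatenation of the two 64-channel pseudo-images; a one-line sketch in channel-first layout, with an assumed BEV grid size:</p>

```python
import numpy as np

H, W = 496, 432                                   # assumed BEV grid size
pillar_img = np.zeros((64, H, W))                 # output of the Pillar Feature Net
cluster_img = np.zeros((64, H, W))                # output of the Cluster Map Feature Net
combined = np.concatenate([pillar_img, cluster_img], axis=0)   # (128, H, W) into backbone
print(combined.shape)   # -> (128, 496, 432)
```
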
<p><xref ref-type="fig" rid="fig-19">Fig. 19</xref> shows the complete architecture of EG-PointPillars ver. 4.</p>
<fig id="fig-19">
<label>Figure 19</label>
<caption>
<title>Overall structure of EG-PointPillars ver. 4</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-19.tif"/>
</fig>
<p>A summary of the architectural variations among the four EG-PointPillars versions is provided in <xref ref-type="table" rid="table-3">Table 3</xref>, which serves as a reference for the performance comparison presented in the following section.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison of the four versions of EG-PointPillars</title>
</caption>
<table>
<colgroup>
<col align="center" width="14mm"/>
<col align="center" width="50mm"/>
<col align="center" width="75mm"/> </colgroup>
<thead>
<tr>
<th>Version</th>
<th>Added/Modified modules</th>
<th>Key characteristics</th>
</tr>
</thead>
<tbody>
<tr>
<td>V1</td>
<td>Patchwork&#x002B;&#x002B;, modified neck</td>
<td>Uses only non-ground points for the pillar feature net</td>
</tr>
<tr>
<td>V2</td>
<td>Adds DBSCAN clustering to the v1 pipeline</td>
<td>Non-ground points &#x2192; DBSCAN &#x2192; cluster ID assigned &#x2192; PFN (5-D features x, y, z, r, ID)</td>
</tr>
<tr>
<td>V3</td>
<td>Same as v2 but changes PFN input policy</td>
<td>Both ground and non-ground points fed to PFN; ground points assigned the noise ID (0) and merged</td>
</tr>
<tr>
<td>V4</td>
<td>Patchwork&#x002B;&#x002B; &#x002B; DBSCAN &#x002B; modified neck &#x002B; cluster map feature net (cluster 2D pseudo-map branch)</td>
<td>Generates cluster-level pseudo-image (H &#x00D7; W &#x00D7; 64) and concatenates it with pillar pseudo-image before the backbone</td>
</tr>
</tbody>
</table>
</table-wrap>
 
</sec>
</sec>
</sec> 
<sec id="s4">
<label>4</label>
<title>Experimental Results</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Environments</title>
<p>The proposed EG-PointPillars models are evaluated through comparative experiments against the baseline PointPillars framework. To ensure a fair comparison, all models are trained and tested using the LiDAR data from the KITTI 3D Object Detection dataset.</p>
<p>For training, 6767 frames are selected from the total 7518 training frames, while the remaining 751 frames are used for validation during training. All experiments are evaluated on the full 7518-frame test set. Each model is trained for 300 epochs, and validation is performed every 5 epochs. The model with the best performance on the validation set is selected as the final model for testing.</p>
<p>To ensure consistency and reproducibility, all experiments&#x2014;including training and inference for both PointPillars and the four versions of EG-PointPillars&#x2014;are conducted under identical hardware conditions. The specifications of the computing environment are detailed in <xref ref-type="table" rid="table-4">Table 4</xref>, and all models utilize the same GPU resources throughout the experimental process.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Experimental environments</title>
</caption>
<table>
<colgroup>
<col align="center" width="18mm"/>
<col align="center" width="17mm"/>
<col align="center" width="85mm"/> </colgroup>
<thead>
<tr>
<th>Component</th>
<th>Quantity</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>1</td>
<td>AMD Ryzen Threadripper PRO 5955WX (16 cores, 32 threads)</td>
</tr>
<tr>
<td>GPU</td>
<td>4</td>
<td>NVIDIA GeForce RTX 4090 (82.58 TFLOPs)</td>
</tr>
<tr>
<td>RAM</td>
<td>4</td>
<td>DDR4 PC4-25600 (3200 MT/s), 64 GB</td>
</tr>
<tr>
<td>SSD</td>
<td>1</td>
<td>PCIe 4.0 Platinum P41 M.2 NVMe, 1 TB</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Data Set</title>
<p>The proposed EG-PointPillars models are developed based on the original PointPillars architecture, and both are trained and evaluated using the same methodology to enable a fair performance comparison. For this purpose, we utilize the 3D LiDAR data from the KITTI 3D Object Detection dataset, which was also used in the original PointPillars implementation.</p>
<p>The KITTI dataset was collected using a Velodyne HDL-64E LiDAR sensor and includes diverse driving scenarios such as urban streets, highways, and residential areas. It consists of 7481 labeled training frames and 7518 test frames, containing a total of 80,256 annotated 3D objects [<xref ref-type="bibr" rid="ref-16">16</xref>].</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Experimental Results with Ablation</title>
<p><xref ref-type="table" rid="table-5">Table 5</xref> presents the performance of the baseline PointPillars and the four proposed versions of EG-PointPillars, evaluated on the entire test set of the KITTI 3D Object Detection dataset after training completion.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Ablation performance comparison of each PointPillars</title>
</caption>
 
<table>
<colgroup>
<col align="center" width="20mm"/>
<col align="center" width="9mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="7mm"/>
<col align="center" width="17mm"/> </colgroup>
<thead>
<tr>
<th/>
<th colspan="3">Car AP@0.7</th>
<th colspan="3">Cyclist AP@0.5</th>
<th colspan="3">Pedestrian AP@0.5</th>
<th rowspan="2">mAP</th>
<th rowspan="2">Inference time (ms)</th>
</tr>
<tr>
<th/>
<th>Easy</th>
<th>Moderate</th>
<th>Hard</th>
<th>Easy</th>
<th>Moderate</th>
<th>Hard</th>
<th>Easy</th>
<th>Moderate</th>
<th>Hard</th>
</tr>
</thead>
<tbody>
<tr>
<td>PointPillars</td>
<td>84.00</td>
<td>72.89</td>
<td>68.24</td>
<td>80.57</td>
<td>61.96</td>
<td>58.32</td>
<td>52.66</td>
<td>47.09</td>
<td>43.21</td>
<td>63.21</td>
<td>13.1</td>
</tr>
<tr>
<td>EG-PointPillars ver. 1</td>
<td>86.94</td>
<td>76.06</td>
<td>68.77</td>
<td>79.53</td>
<td>61.85</td>
<td>58.18</td>
<td>56.12</td>
<td>50.96</td>
<td>46.56</td>
<td>65.00</td>
<td>13.9</td>
</tr>
<tr>
<td>EG-PointPillars ver. 2</td>
<td>86.16</td>
<td>76.01</td>
<td>68.69</td>
<td>82.79</td>
<td>65.31</td>
<td>62.01</td>
<td>55.68</td>
<td>50.05</td>
<td>44.68</td>
<td>65.71</td>
<td>24.4</td>
</tr>
<tr>
<td>EG-PointPillars ver. 3</td>
<td>84.23</td>
<td>74.82</td>
<td>68.13</td>
<td>81.37</td>
<td>64.81</td>
<td>61.37</td>
<td>55.60</td>
<td>50.81</td>
<td>45.47</td>
<td>65.18</td>
<td>32.5</td>
</tr>
<tr>
<td>EG-PointPillars ver. 4</td>
<td>86.69</td>
<td>76.44</td>
<td>74.93</td>
<td>81.21</td>
<td>66.74</td>
<td>61.91</td>
<td>56.73</td>
<td>52.31</td>
<td>46.92</td>
<td>67.09</td>
<td>46.4</td>
</tr>
</tbody>
</table>
</table-wrap>
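As a sanity check, each mAP in Table 5 is consistent with the arithmetic mean of the nine per-class, per-difficulty AP values in its row; for the baseline row:

```python
# Baseline PointPillars AP values from Table 5
# (Car AP@0.7, Cyclist AP@0.5, Pedestrian AP@0.5; Easy/Moderate/Hard).
aps = [84.00, 72.89, 68.24,   # Car
       80.57, 61.96, 58.32,   # Cyclist
       52.66, 47.09, 43.21]   # Pedestrian
mAP = sum(aps) / len(aps)
assert abs(mAP - 63.21) < 0.01  # matches the reported baseline mAP
```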
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Performance Evaluation</title>
<p>As shown in <xref ref-type="table" rid="table-5">Table 5</xref>, EG-PointPillars ver. 3 exhibited a slight decrease in mAP compared to ver. 2, despite adopting a more comprehensive input strategy. This performance drop is mainly attributed to the inclusion of ground points during feature encoding, which introduced additional noise into the pillar representations and weakened the cluster-level feature distinction.</p>

<p>EG-PointPillars ver. 4, which integrates all proposed modules, achieved the highest detection accuracy, recording a &#x002B;3.88% improvement in mAP over the baseline PointPillars. In particular, ver. 4 showed significant performance gains for the car-hard, cyclist-moderate, and pedestrian categories&#x2014;not only compared to PointPillars, but also in relation to the other EG-PointPillars variants. These results indicate that the Cluster 2D Pseudo-Map Branch plays a critical role in improving detection performance.</p>
<p>Ver. 1 also demonstrated meaningful performance gains with minimal architectural changes. By simply removing ground points using Patchwork&#x002B;&#x002B; and applying the modified neck structure, ver. 1 achieved a &#x002B;1.79% mAP improvement over PointPillars. This highlights the effectiveness of expert-based preprocessing even without additional components such as clustering or pseudo-maps.</p>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Runtime Analysis</title>
<p>In terms of inference speed, EG-PointPillars ver. 1 remained closest to the baseline, adding only 0.8 ms of runtime. Versions 2, 3, and 4, which incorporate the DBSCAN clustering step, exhibited higher computational costs. However, ver. 4, despite being the slowest of the proposed methods, maintained a processing rate of 21.55 Hz, which is sufficient for real-time 3D LiDAR-based object detection.</p>
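The throughput and overhead figures follow directly from the per-frame latencies in Table 5 (throughput in Hz is 1000 divided by the latency in ms); a quick check:

```python
# Per-frame inference times (ms) from Table 5.
times_ms = {"PointPillars": 13.1, "ver1": 13.9, "ver2": 24.4,
            "ver3": 32.5, "ver4": 46.4}
throughput_hz = {k: 1000.0 / t for k, t in times_ms.items()}
overhead_ms = {k: round(t - times_ms["PointPillars"], 1)
               for k, t in times_ms.items()}
assert abs(throughput_hz["ver4"] - 21.55) < 0.01  # ver. 4 stays above 20 Hz
assert overhead_ms["ver1"] == 0.8
assert overhead_ms["ver4"] == 33.3
```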
<p>A summary of the experimental results relative to the baseline PointPillars is provided in <xref ref-type="table" rid="table-6">Table 6</xref>.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Performance summary of 4 versions of EG PointPillars</title>
</caption>
<table>
<colgroup>
<col align="center" width="14mm"/>
<col align="center" width="29mm"/>
<col align="center" width="16mm"/>
<col align="center" width="71mm"/> </colgroup>
<thead>
<tr>
<th></th>
<th>mAP improvement</th>
<th>Runtime overhead</th>
<th>Strong point</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ver. 1</td>
<td>&#x002B;1.79%</td>
<td>&#x002B;0.8 ms</td>
<td>Best suited for applications requiring real-time processing in resource-constrained environments with minimal architectural changes</td>
</tr>
<tr>
<td>Ver. 2</td>
<td>&#x002B;2.50%</td>
<td>&#x002B;11.3 ms</td>
<td>Ideal for applications balancing accuracy and computational efficiency</td>
</tr>
<tr>
<td>Ver. 3</td>
<td>&#x002B;1.97%</td>
<td>&#x002B;19.4 ms</td>
<td></td>
</tr>
<tr>
<td>Ver. 4</td>
<td>&#x002B;3.88%</td>
<td>&#x002B;33.3 ms</td>
<td>Recommended when maximum detection performance is required regardless of computational cost</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-20">Fig. 20</xref> presents qualitative comparisons between the baseline PointPillars and EG-PointPillars ver. 4 for three representative test cases.</p>
<fig id="fig-20">
<label>Figure 20</label>
<caption>
<title>Comparison of results for PointPillars and EG-PointPillars ver. 4</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-20a.tif"/>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73330-fig-20b.tif"/>
</fig>
<p><list list-type="simple">
<list-item><label>-</label><p>Case 1: Two objects highlighted with pink circles are shown.
<list list-type="simple">
<list-item><label>1.</label><p>Object 1 was a vehicle that was missed by PointPillars but was accurately detected as a vehicle by EG-PointPillars.</p></list-item>
<list-item><label>2.</label><p>Object 2, corresponding to a traffic signal, was incorrectly classified as a pedestrian by PointPillars, whereas EG-PointPillars correctly identified it as noise, since it does not belong to any of the three target classes.</p></list-item>
</list></p></list-item>
<list-item><label>-</label><p>Case 2: Two additional examples are highlighted.
<list list-type="simple">
<list-item><label>1.</label><p>Object 1 was a part of a wall, which was falsely detected as a pedestrian by PointPillars, but correctly rejected as noise by EG-PointPillars.</p></list-item>
<list-item><label>2.</label><p>Object 2, comprising points from a fence and a tree, was misclassified as a vehicle by PointPillars, while EG-PointPillars correctly identified it as noise.</p></list-item>
</list></p></list-item>
<list-item><label>-</label><p>Case 3: The object marked by a pink circle corresponds to part of a street vendor stall.
<list list-type="simple">
<list-item><label>1.</label><p>PointPillars misclassified it as a pedestrian, whereas EG-PointPillars correctly determined that it belonged to none of the valid object classes and labeled it as noise.</p></list-item>
</list></p></list-item>
</list></p>
<p>These examples demonstrate the enhanced discriminative capability of EG-PointPillars in rejecting false positives, particularly for ambiguous structures and non-target objects, contributing to its overall performance improvement. Additional videos are provided and described in <xref ref-type="sec" rid="app-1">Appendix A</xref>.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we proposed EG-PointPillars, a hybrid 3D object detection framework that combines the PointPillars deep learning model with the DBSCAN clustering algorithm to enhance LiDAR-based detection performance. The proposed method improves upon the original PointPillars architecture by integrating ground point removal using Patchwork&#x002B;&#x002B;, DBSCAN-based clustering, and the Cluster 2D Pseudo-Map Branch to achieve robust performance across various real-world scenarios.</p>
<p>Experimental results demonstrated that EG-PointPillars ver. 4 achieved the highest accuracy, with a &#x002B;3.88% improvement in mAP compared to the baseline PointPillars. It notably outperformed in challenging categories such as car-hard, cyclist-moderate, and pedestrian-hard. The inclusion of the Cluster 2D Pseudo-Map Branch played a key role in this performance gain. Additionally, ver. 1, which only incorporated Patchwork&#x002B;&#x002B; preprocessing and a modified neck structure, achieved a &#x002B;1.79% mAP improvement with negligible increase in runtime, highlighting the effectiveness of lightweight enhancements.</p>
<p>The proposed EG-PointPillars offers the following key contributions:
<list list-type="simple">
<list-item><label>-</label><p>Reduced Dependency on Training Data</p></list-item>
</list></p>
<p>By incorporating clustering algorithms like DBSCAN, the model effectively leverages the consistent spatial density of input data, thereby reducing reliance on large-scale labeled datasets. This was empirically validated through performance comparisons.
<list list-type="simple">
<list-item><label>-</label><p>Balanced Performance and Real-Time Capability</p></list-item>
</list></p>
<p>The modular design of four model versions enables flexible trade-offs between detection accuracy and computational efficiency, supporting deployment in both high-performance and resource-constrained environments.
<list list-type="simple">
<list-item><label>-</label><p>Generality and Extensibility</p></list-item>
</list></p>
<p>The clustering and structural improvements proposed in EG-PointPillars are modular and model-agnostic. They can be readily applied to other LiDAR-based 3D object detection frameworks beyond DBSCAN and PointPillars.</p>
<p>In conclusion, this study presents a practical and effective hybrid approach that combines expert-driven LiDAR perception with deep learning, significantly improving 3D object detection performance. Future work will focus on extending the framework to handle more complex environments and object classes, as well as further optimizing computational efficiency to enhance real-time processing capabilities.</p>
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00245084), by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (RS-2024-00415938, HRD Program for Industrial Innovation) and Soonchunhyang University.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Chiwan Ahn, Daehee Kim and Seongkeun Park; data collection: Chiwan Ahn; analysis and interpretation of results: Chiwan Ahn, Daehee Kim and Seongkeun Park; draft manuscript preparation: Chiwan Ahn, Daehee Kim and Seongkeun Park. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The dataset used in this study is an open and publicly available dataset.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<app-group id="appg-1">
<app id="app-1">
<title>Appendix A: Video Demonstration</title>
<p>A video demonstration of the proposed EG-PointPillars is available at the following link, illustrating qualitative detection results on various LiDAR point clouds and under partial occlusion scenarios.</p>
<p>URL: <ext-link ext-link-type="uri" xlink:href="https://youtu.be/wksFKydFFTA">https://youtu.be/wksFKydFFTA</ext-link></p>
<p>The upper panel of the video shows the real camera footage capturing the overall driving scene. The lower-left panel displays the detection results of the standard PointPillars, while the lower-right panel presents those of the proposed EG-PointPillars (ver. 4). As shown in the video, the proposed method maintains more stable recognition performance even when the vehicle rotates, changes its measured shape, or experiences partial occlusion, demonstrating improved robustness compared to the baseline.</p>
</app>
</app-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yurtsever</surname> <given-names>E</given-names></string-name>, <string-name><surname>Lambert</surname> <given-names>J</given-names></string-name>, <string-name><surname>Carballo</surname> <given-names>A</given-names></string-name>, <string-name><surname>Takeda</surname> <given-names>K</given-names></string-name></person-group>. <article-title>A survey of autonomous driving: common practices and emerging technologies</article-title>. <source>IEEE Access</source>. <year>2020</year>;<volume>8</volume>:<fpage>58443</fpage>&#x2013;<lpage>69</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2020.2983149</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Teichman</surname> <given-names>A</given-names></string-name>, <string-name><surname>Levinson</surname> <given-names>J</given-names></string-name>, <string-name><surname>Thrun</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Towards 3D object recognition via classification of arbitrary object tracks</article-title>. In: <conf-name>Proceedings of the 2011 IEEE International Conference on Robotics and Automation; 2011 May 9&#x2013;13</conf-name>; <publisher-loc>Shanghai, China</publisher-loc>. p. <fpage>4034</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1109/icra.2011.5979636</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Grigorescu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Trasnea</surname> <given-names>B</given-names></string-name>, <string-name><surname>Cocias</surname> <given-names>T</given-names></string-name>, <string-name><surname>Macesanu</surname> <given-names>G</given-names></string-name></person-group>. <article-title>A survey of deep learning techniques for autonomous driving</article-title>. <source>J Field Robot</source>. <year>2020</year>;<volume>37</volume>(<issue>3</issue>):<fpage>362</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1002/rob.21918</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Arnold</surname> <given-names>E</given-names></string-name>, <string-name><surname>Al-Jarrah</surname> <given-names>OY</given-names></string-name>, <string-name><surname>Dianati</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fallah</surname> <given-names>S</given-names></string-name>, <string-name><surname>Oxtoby</surname> <given-names>D</given-names></string-name>, <string-name><surname>Mouzakitis</surname> <given-names>A</given-names></string-name></person-group>. <article-title>A survey on 3D object detection methods for autonomous driving applications</article-title>. <source>IEEE Trans Intell Transport Syst</source>. <year>2019</year>;<volume>20</volume>(<issue>10</issue>):<fpage>3782</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2019.2892405</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ester</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kriegel</surname> <given-names>HP</given-names></string-name>, <string-name><surname>Sander</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>. In: <conf-name>Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; 1996 Aug 2&#x2013;4</conf-name>; <publisher-loc>Portland, OR, USA</publisher-loc>. p. <fpage>226</fpage>&#x2013;<lpage>31</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tuzel</surname> <given-names>O</given-names></string-name></person-group>. <article-title>VoxelNet: end-to-end learning for point cloud based 3D object detection</article-title>. In: <conf-name>Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>4490</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lang</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Vora</surname> <given-names>S</given-names></string-name>, <string-name><surname>Caesar</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Beijbom</surname> <given-names>O</given-names></string-name></person-group>. <article-title>PointPillars: fast encoders for object detection from point clouds</article-title>. In: <conf-name>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>12689</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr.2019.01298</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>B</given-names></string-name></person-group>. <article-title>SECOND: sparsely embedded convolutional detection</article-title>. <source>Sensors</source>. <year>2018</year>;<volume>18</volume>(<issue>10</issue>):<fpage>3337</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s18103337</pub-id>; <pub-id pub-id-type="pmid">30301196</pub-id></mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name></person-group>. <article-title>3D object detection for autonomous driving: a comprehensive survey</article-title>. <source>Int J Comput Vis</source>. <year>2023</year>;<volume>131</volume>(<issue>8</issue>):<fpage>1909</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-023-01790-1</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Myung</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Patchwork&#x002B;&#x002B;: fast and robust ground segmentation solving partial under-segmentation using 3D point cloud</article-title>. <comment>arXiv:2207.11919</comment>. <year>2022</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hariya</surname> <given-names>K</given-names></string-name>, <string-name><surname>Inoshita</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yanase</surname> <given-names>R</given-names></string-name>, <string-name><surname>Yoneda</surname> <given-names>K</given-names></string-name>, <string-name><surname>Suganuma</surname> <given-names>N</given-names></string-name></person-group>. <article-title>ExistenceMap-PointPillars: a multifusion network for robust 3D object detection with object existence probability map</article-title>. <source>Sensors</source>. <year>2023</year>;<volume>23</volume>(<issue>20</issue>):<fpage>8367</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s23208367</pub-id>; <pub-id pub-id-type="pmid">37896463</pub-id></mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>CY</given-names></string-name>, <string-name><surname>Bochkovskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>HM</given-names></string-name></person-group>. <article-title>YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors</article-title>. In: <conf-name>Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17&#x2013;24</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>7464</fpage>&#x2013;<lpage>75</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00721</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>G</given-names></string-name>, <string-name><surname>Koh</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Moon</surname> <given-names>J</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>JW</given-names></string-name></person-group>. <article-title>LiDAR-based 3D temporal object detection via motion-aware LiDAR feature fusion</article-title>. <source>Sensors</source>. <year>2024</year>;<volume>24</volume>(<issue>14</issue>):<fpage>4667</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s24144667</pub-id>; <pub-id pub-id-type="pmid">39066063</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>L4DR: LiDAR-4DRadar fusion for weather-robust 3D object detection</article-title>. <comment>arXiv:2408.03677</comment>. <year>2024</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>TY</given-names></string-name>, <string-name><surname>Dollar</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Hariharan</surname> <given-names>B</given-names></string-name>, <string-name><surname>Belongie</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Feature pyramid networks for object detection</article-title>. In: <conf-name>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>936</fpage>&#x2013;<lpage>44</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Geiger</surname> <given-names>A</given-names></string-name>, <string-name><surname>Lenz</surname> <given-names>P</given-names></string-name>, <string-name><surname>Stiller</surname> <given-names>C</given-names></string-name>, <string-name><surname>Urtasun</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Vision meets robotics: the KITTI dataset</article-title>. <source>Int J Robot Res</source>. <year>2013</year>;<volume>32</volume>(<issue>11</issue>):<fpage>1231</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1177/0278364913491297</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>







