<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">45818</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.045818</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>CFSA-Net: Efficient Large-Scale Point Cloud Semantic Segmentation Based on Cross-Fusion Self-Attention</article-title>
<alt-title alt-title-type="left-running-head">CFSA-Net: Efficient Large-Scale Point Cloud Semantic Segmentation Based on Cross-Fusion Self-Attention</alt-title>
<alt-title alt-title-type="right-running-head">CFSA-Net: Efficient Large-Scale Point Cloud Semantic Segmentation Based on Cross-Fusion Self-Attention</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Shu</surname><given-names>Jun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Shuai</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Yu</surname><given-names>Shiqi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhang</surname><given-names>Jie</given-names></name><xref ref-type="aff" rid="aff-3">3</xref><email>zhangjie@wdu.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>School of Electrical and Engineering, Hubei University of Technology</institution>, <addr-line>Wuhan, 430068</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology</institution>, <addr-line>Wuhan, 430068</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Mechanical and Electrical Engineering, Wuhan Donghu University</institution>, <addr-line>Wuhan, 430212</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jie Zhang. Email: <email>zhangjie@wdu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>26</day>
<month>12</month>
<year>2023</year></pub-date>
<volume>77</volume>
<issue>3</issue>
<fpage>2677</fpage>
<lpage>2697</lpage>
<history>
<date date-type="received">
<day>08</day>
<month>9</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>11</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Shu et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Shu et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_45818.pdf"></self-uri>
<abstract>
<p>Traditional models for semantic segmentation in point clouds primarily focus on smaller scales. However, in real-world applications, point clouds often exhibit larger scales, leading to heavy computational and memory requirements. The key to handling large-scale point clouds lies in leveraging random sampling, which offers higher computational efficiency and lower memory consumption compared to other sampling methods. Nevertheless, the use of random sampling can potentially result in the loss of crucial points during the encoding stage. To address these issues, this paper proposes cross-fusion self-attention network (CFSA-Net), a lightweight and efficient network architecture specifically designed for directly processing large-scale point clouds. At the core of this network is the incorporation of random sampling alongside a local feature extraction module based on cross-fusion self-attention (CFSA). This module effectively integrates long-range contextual dependencies between points by employing hierarchical position encoding (HPC). Furthermore, it enhances the interaction between each point&#x0027;s coordinates and feature information through cross-fusion self-attention pooling, enabling the acquisition of more comprehensive geometric information. Finally, a residual optimization (RO) structure is introduced to extend the receptive field of individual points by stacking hierarchical position encoding and cross-fusion self-attention pooling, thereby reducing the impact of information loss caused by random sampling. Experimental results on the Stanford Large-Scale 3D Indoor Spaces (S3DIS), Semantic3D, and SemanticKITTI datasets demonstrate the superiority of this algorithm over advanced approaches such as RandLA-Net and KPConv. These findings underscore the excellent performance of CFSA-Net in large-scale 3D semantic segmentation.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Semantic segmentation</kwd>
<kwd>large-scale point cloud</kwd>
<kwd>random sampling</kwd>
<kwd>cross-fusion self-attention</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation</funding-source>
<award-id>61603127</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Large-scale semantic segmentation of point clouds has significant practical applications in real-time intelligent systems, such as autonomous driving and remote sensing. However, because large-scale point cloud datasets often contain millions of points, performing semantic segmentation efficiently at this scale is a formidable challenge. Furthermore, compared with two-dimensional image data, three-dimensional point cloud data are unordered and unstructured. Designing a deep neural network that exploits the underlying data structure of point clouds for three-dimensional semantic segmentation is therefore a demanding research problem.</p>
<p>In addressing the challenges of point cloud semantic segmentation, researchers have devoted substantial efforts to exploring deep learning-based approaches for 3D point cloud semantic segmentation. Over the past years, a growing number of deep learning frameworks have been proposed to tackle this task. Notably, Qi et al. introduced the groundbreaking PointNet [<xref ref-type="bibr" rid="ref-1">1</xref>] network, which was the first model capable of directly processing point cloud data using neural networks without additional operations. However, the PointNet network did not account for local feature extraction, prompting subsequent studies to propose various methods to address this limitation. These methods [<xref ref-type="bibr" rid="ref-2">2</xref>&#x2013;<xref ref-type="bibr" rid="ref-4">4</xref>] not only rely on individual points for feature extraction but also incorporate the aggregation of local geometric information to capture the point cloud&#x0027;s structural features. Additionally, graph-based [<xref ref-type="bibr" rid="ref-5">5</xref>&#x2013;<xref ref-type="bibr" rid="ref-7">7</xref>] and kernel-based [<xref ref-type="bibr" rid="ref-8">8</xref>&#x2013;<xref ref-type="bibr" rid="ref-10">10</xref>] convolution techniques, which have demonstrated significant advancements in the field of image processing, have been introduced to capture relationships between different local structural features through convolutional neural networks. While these algorithms have achieved noteworthy results in point cloud processing, they often partition the point cloud into small, independent blocks, such as 1 &#x00D7; 1 &#x00D7; 1-meter blocks, each containing 1024 points, for efficiency purposes. However, this partitioning approach proves impractical for large-scale point clouds as it disrupts the inherent three-dimensional object structure and incurs high computational costs. 
There are two primary reasons for the low efficiency of semantic segmentation on large-scale point clouds. 1) Existing methods often employ complex point sampling strategies to ensure a uniform distribution of points, but these strategies are either computationally intensive or memory-inefficient. 2) Previous research has typically treated feature information and coordinate information separately during local feature aggregation, simply concatenating the raw three-dimensional coordinates with the feature information and thereby failing to model geometric information comprehensively.</p>
<p>Currently, there are also existing approaches that can directly handle tasks involving large-scale point clouds. For instance, SPG [<xref ref-type="bibr" rid="ref-11">11</xref>] preprocesses point cloud data into superpoint graphs and then employs neural networks for semantic segmentation. RangeNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-12">12</xref>] and PCT [<xref ref-type="bibr" rid="ref-13">13</xref>] utilize projection-based and voxel-based methods to handle large-scale point clouds. However, these methods either entail computationally intensive and time-consuming preprocessing steps or require the partitioning of point clouds into smaller blocks for learning, resulting in suboptimal overall performance.</p>
<p>To tackle the aforementioned issues, this paper designs a new framework for large-scale point cloud semantic segmentation. The framework uses a random down-sampling strategy to process large amounts of point cloud data with fewer computing resources. Furthermore, this paper introduces a robust module for extracting local features, which enhances the network&#x2019;s capacity to describe fine-grained local features and to model geometric information more comprehensively. To this end, this paper first establishes the efficacy of random sampling and then explains why a dedicated feature extraction module is needed to capture geometric information comprehensively.</p>
<p>The downsampling of point clouds is a vital component in point cloud semantic segmentation networks. This step selects a representative subset of points from the point cloud, for which Farthest Point Sampling (FPS) [<xref ref-type="bibr" rid="ref-2">2</xref>] and Inverse Density Importance Sub-Sampling (IDIS) [<xref ref-type="bibr" rid="ref-14">14</xref>] are commonly used methods. The computational complexity of farthest point sampling is <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, where N denotes the number of points in the point cloud. Inverse density sampling, on the other hand, exhibits a computational complexity of <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>N</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. Learning-based sampling methods [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>] also exist, although they are not discussed further in this paper. In contrast, Random Sampling (RS) exhibits a computational complexity of only <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, making it an efficient option for large-scale point clouds. However, this efficiency comes at a cost: the sampled point set may not be representative, and crucial structural information within the point cloud may be lost, as depicted in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. 
To overcome the potential drawbacks of random sampling, this paper proposes a local feature extraction module based on Cross-Fusion Self-Attention (CFSA), which effectively captures intricate local structures.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Sampling effect of different sampling methods under the same sampling ratio</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-1.tif"/>
</fig>
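<p>The trade-off between sampling cost and coverage discussed above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the function names, the synthetic uniform point cloud, and the <monospace>coverage_radius</monospace> measure are assumptions introduced purely for illustration. The sketch shows that greedy farthest point sampling scans the whole cloud once per selected point, whereas random sampling simply draws indices.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def random_sampling(points, m, rng):
    """Return indices of m distinct points drawn uniformly at random."""
    return rng.choice(points.shape[0], size=m, replace=False)


def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly add the point farthest from those already chosen.

    Every pick scans all N points, so selecting m points costs O(N * m).
    """
    chosen = np.empty(m, dtype=np.int64)
    chosen[0] = start
    # Squared distance from each point to its nearest already-chosen point.
    min_d2 = np.sum((points - points[start]) ** 2, axis=1)
    for j in range(1, m):
        nxt = int(np.argmax(min_d2))
        chosen[j] = nxt
        d2 = np.sum((points - points[nxt]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return chosen


def coverage_radius(points, idx):
    """Largest distance from any point to its nearest sampled point (illustrative measure)."""
    d2 = np.sum((points[:, None, :] - points[idx][None, :, :]) ** 2, axis=2)
    return float(np.sqrt(d2.min(axis=1).max()))


rng = np.random.default_rng(0)
cloud = rng.random((1000, 3))  # synthetic stand-in for a real point cloud
rs_idx = random_sampling(cloud, 100, rng)
fps_idx = farthest_point_sampling(cloud, 100)
print("RS coverage radius: ", coverage_radius(cloud, rs_idx))
print("FPS coverage radius:", coverage_radius(cloud, fps_idx))
```
]]></code>
<p>A smaller coverage radius means the sampled subset leaves fewer regions of the cloud unrepresented, which is the kind of structural loss that the local feature extraction module is intended to compensate for when random sampling is used.</p>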
<p>The local feature extraction module, based on cross-fusion self-attention, consists of three pivotal components. Firstly, this paper proposes a hierarchical position coding module that conducts hierarchical sampling and relative position coding for each query point. This module effectively captures the long-range dependencies between points. Secondly, this study presents a cross-fusion self-attention pooling module, which facilitates the interactive fusion of feature and coordinate information within the point cloud. The CFSA pooling module dynamically strengthens the interaction between features and coordinates, thereby preserving intricate local geometric structure. Lastly, this paper introduces a residual optimization module, which stacks the hierarchical position coding module and the cross-fusion self-attention pooling module. This stacking increases the depth of the network and expands the receptive field of each point, thereby further improving the efficacy of feature extraction.</p>
<p>This paper makes significant contributions in the following aspects:</p>
<p>1. Through analysis and comparison of existing sampling methods, this paper adopts random sampling as the down-sampling strategy to process large-scale point cloud data efficiently.</p>
<p>2. This paper proposes a local feature extraction module based on cross-fusion self-attention, which better integrates the long-range contextual dependencies between points, interactively enhances the coordinate and feature information of the points, and expands the receptive field of each point to model more complete geometric information.</p>
<p>3. Building upon the aforementioned contributions, this paper proposes CFSA-Net, a powerful network designed to effectively tackle the segmentation task of large-scale point clouds. Notably, CFSA-Net achieves competitive results on three mainstream datasets: S3DIS [<xref ref-type="bibr" rid="ref-19">19</xref>], Semantic3D [<xref ref-type="bibr" rid="ref-20">20</xref>], and SemanticKITTI [<xref ref-type="bibr" rid="ref-21">21</xref>].</p>
<p>The subsequent organization of this paper is outlined as follows: <xref ref-type="sec" rid="s2">Section 2</xref> provides a detailed overview of the classical approaches utilized in point cloud semantic segmentation tasks. In <xref ref-type="sec" rid="s3">Section 3</xref>, we present an elaborate description of our proposed methodology. Comprehensive performance evaluations of the proposed method are conducted in <xref ref-type="sec" rid="s4">Section 4</xref> through comparative experiments and ablation studies. Finally, an objective summary is presented in <xref ref-type="sec" rid="s5">Section 5</xref> to conclude this paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p><bold>Projection-based and voxel-based methods:</bold> Projection-based and voxel-based methods apply specific preprocessing steps to the raw point cloud. The projection-based [<xref ref-type="bibr" rid="ref-22">22</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>] approach involves projecting the 3D point cloud onto a 2D plane, enabling the direct application of conventional 2D Convolutional Neural Networks (CNN). By leveraging the powerful capabilities of 2D CNN [<xref ref-type="bibr" rid="ref-26">26</xref>], semantic segmentation can be performed using the projected image information. On the other hand, the voxel-based [<xref ref-type="bibr" rid="ref-27">27</xref>&#x2013;<xref ref-type="bibr" rid="ref-29">29</xref>] approach transforms the 3D point cloud into a regular 3D grid or voxel representation, facilitating processing through 3D CNN. This allows for capturing the spatial relationships between the voxels through 3D convolutions. However, projection-based methods may lose information during projection and may struggle to capture fine-grained geometric details, while voxel-based methods often face memory constraints on high-resolution data and are inefficient at representing sparse point clouds. Both families therefore exhibit significant drawbacks when dealing with large-scale point clouds.</p>
<p><bold>Point-based methods:</bold> The point-based methodologies involve direct manipulation of point cloud data to implement algorithms for semantic segmentation by assigning each point in the point cloud to its corresponding semantic class. Drawing inspiration from the groundbreaking work of PointNet [<xref ref-type="bibr" rid="ref-1">1</xref>], researchers have proposed a series of neural network models to directly process raw point cloud data. For instance, Qi et al. introduced the PointNet&#x002B;&#x002B; [<xref ref-type="bibr" rid="ref-2">2</xref>] network, which integrates a sophisticated multi-level local feature aggregation module, thereby facilitating enhanced aggregation of local features. Thomas et al. proposed KPConv [<xref ref-type="bibr" rid="ref-30">30</xref>], which introduces the novel concept of kernel points and adaptively selects certain points in the point cloud as templates for convolutional kernels. Li et al. introduced the PSNet [<xref ref-type="bibr" rid="ref-31">31</xref>] network, which provides a rapid data structuring approach for simultaneous point sampling and grouping. Ibrahim et al. proposed SAT3D [<xref ref-type="bibr" rid="ref-32">32</xref>], which introduces the first-ever technique based on the Slot Attention Transformer to effectively model object-centric features in point cloud data. Point-based methods exhibit remarkable performance in handling irregular and sparse point clouds as they directly capture the local geometric attributes of each point. These networks demonstrate promising results on small-scale point clouds. However, due to their high computational and memory costs, most networks face limitations in direct scalability to larger scenes, thus hindering their modeling capabilities for large-scale point clouds.</p>
<p><bold>Large-scale point cloud semantic segmentation:</bold> Recently, various models have been introduced to address the challenge of large-scale point cloud semantic segmentation. Among them, Landrieu et al. introduced SPG [<xref ref-type="bibr" rid="ref-11">11</xref>], which leverages the concept of a superpoint graph to transform point cloud data into a graph structure and utilizes graph neural networks for semantic segmentation. Additionally, to improve computational efficiency, some models convert 3D point clouds into 2D representations, enabling the utilization of efficient 2D convolutions for semantic segmentation. For example, Tatarchenko et al. [<xref ref-type="bibr" rid="ref-33">33</xref>] projected the local surface geometry of the point cloud onto the tangent plane of each point and processed it using 2D convolutions. Wu et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] employed spherical projection to transform point cloud data into a format compatible with various mature 2D image processing techniques. Moreover, some methods directly operate on points to handle large-scale point clouds. Zhang et al. proposed PointCCR [<xref ref-type="bibr" rid="ref-34">34</xref>], which enhances efficiency through random sampling while leveraging the local structure of the point cloud and expanding the receptive field of individual points. Although the aforementioned methods have achieved significant results, their preprocessing steps involve substantial computational complexity, and projection disrupts the 3D geometric structure of the point cloud. Motivated by these approaches, and to balance efficiency with the preservation of the original 3D geometric relationships, we propose CFSA-Net, an end-to-end efficient network specifically designed for large-scale point cloud semantic segmentation.</p>
<p><bold>Self-attention mechanism:</bold> The self-attention mechanism was initially introduced in the fields of natural language processing and 2D image processing [<xref ref-type="bibr" rid="ref-35">35</xref>], and it has garnered considerable attention in current research due to its remarkable ability to model contextual information. In recent years, researchers have focused on applying this mechanism to point cloud processing tasks to further enhance the processing capabilities of point cloud data. Several self-attention-based point cloud processing methods have been proposed. For instance, Fu et al. introduced FFANet [<xref ref-type="bibr" rid="ref-36">36</xref>], which effectively captures the contextual information of each point using the self-attention mechanism. Chen et al. introduced GAPNet [<xref ref-type="bibr" rid="ref-37">37</xref>], which integrates graph attention mechanisms into a series of stacked Multi-Layer Perceptron (MLP) layers to effectively learn the local features of input point clouds. Guo et al. proposed PCT [<xref ref-type="bibr" rid="ref-13">13</xref>], which adopts the self-attention mechanism from Transformers to effectively capture the relationships between points in point cloud data, enabling better capturing of fine-grained details. Ren et al. proposed PA-Net [<xref ref-type="bibr" rid="ref-38">38</xref>], which designs two parallel self-attention mechanisms that simultaneously focus on coordinate and feature information. Previous works have primarily handled coordinate and feature information separately. In contrast, our network employs a cross-fusion self-attention mechanism, which interactively captures and integrates coordinate and feature information, considering the relative positional relations of the point cloud, thereby modeling more comprehensive geometric information.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<sec id="s3_1">
<label>3.1</label>
<title>Overview</title>
<p>The model, as illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, utilizes an encoder-decoder architecture with skip connections to process a point cloud collection comprising N points. Each point encompasses xyz coordinate position information and feature attributes (e.g., color, normal vectors) as inputs. To capture the intricate characteristics of each point, the input point cloud undergoes a series of five encoding and decoding layers. During the encoding phase, the point cloud scale is reduced through the application of random sampling. By incorporating the Local Feature Extraction (LFE) module, the model enriches the coordinate information, enhances the interaction between coordinate and feature attributes, and expands the receptive field of each point. In the decoding phase, each point employs the K-Nearest Neighbor (KNN) approach to identify its nearest neighboring point. Subsequently, Up-Sampling (US) is performed using linear interpolation to restore the point cloud to its original scale. The features from the encoding phase and the skip connections are combined through summation and then input into a shared Multi-Layer Perceptron (MLP) to reduce the dimensionality of the features. Finally, the entire process is iteratively repeated to obtain the final segmentation result.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Network structural diagram</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-2.tif"/>
</fig>
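<p>One decoding step as described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the network's actual code: up-sampling is reduced to copying the feature of the single nearest coarse point, the shared MLP is represented by one linear layer with ReLU, and the names <monospace>nearest_upsample</monospace> and <monospace>decode_step</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def nearest_upsample(coarse_xyz, coarse_feat, fine_xyz):
    """Give every fine point the feature of its nearest coarse point."""
    # Pairwise squared distances, shape (N_fine, N_coarse).
    d2 = np.sum((fine_xyz[:, None, :] - coarse_xyz[None, :, :]) ** 2, axis=2)
    nearest = np.argmin(d2, axis=1)
    return coarse_feat[nearest]


def decode_step(coarse_xyz, coarse_feat, fine_xyz, skip_feat, weight):
    """Up-sample, add the skip-connection features, apply a shared layer."""
    up = nearest_upsample(coarse_xyz, coarse_feat, fine_xyz)
    fused = up + skip_feat                   # combination by summation
    return np.maximum(fused @ weight, 0.0)   # one linear layer + ReLU (assumed)


rng = np.random.default_rng(0)
fine_xyz = rng.random((64, 3))
coarse_xyz = fine_xyz[rng.choice(64, size=16, replace=False)]
coarse_feat = rng.standard_normal((16, 32))
skip_feat = rng.standard_normal((64, 32))
weight = rng.standard_normal((32, 16))       # reduces 32 channels to 16
out = decode_step(coarse_xyz, coarse_feat, fine_xyz, skip_feat, weight)
print(out.shape)
```
]]></code>
<p>Because the same weight matrix is applied to every point, the layer is shared across points, and its output width determines how far the feature dimensionality is reduced at this decoding level.</p>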
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Local Feature Extraction Based on Cross-Fusion Self-Attention Mechanism</title>
<p>Local Feature Extraction (LFE) constitutes the core of the encoding layer and is composed of three primary components: Hierarchical Position Coding (HPC), Cross-Fusion Self-Attention (CFSA) pooling module, and Residual Optimization (RO) structure.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Hierarchical Position Coding (HPC)</title>
<p>The module encompasses hierarchical sampling and relative position encoding. The first step is sampling. Common sampling methods usually perform only KNN-based sampling of neighboring points. However, this approach limits the receptive field of each query point, hindering the establishment of long-range contextual dependencies. A straightforward solution is to increase the sampling radius, but this increases computation and memory requirements. To effectively aggregate distant contextual dependencies at a lower memory cost, a hierarchical sampling strategy is introduced, as illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The specific strategy is defined as follows:</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Hierarchical positional coding module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-3.tif"/>
</fig>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>KNN</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>FPS</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x222A;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Let the input point set be <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, where n is the total number of points in the point cloud, <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the positional information (x, y, z), and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the feature information (e.g., color, normal vectors). For each query point, K neighboring points are first selected densely using the KNN method, yielding the set <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. A sparser set of K neighboring points is then selected with the FPS method within a larger radius, forming the set <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. 
Finally, the two sets <italic>K</italic><sub>1</sub> and <italic>K</italic><sub>2</sub> are merged and duplicate points are removed, giving the final set of neighboring points, denoted as <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
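<p>The hierarchical sampling strategy of Eq. (1) can be sketched in NumPy as follows. This is an illustrative reading of the description, not the authors' code: the helper names, the brute-force distance computation, and the choice to seed the farthest point search from the query point are assumptions. Only coordinates are used to select neighbor indices here; the corresponding features would be gathered with the same indices.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def knn_indices(points, q, k):
    """Indices of the k points closest to query index q (q itself excluded)."""
    d2 = np.sum((points - points[q]) ** 2, axis=1)
    d2[q] = np.inf
    return np.argsort(d2)[:k]


def fps_in_ball(points, q, k, radius):
    """Greedy farthest point sampling restricted to a ball around the query."""
    d2_q = np.sum((points - points[q]) ** 2, axis=1)
    inside = (d2_q <= radius ** 2) & (np.arange(points.shape[0]) != q)
    cand = np.flatnonzero(inside)
    if cand.size <= k:
        return cand
    # Seed with the query: the first pick is the candidate farthest from it.
    min_d2 = d2_q[cand].copy()
    picked = []
    for _ in range(k):
        j = int(np.argmax(min_d2))
        picked.append(cand[j])
        d2 = np.sum((points[cand] - points[cand[j]]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return np.array(picked)


def hierarchical_neighbors(points, q, k, radius):
    """Return (K1, K2, K3) index sets for query index q."""
    k1 = knn_indices(points, q, k)           # dense, nearby neighbors
    k2 = fps_in_ball(points, q, k, radius)   # sparse, wider neighbors
    k3 = np.union1d(k1, k2)                  # merge and drop duplicates
    return k1, k2, k3


rng = np.random.default_rng(0)
cloud = rng.random((500, 3))                 # synthetic stand-in point cloud
k1, k2, k3 = hierarchical_neighbors(cloud, q=0, k=16, radius=0.5)
print(len(k1), len(k2), len(k3))
```
]]></code>
<p>Because duplicates are removed, the size of the merged set varies from one query point to another and never exceeds 2K, which keeps the memory cost below that of simply running KNN with a much larger neighborhood.</p>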
<p>Relative position coding is then applied to the neighbor point set <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. The coding process is defined as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> is the number of points of the set <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>K</mml:mi><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula>; <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> is the result of spatial position encoding of points; <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the coordinates of the query point; <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> is the coordinates of <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> adjacent points; <inline-formula id="ieqn-17"><mml:math 
id="mml-ieqn-17"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> is the relative coordinate between the query point and the adjacent points; <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:math></inline-formula> is the Euclidean distance between the query point and the adjacent points; g denotes the concatenation operation, which concatenates the above position terms; and the MLP maps the concatenated relative position information to the same dimension as <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
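<p>The encoding in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name, the single-layer MLP with ReLU activation, and the parameters <monospace>weight</monospace> and <monospace>bias</monospace> are assumptions introduced here, and g is taken to be concatenation along the channel axis.</p>
<code language="python">
```python
import numpy as np


def relative_position_encoding(p_i, p_neighbors, weight, bias):
    """Encode the K' neighbors of one query point, following Eq. (2).

    p_i:          (3,)    coordinates of the query point
    p_neighbors:  (K', 3) coordinates of the neighbor points
    weight, bias: hypothetical one-layer MLP parameters, shapes (10, d) and (d,)
    """
    k = p_neighbors.shape[0]
    query = np.tile(p_i, (k, 1))                          # p_i repeated per neighbor, (K', 3)
    offset = query - p_neighbors                          # relative coordinates, (K', 3)
    dist = np.linalg.norm(offset, axis=1, keepdims=True)  # Euclidean distances, (K', 1)
    g = np.concatenate([query, p_neighbors, offset, dist], axis=1)  # (K', 10)
    return np.maximum(g @ weight + bias, 0.0)             # ReLU is an assumption


rng = np.random.default_rng(0)
d = 8  # hypothetical feature dimension, chosen to match f_i
h = relative_position_encoding(
    p_i=np.zeros(3),
    p_neighbors=rng.normal(size=(16, 3)),
    weight=rng.normal(size=(10, d)),
    bias=np.zeros(d),
)
print(h.shape)  # (16, 8)
```
</code>
<p>The concatenated vector has 3 + 3 + 3 + 1 = 10 channels per neighbor, which the MLP then lifts to d channels.</p>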
<p>As depicted in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the variable <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes a feature information matrix of dimensions (<inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:math></inline-formula>). This matrix is derived from a set <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>K</mml:mi><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow></mml:mrow></mml:msubsup><mml:mrow></mml:mrow><mml:mrow></mml:mrow></mml:math></inline-formula> comprising <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> neighboring points. It is worth noting that the matrix does not include coordinate information.</p>
<p>The HPC module thus outputs the original feature information of the <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> nearest neighbor points together with the corresponding relative position encoding, which has the same dimension as the original features. Compared with conventional sampling methods, this approach spends additional computation on sparse neighboring points, which helps capture long-range dependencies. Because the distant neighbor points are sparse, the additional memory cost remains limited.</p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Cross-Fusion Self-Attention (CFSA) Pooling</title>
<p>The CFSA pooling module uses a self-attention mechanism in which local coordinate information and feature information enhance each other. Its input is the output of the HPC module, namely the coordinate and feature information produced by HPC. The structure of this module is illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Cross-fusion self-attention pooling module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-4.tif"/>
</fig>
<p>The input of the upper part is <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula>, and after the linear transformation of <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula>, the three feature descriptions of <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are obtained. Similarly, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are obtained after the linear transformation of the input <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> in the lower half. The process of linear transformation can be described as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the inputs, L denotes a linear transformation, and the subscripts q, k, and v correspond to the query, key, and value, respectively.</p>
<p>Some of the above elements are cross-fused to obtain the output <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> after self-attention calculation. The specific process is defined as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mo>&#x2297;</mml:mo></mml:math></inline-formula> denotes matrix multiplication. As <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref> shows, the values of each branch are weighted by the attention weights of the other branch, so that coordinate and feature information enhance each other. The attention weights <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the above equation are obtained from the corresponding queries and keys as follows:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2297;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mo>&#x2297;</mml:mo></mml:math></inline-formula> again denotes matrix multiplication, sum denotes adding the first row of the result of <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mo>&#x2297;</mml:mo></mml:math></inline-formula> to each subsequent row, and softmax then assigns the weights.</p>
<p>Compared with traditional self-attention mechanisms, the cross-fusion self-attention mechanism lets the coordinate and feature information produced by HPC enhance each other. Finally, the new feature description <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> of the query point is obtained by sum pooling followed by an MLP, defined as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:munderover><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
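<p>One possible reading of <xref ref-type="disp-formula" rid="eqn-3">Eqs. (3)</xref> to <xref ref-type="disp-formula" rid="eqn-6">(6)</xref> is sketched below in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the linear maps are bias-free matrices stored in a hypothetical <monospace>params</monospace> dictionary, the sum step is read literally (the first row of the score matrix is added to every later row), the softmax is taken along the last axis, g is concatenation, and the final MLP is a single layer with ReLU.</p>
<code language="python">
```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(q, k):
    """Eq. (5) for q, k of shape (K', d); returns a (d, d) weight matrix."""
    scores = q.T @ k                  # (d, d)
    scores[1:] += scores[0]           # literal reading of "sum" (assumption)
    return softmax(scores, axis=-1)   # softmax axis is an assumption


def cfsa_pooling(h, f, params):
    """h, f: (K', d) position encoding and neighbor features from HPC."""
    h_q, h_k, h_v = (h @ params[n] for n in ("hq", "hk", "hv"))  # Eq. (3)
    f_q, f_k, f_v = (f @ params[n] for n in ("fq", "fk", "fv"))
    h_a = attention_weights(h_q, h_k)
    f_a = attention_weights(f_q, f_k)
    h_o = h_v @ f_a + h               # Eq. (4): position values, feature attention
    f_o = f_v @ h_a + f               # Eq. (4): feature values, position attention
    pooled = np.concatenate([h_o, f_o], axis=1).sum(axis=0)  # Eq. (6): g, then sum over K'
    return np.maximum(pooled @ params["mlp"], 0.0)           # one-layer MLP with ReLU


rng = np.random.default_rng(0)
k_prime, d, d_out = 16, 8, 32  # hypothetical sizes
params = {n: rng.normal(scale=0.1, size=(d, d))
          for n in ("hq", "hk", "hv", "fq", "fk", "fv")}
params["mlp"] = rng.normal(scale=0.1, size=(2 * d, d_out))
F_out = cfsa_pooling(rng.normal(size=(k_prime, d)),
                     rng.normal(size=(k_prime, d)), params)
print(F_out.shape)  # (32,)
```
</code>
<p>In this reading the attention matrices are d by d, so each matrix product in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref> keeps the (K&#x2032;, d) shape needed for the residual addition.</p>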
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Residual Optimization (RO)</title>
<p>In this study, the residual optimization module stacks the HPC module and the CFSA pooling module to enlarge the receptive field of individual points and to mitigate the potential loss of key point information caused by random sampling. In principle, stacking more HPC and CFSA pooling modules extends the receptive field further, but computational efficiency and module transferability must also be considered. The residual optimization structure in this paper therefore consists of two stacked pairs of HPC and CFSA pooling modules, complemented by residual connections. A multilayer perceptron is placed before the input and after the output to obtain the required feature dimensions. Finally, the stacked output features are added to the features of the input point cloud after shared MLP processing to obtain the final aggregated features. The specific structure is illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Residual optimization module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-5.tif"/>
</fig>
<p>After the first stacking operation, the receptive field of the query point covers <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> points. After the second stacking operation, it grows to up to <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> points. The receptive field expansion is illustrated in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Receptive field expansion diagram</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-6.tif"/>
</fig>
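<p>The growth of the receptive field can be checked with a short calculation. The sketch below assumes that every stacking operation aggregates K&#x2032; neighbors per point; the function name and the value K&#x2032; = 16 are hypothetical. Because the neighborhoods of nearby points overlap, the result is an upper bound on the number of distinct points reached.</p>
<code language="python">
```python
def receptive_field_upper_bound(k_prime, num_stacks):
    """Upper bound on the points reachable after num_stacks stacking operations."""
    return k_prime ** num_stacks


for stacks in (1, 2):
    # prints "1 16", then "2 256"
    print(stacks, receptive_field_upper_bound(16, stacks))
```
</code>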
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Performance Analysis</title>
<p>In this section, the proposed network is evaluated on three mainstream semantic segmentation datasets (S3DIS, Semantic3D, SemanticKITTI). In addition, some related ablation experiments, including network structure analysis and self-attention mechanism selection, have been carried out to verify the proposed modules.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Data Set Introduction</title>
<p>This study primarily conducts evaluations on three datasets, namely S3DIS, Semantic3D, and SemanticKITTI. S3DIS is a dataset of indoor scenes, Semantic3D is a dataset of outdoor scenes, and SemanticKITTI is a dataset of autonomous driving scenarios. Each dataset has distinct point counts and features. A detailed introduction to each dataset is provided below.</p>
<p>S3DIS is a comprehensive dataset of indoor scenes, comprising six educational and office regions with a total of 271 rooms and 13 distinct categories. Each point in S3DIS is described by nine features, encompassing coordinate information and color information, along with three corresponding normal vectors.</p>
<p>The Semantic3D dataset provides a large collection of natural scene point clouds, totaling more than 4 billion points. It covers a diverse range of urban scenes, including churches, streets, railways, squares, villages, football fields, and castles. Each point is characterized by seven features: coordinate information (x, y, z), reflectance intensity, and color information (R, G, B).</p>
<p>SemanticKITTI is a widely used dataset in the field of autonomous driving. It covers categories such as pedestrians, vehicles, and other traffic participants, along with ground facilities such as parking lots and sidewalks. Each point in the SemanticKITTI dataset has four features, namely coordinate information (x, y, z) and reflectance intensity.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental Environment</title>
<p>The experimental setup is as follows. The computations are performed on Ubuntu 20.04 using the TensorFlow 2.6.0 framework, with acceleration provided by an NVIDIA Quadro P6000 GPU. The Adam optimizer is employed, and the batch sizes for the three datasets are set to 6, 3, and 3, respectively. The initial learning rate is 0.01 and the maximum number of iterations is 100 for all datasets.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparative Experiments and Results Analysis</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Experimental Results Evaluation of S3DIS Dataset</title>
<p>The S3DIS dataset partitions its 271 rooms into 6 regions, and the proposed algorithm is evaluated through 6-fold cross-validation over these regions. <xref ref-type="table" rid="table-1">Table 1</xref> compares the proposed algorithm with other algorithms across the 6 regions, with the best results highlighted in bold. Our algorithm achieves the best values on all three overall metrics, Overall Accuracy (OA), Mean Accuracy (mAcc), and Mean Intersection over Union (mIoU), with 87.6%, 82.3%, and 71.2%, respectively. It also attains the best per-class IoU for the floor, column, chair, board, and clutter categories, with gains of 0.9, 0.2, 1.8, 0.8, and 0.5 percentage points, respectively, over the best results of the other algorithms in the table. For categories such as window and door, the segmentation accuracy remains competitive, although it is not the best in the table.</p>
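<p>The 6-fold protocol described above can be sketched as follows, assuming the conventional leave-one-region-out scheme in which each region serves once as the test set while the other five are used for training. The region identifiers and the function name are hypothetical, and how the per-fold results are combined is not specified here.</p>
<code language="python">
```python
AREAS = [1, 2, 3, 4, 5, 6]  # the six S3DIS regions (identifiers are illustrative)


def six_fold_splits(areas):
    """Yield (train_areas, test_area) so that each region is held out exactly once."""
    for test_area in areas:
        yield [a for a in areas if a != test_area], test_area


for train, test in six_fold_splits(AREAS):
    print(f"train on {train}, test on {test}")
```
</code>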
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Quantitative results of semantic segmentation of S3DIS dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>mIoU</th>
<th>OA</th>
<th>mAcc</th>
<th>Ceiling</th>
<th>Floor</th>
<th>Wall</th>
<th>Beam</th>
<th>Column</th>
<th>Window</th>
<th>Door</th>
<th>Table</th>
<th>Chair</th>
<th>Sofa</th>
<th>Bookcase</th>
<th>Board</th>
<th>Clutter</th>
</tr>
</thead>
<tbody>
<tr>
<td>PointNet</td>
<td>47.6</td>
<td>78.6</td>
<td>66.2</td>
<td>88.0</td>
<td>88.7</td>
<td>69.3</td>
<td>42.4</td>
<td>23.1</td>
<td>47.5</td>
<td>51.6</td>
<td>54.1</td>
<td>42.0</td>
<td>9.6</td>
<td>38.2</td>
<td>29.4</td>
<td>35.2</td>
</tr>
<tr>
<td>PointNet<break/>&#x002B;&#x002B; (SSG)</td>
<td>55.7</td>
<td>83.9</td>
<td>68.3</td>
<td>91.5</td>
<td>95.6</td>
<td>77.5</td>
<td>28.3</td>
<td>29.1</td>
<td>50.8</td>
<td>44.3</td>
<td>61.1</td>
<td>68.4</td>
<td>21.8</td>
<td>54.1</td>
<td>48.0</td>
<td>53.3</td>
</tr>
<tr>
<td>PointNet<break/>&#x002B;&#x002B; (MSG)</td>
<td>57.6</td>
<td>86.0</td>
<td>68.5</td>
<td>92.2</td>
<td>91.8</td>
<td>78.1</td>
<td>30.6</td>
<td>31.3</td>
<td>56.5</td>
<td>63.1</td>
<td>62.8</td>
<td>64.9</td>
<td>19.4</td>
<td>55.8</td>
<td>49.1</td>
<td>54.1</td>
</tr>
<tr>
<td>SPG</td>
<td>62.1</td>
<td>85.5</td>
<td>73.0</td>
<td>89.9</td>
<td>95.1</td>
<td>76.4</td>
<td>62.8</td>
<td>47.1</td>
<td>55.3</td>
<td>68.4</td>
<td><bold>73.5</bold></td>
<td>69.2</td>
<td>63.2</td>
<td>45.9</td>
<td>8.7</td>
<td>52.9</td>
</tr>
<tr>
<td>PointWeb</td>
<td>66.7</td>
<td>87.3</td>
<td>76.2</td>
<td>93.5</td>
<td>94.2</td>
<td>80.8</td>
<td>52.4</td>
<td>41.3</td>
<td>64.9</td>
<td>68.1</td>
<td>71.4</td>
<td>67.1</td>
<td>50.3</td>
<td>62.7</td>
<td>62.2</td>
<td>58.5</td>
</tr>
<tr>
<td>KPConv</td>
<td>70.6</td>
<td>&#x2014;</td>
<td>79.1</td>
<td><bold>93.6</bold></td>
<td>92.4</td>
<td><bold>83.1</bold></td>
<td><bold>63.9</bold></td>
<td>54.3</td>
<td><bold>66.1</bold></td>
<td><bold>76.6</bold></td>
<td>57.8</td>
<td>64.0</td>
<td><bold>69.3</bold></td>
<td><bold>74.9</bold></td>
<td>61.3</td>
<td>60.3</td>
</tr>
<tr>
<td>RandLA-Net</td>
<td>70.0</td>
<td>87.1</td>
<td>81.5</td>
<td>93.1</td>
<td>96.1</td>
<td>80.6</td>
<td>62.4</td>
<td>48.0</td>
<td>64.4</td>
<td>69.4</td>
<td>69.4</td>
<td>76.4</td>
<td>60.0</td>
<td>64.2</td>
<td>65.9</td>
<td>60.1</td>
</tr>
<tr>
<td>Ours</td>
<td><bold>71.2</bold></td>
<td><bold>87.6</bold></td>
<td><bold>82.3</bold></td>
<td>93.4</td>
<td><bold>97.0</bold></td>
<td>80.5</td>
<td>63.1</td>
<td><bold>54.5</bold></td>
<td>64.8</td>
<td>70.4</td>
<td>68.5</td>
<td><bold>78.2</bold></td>
<td>64.1</td>
<td>64.1</td>
<td><bold>66.7</bold></td>
<td><bold>60.8</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
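<p>For reference, the three overall metrics reported in <xref ref-type="table" rid="table-1">Table 1</xref> can be computed from a confusion matrix using their standard definitions, as sketched below. The function name and the toy two-class matrix are illustrative only, and the sketch assumes that every class occurs at least once in the ground truth.</p>
<code language="python">
```python
import numpy as np


def segmentation_metrics(conf):
    """conf[i, j] = number of points with true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)             # correctly classified points per class
    gt = conf.sum(axis=1)          # points per ground-truth class
    pred = conf.sum(axis=0)        # points per predicted class
    oa = tp.sum() / conf.sum()     # overall accuracy
    acc = tp / gt                  # per-class accuracy
    iou = tp / (gt + pred - tp)    # per-class IoU = TP / (TP + FP + FN)
    return oa, acc.mean(), iou.mean(), iou


oa, macc, miou, per_class_iou = segmentation_metrics([[8, 2],
                                                      [1, 9]])
print(f"{oa:.3f} {macc:.3f} {miou:.3f}")  # 0.850 0.850 0.739
```
</code>
<p>The per-class columns of the table correspond to the individual entries of <monospace>per_class_iou</monospace>, and mIoU is their unweighted mean.</p>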
<p>Next, we compare the proposed algorithm with PointNet&#x002B;&#x002B; and RandLA-Net, and provide visual comparisons to demonstrate the advantages of our algorithm. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the first column represents a hallway scene, the second column depicts a conference room scene, and the third column illustrates an office scene. Each scene includes the ground truth labels, predictions from PointNet&#x002B;&#x002B;, predictions from RandLA-Net, and predictions from our algorithm. The algorithm presented in this study demonstrates the capability to accurately predict the contours of visually similar objects, the edges of small-scale objects, and the contours of embedded objects. For instance, it effectively captures the intricate geometric shapes of objects such as pillars, beams, and corners of walls, which share similarities in their geometry. Moreover, it successfully identifies the boundaries of small objects like bookshelves housing books and miscellaneous items, as well as accurately outlines embedded objects like blackboards on walls. This is attributed to the local coordinate encoding module and the cross-attention interaction module. The local coordinate encoding module preserves rich local geometric information, while the cross-attention interaction module enhances the learning capability of coordinate and feature interactions.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>S3DIS dataset semantic segmentation visualization</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-7.tif"/>
</fig>
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Experimental Results Evaluation of Semantic3D Dataset</title>
<p>The experimental evaluation uses the reduced-8 subset of the Semantic3D dataset, which comprises training point clouds from 15 distinct regions and test point clouds from 4 regions. The quantitative results are presented in <xref ref-type="table" rid="table-2">Table 2</xref>. The proposed algorithm surpasses the compared algorithms in both mIoU and OA on the Semantic3D dataset, achieving an mIoU of 78.2% and an OA of 94.9%. It performs best on buildings (structures such as churches, town halls, and stations), hard scape (a diverse category encompassing elements such as garden walls, fountains, and banks), and cars, improving on the best results of the compared algorithms by 0.2, 1.1, and 0.4 percentage points, respectively. It also achieves competitive results on man-made terrain and natural terrain.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Quantitative results of semantic segmentation of Semantic3D dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>mIoU</th>
<th>OA</th>
<th>Man-made terrain</th>
<th>Natural terrain</th>
<th>High<break/>vegetation</th>
<th>Low<break/>vegetation</th>
<th>Buildings</th>
<th>Hard scape</th>
<th>Scanning artefact</th>
<th>Car</th>
</tr>
</thead>
<tbody>
<tr>
<td>SnapNet</td>
<td>59.1</td>
<td>88.6</td>
<td>82.0</td>
<td>77.3</td>
<td>79.7</td>
<td>22.9</td>
<td>91.1</td>
<td>18.4</td>
<td>37.3</td>
<td>64.4</td>
</tr>
<tr>
<td>ShellNet</td>
<td>69.3</td>
<td>93.2</td>
<td>96.3</td>
<td>90.4</td>
<td>83.9</td>
<td>41.0</td>
<td>94.2</td>
<td>34.7</td>
<td>43.9</td>
<td>70.2</td>
</tr>
<tr>
<td>GACNet</td>
<td>70.8</td>
<td>91.9</td>
<td>86.4</td>
<td>77.7</td>
<td><bold>88.5</bold></td>
<td><bold>60.6</bold></td>
<td>94.2</td>
<td>37.3</td>
<td>43.5</td>
<td>77.8</td>
</tr>
<tr>
<td>SPG</td>
<td>73.2</td>
<td>94.0</td>
<td><bold>97.4</bold></td>
<td><bold>92.6</bold></td>
<td>87.9</td>
<td>44.0</td>
<td>83.2</td>
<td>31.0</td>
<td>63.5</td>
<td>76.2</td>
</tr>
<tr>
<td>RandLA-Net</td>
<td>77.4</td>
<td>94.8</td>
<td>95.6</td>
<td>91.4</td>
<td>86.6</td>
<td>51.5</td>
<td>95.7</td>
<td>51.5</td>
<td>69.8</td>
<td>76.8</td>
</tr>
<tr>
<td>KPConv</td>
<td>74.6</td>
<td>92.9</td>
<td>90.9</td>
<td>82.2</td>
<td>84.2</td>
<td>47.9</td>
<td>94.9</td>
<td>40.0</td>
<td><bold>77.3</bold></td>
<td>79.7</td>
</tr>
<tr>
<td>Ours</td>
<td><bold>78.2</bold></td>
<td><bold>94.9</bold></td>
<td>95.8</td>
<td>90.9</td>
<td>87.7</td>
<td>51.9</td>
<td><bold>95.9</bold></td>
<td><bold>52.6</bold></td>
<td>70.9</td>
<td><bold>80.1</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The visualized test results are shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. Because the ground truth labels for the test set of this dataset are not available, the images from left to right show the input point cloud and the predicted labels, respectively. Overall, the proposed algorithm segments the scenes well and delineates the boundaries of buildings, roads, and other target objects. The hard scape category is unevenly distributed and varies substantially in shape and structure; its internal geometry, color, and texture also change with the environmental context. Nonetheless, the proposed algorithm achieves the best hard scape result among the compared methods in <xref ref-type="table" rid="table-2">Table 2</xref>. Together, the quantitative results and the visualizations indicate that the algorithm can identify intricate details and complex components within the point cloud structure and distinguish the features of different targets. These findings support the network&#x2019;s effectiveness in feature extraction, spatial information aggregation, and precise segmentation, and thus the effectiveness of the feature extraction module.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Visualization results of semantic segmentation of Semantic3D dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-8.tif"/>
</fig>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Experimental Results Evaluation of SemanticKITTI Dataset</title>
<p>The SemanticKITTI dataset is an extension of the KITTI dataset. <xref ref-type="table" rid="table-3">Table 3</xref> provides a quantitative comparison of our algorithm with several classical algorithms on this dataset. Our algorithm achieves an mIoU of 55.4%, outperforming most of the compared approaches. It obtains the best results in the car, vegetation, and terrain categories. Among the point-based methods, it achieves the highest mIoU, and when the projection-based and voxel-based methods are also considered, it ranks second only to SalsaNext.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Quantitative results of semantic segmentation of SemanticKITTI dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>Model</th>
<th>mIoU</th>
<th>Car</th>
<th>Bicycle</th>
<th>Motor<break/>cycle</th>
<th>Truck</th>
<th>Other-<break/>vehicle</th>
<th>Person</th>
<th>Bicyclist</th>
<th>Motor<break/>cyclist</th>
<th>Road</th>
<th>Parking</th>
<th>Side<break/>walk</th>
<th>Other-<break/>ground</th>
<th>Building</th>
<th>Fence</th>
<th>Vege-<break/>tation</th>
<th>Trunk</th>
<th>Terrain</th>
<th>Pole</th>
<th>Traffic-<break/>sign</th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td>SqueezeSeg</td>
<td>29.5</td>
<td>68.8</td>
<td>16.0</td>
<td>4.1</td>
<td>3.3</td>
<td>3.6</td>
<td>12.9</td>
<td>13.1</td>
<td>0.9</td>
<td>85.4</td>
<td>26.9</td>
<td>54.3</td>
<td>4.5</td>
<td>57.4</td>
<td>29.0</td>
<td>60.0</td>
<td>24.3</td>
<td>53.7</td>
<td>17.5</td>
<td>24.5</td>
</tr>
<tr>
<td/>
<td>SqueezeSegV2</td>
<td>39.7</td>
<td>81.8</td>
<td>18.5</td>
<td>17.9</td>
<td>13.4</td>
<td>14.0</td>
<td>20.1</td>
<td>25.1</td>
<td>3.9</td>
<td>88.6</td>
<td>45.8</td>
<td>67.6</td>
<td>17.7</td>
<td>73.7</td>
<td>41.1</td>
<td>71.8</td>
<td>35.8</td>
<td>60.2</td>
<td>20.2</td>
<td>36.3</td>
</tr>
<tr>
<td/>
<td>DarkNet21Seg</td>
<td>47.4</td>
<td>85.4</td>
<td>26.2</td>
<td>26.5</td>
<td>18.6</td>
<td>15.6</td>
<td>31.8</td>
<td>33.6</td>
<td>4.0</td>
<td>91.4</td>
<td>57.0</td>
<td>74.0</td>
<td>26.4</td>
<td>81.9</td>
<td>52.3</td>
<td>77.6</td>
<td>48.4</td>
<td>63.6</td>
<td>36.0</td>
<td>50.0</td>
</tr>
<tr>
<td/>
<td>DarkNet53Seg</td>
<td>49.9</td>
<td>86.4</td>
<td>24.5</td>
<td>32.7</td>
<td>25.5</td>
<td>22.6</td>
<td>36.2</td>
<td>33.6</td>
<td>4.7</td>
<td>91.8</td>
<td>64.8</td>
<td>74.6</td>
<td>27.9</td>
<td>84.1</td>
<td>55.0</td>
<td>78.3</td>
<td>50.1</td>
<td>64.0</td>
<td>38.9</td>
<td>52.2</td>
</tr>
<tr>
<td>Projection<break/>&#x0026;Voxel</td>
<td>S-BKI</td>
<td>51.3</td>
<td>83.8</td>
<td>30.6</td>
<td>43.0</td>
<td>26.0</td>
<td>19.6</td>
<td>8.5</td>
<td>3.4</td>
<td>0.0</td>
<td><bold>92.6</bold></td>
<td><bold>65.3</bold></td>
<td><bold>77.4</bold></td>
<td><bold>30.1</bold></td>
<td>89.7</td>
<td>63.7</td>
<td>83.4</td>
<td>64.3</td>
<td>67.4</td>
<td><bold>58.6</bold></td>
<td><bold>67.1</bold></td>
</tr>
<tr>
<td/>
<td>RangeNet&#x002B;&#x002B;</td>
<td>52.2</td>
<td>91.4</td>
<td>25.7</td>
<td>34.4</td>
<td>25.7</td>
<td>23.0</td>
<td>38.3</td>
<td>38.8</td>
<td>4.8</td>
<td>91.8</td>
<td>65.0</td>
<td>75.2</td>
<td>27.8</td>
<td>87.4</td>
<td>58.6</td>
<td>80.5</td>
<td>55.1</td>
<td>64.6</td>
<td>47.9</td>
<td>55.9</td>
</tr>
<tr>
<td/>
<td>LatticeNet</td>
<td>52.2</td>
<td>88.6</td>
<td>12.0</td>
<td>20.8</td>
<td>43.3</td>
<td>24.8</td>
<td>34.2</td>
<td>39.9</td>
<td><bold>60.9</bold></td>
<td>88.8</td>
<td>64.6</td>
<td>73.8</td>
<td>25.6</td>
<td>86.9</td>
<td>55.2</td>
<td>76.4</td>
<td>57.9</td>
<td>54.7</td>
<td>41.5</td>
<td>42.7</td>
</tr>
<tr>
<td/>
<td>PolarNet</td>
<td>54.3</td>
<td>83.8</td>
<td>40.3</td>
<td>30.1</td>
<td>22.9</td>
<td>28.5</td>
<td>43.2</td>
<td>40.2</td>
<td>5.6</td>
<td>90.8</td>
<td>61.7</td>
<td>74.4</td>
<td>21.7</td>
<td>90.0</td>
<td>61.3</td>
<td>84.0</td>
<td><bold>65.5</bold></td>
<td>67.8</td>
<td>51.8</td>
<td>57.5</td>
</tr>
<tr>
<td/>
<td>SalsaNext</td>
<td>59.5</td>
<td>91.9</td>
<td><bold>48.3</bold></td>
<td><bold>38.6</bold></td>
<td>38.9</td>
<td>31.9</td>
<td><bold>60.2</bold></td>
<td><bold>59.0</bold></td>
<td>19.4</td>
<td>91.7</td>
<td>63.7</td>
<td>75.8</td>
<td>29.1</td>
<td><bold>90.2</bold></td>
<td><bold>64.2</bold></td>
<td>81.8</td>
<td>63.6</td>
<td>66.5</td>
<td>54.3</td>
<td>62.1</td>
</tr>
<tr>
<td/>
<td>PointNet</td>
<td>14.6</td>
<td>46.3</td>
<td>1.3</td>
<td>0.3</td>
<td>0.1</td>
<td>0.8</td>
<td>0.2</td>
<td>0.2</td>
<td>0.0</td>
<td>61.6</td>
<td>15.8</td>
<td>35.7</td>
<td>1.4</td>
<td>41.4</td>
<td>12.9</td>
<td>31.0</td>
<td>4.6</td>
<td>17.6</td>
<td>2.4</td>
<td>3.7</td>
</tr>
<tr>
<td/>
<td>SPG</td>
<td>17.4</td>
<td>49.3</td>
<td>0.2</td>
<td>0.2</td>
<td>0.1</td>
<td>0.8</td>
<td>0.3</td>
<td>2.7</td>
<td>0.1</td>
<td>45.0</td>
<td>0.6</td>
<td>28.5</td>
<td>0.6</td>
<td>64.3</td>
<td>20.8</td>
<td>48.9</td>
<td>27.2</td>
<td>24.6</td>
<td>15.9</td>
<td>0.8</td>
</tr>
<tr>
<td>Point</td>
<td>PointNet&#x002B;&#x002B;</td>
<td>20.1</td>
<td>53.7</td>
<td>1.9</td>
<td>0.2</td>
<td>0.9</td>
<td>0.2</td>
<td>0.9</td>
<td>1.0</td>
<td>0.0</td>
<td>72.0</td>
<td>18.7</td>
<td>41.8</td>
<td>5.6</td>
<td>62.3</td>
<td>16.9</td>
<td>46.5</td>
<td>0.9</td>
<td>30.0</td>
<td>6.0</td>
<td>8.9</td>
</tr>
<tr>
<td/>
<td>RandLA-Net</td>
<td>53.9</td>
<td>94.2</td>
<td>26.0</td>
<td>25.8</td>
<td><bold>40.1</bold></td>
<td><bold>38.9</bold></td>
<td>49.2</td>
<td>48.2</td>
<td>7.2</td>
<td>90.7</td>
<td>60.3</td>
<td>73.7</td>
<td>20.4</td>
<td>86.9</td>
<td>56.3</td>
<td>81.4</td>
<td>61.3</td>
<td>66.8</td>
<td>49.2</td>
<td>47.7</td>
</tr>
<tr>
<td/>
<td>Ours</td>
<td>55.4</td>
<td><bold>94.5</bold></td>
<td>31.8</td>
<td>36.2</td>
<td>35.9</td>
<td>33.7</td>
<td>45.4</td>
<td>50.5</td>
<td>6.5</td>
<td>91.2</td>
<td>62.0</td>
<td>74.8</td>
<td>24.5</td>
<td>89.7</td>
<td>60.1</td>
<td><bold>84.1</bold></td>
<td>58.3</td>
<td><bold>68.6</bold></td>
<td>51.0</td>
<td>53.7</td>
</tr>
</tbody>
</table>
</table-wrap>
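<p>As a consistency check on <xref ref-type="table" rid="table-3">Table 3</xref>, the mIoU can be recomputed as the unweighted mean of the 19 per-class IoU values. The short Python sketch below does this for the row of our algorithm. The dictionary keys and the helper name <monospace>mean_iou</monospace> are illustrative choices for this example and are not taken from the released code.</p>
<preformat preformat-type="code"><![CDATA[
```python
from statistics import mean

# Per-class IoU (%) of the "Ours" row in Table 3, in column order.
ours_iou = {
    "car": 94.5, "bicycle": 31.8, "motorcycle": 36.2, "truck": 35.9,
    "other-vehicle": 33.7, "person": 45.4, "bicyclist": 50.5,
    "motorcyclist": 6.5, "road": 91.2, "parking": 62.0, "sidewalk": 74.8,
    "other-ground": 24.5, "building": 89.7, "fence": 60.1,
    "vegetation": 84.1, "trunk": 58.3, "terrain": 68.6, "pole": 51.0,
    "traffic-sign": 53.7,
}


def mean_iou(per_class_iou):
    """Unweighted mean of the per-class IoU values (every class counts equally)."""
    return mean(per_class_iou.values())


miou = mean_iou(ours_iou)
print(f"{len(ours_iou)} classes, mIoU = {miou:.1f}%")
```
]]></preformat>
<p>Rounded to one decimal place, the recomputed mean is 55.4%, which agrees with the mIoU column. Because every class carries equal weight, rare classes such as motorcyclist (6.5%) pull the mean down as strongly as frequent classes such as road raise it.</p>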
<p>The segmentation results of our algorithm on the SemanticKITTI dataset are visually depicted in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>. From left to right, the images correspond to the ground truth labels, predictions from SqueezeSegV2, predictions from RandLA-Net, and predictions from our algorithm. It is evident from the figure that our algorithm achieves the closest approximation to the ground truth labels in vehicle predictions, while also demonstrating excellent segmentation performance in vegetation areas and along terrain edges. The visual analysis reveals that even on large-scale outdoor scene datasets characterized by sparse point cloud densities, our algorithm consistently achieves favorable segmentation results, effectively showcasing the efficacy of our network&#x0027;s feature extraction capabilities.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Visualization results of semantic segmentation of SemanticKITTI dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-9.tif"/>
</fig>
</sec>
<sec id="s4_3_4">
<label>4.3.4</label>
<title>Discussion</title>
<p>S3DIS, Semantic3D, and SemanticKITTI are all point cloud datasets collected from the real world. S3DIS focuses on indoor scenes, Semantic3D covers large-scale outdoor scenes in urban, rural, and natural environments, and SemanticKITTI focuses on autonomous driving scenarios. These three datasets differ significantly in scale and scene type. Nevertheless, the proposed model achieves competitive results on all three, demonstrating its strong generalization ability. In future work, we plan to enhance the model&#x0027;s robustness to input data by introducing data augmentation techniques such as rotation and translation during training.</p>
</sec>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Experiments</title>
<sec id="s4_4_1">
<label>4.4.1</label>
<title>Efficiency Analysis of Sampling Method</title>
<p>This study aims to address the challenge of semantic segmentation in large-scale point clouds. We analyze existing semantic segmentation network models under the conditions of large-scale point clouds. Our findings reveal that the choice of sampling method significantly impacts both training time and memory consumption, thereby necessitating the establishment of an effective downsampling strategy. Such a strategy should enable the rational processing of large-scale point clouds and enhance the overall efficiency of the network. In this regard, we analyze five distinct sampling methods, namely Random Sampling (RS), Farthest Point Sampling (FPS), Generator-Based Sampling (GS), Policy Gradient-Based Sampling (PGS), and Inverse Density Importance Sampling (IDIS).</p>
<p><xref ref-type="fig" rid="fig-10">Fig. 10</xref> compares the efficiency of the sampling methods on point clouds of different scales. The number of points is plotted on the x-axis, while memory consumption and processing time are plotted on the y-axis. For small point clouds, all of the sampling methods exhibit similar time and memory consumption, and the computational burden is minimal. However, as the number of points increases, FPS, GS, PGS, and IDIS become either highly time-consuming or memory-intensive. In contrast, random sampling remains comparatively cheap in both time and memory. This outcome indicates that most existing semantic segmentation networks perform well only on small-scale point clouds, primarily because of the limitations of the sampling methods they employ. In summary, among the five sampling methods analyzed above, random sampling has distinct advantages in time and memory consumption. Consequently, this study adopts random sampling for processing large-scale point cloud data.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Comparison of the time and memory consumption of different sampling methods</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_45818-fig-10.tif"/>
</fig>
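<p>To illustrate why the cost of random sampling and farthest point sampling diverges as the point cloud grows, the following pure-Python sketch implements both. It is a minimal illustration under stated assumptions and is not the benchmark code behind <xref ref-type="fig" rid="fig-10">Fig. 10</xref>: the function names, the naive greedy form of FPS, and the synthetic uniform point cloud are all choices made for this example only.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random
import time


def random_sampling(points, m, rng):
    """Pick m distinct points uniformly at random, ignoring geometry."""
    return rng.sample(points, m)


def farthest_point_sampling(points, m, rng):
    """Naive greedy FPS: repeatedly add the point farthest from the chosen set.

    Each added point requires a full scan over all N points,
    so this form costs O(N * m) distance evaluations.
    """
    n = len(points)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(points)")
    chosen = [rng.randrange(n)]
    # min_dist[i]: squared distance from point i to its nearest chosen point
    min_dist = [float("inf")] * n
    for _ in range(m - 1):
        last = points[chosen[-1]]
        best_idx, best_dist = -1, -1.0
        for i, p in enumerate(points):
            d = (p[0] - last[0]) ** 2 + (p[1] - last[1]) ** 2 + (p[2] - last[2]) ** 2
            if d < min_dist[i]:
                min_dist[i] = d
            if min_dist[i] > best_dist:
                best_idx, best_dist = i, min_dist[i]
        chosen.append(best_idx)
    return [points[i] for i in chosen]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1000, 4000):
        cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
        m = n // 4  # keep a quarter of the points (an arbitrary ratio for this demo)
        _, t_rs = timed(random_sampling, cloud, m, rng)
        _, t_fps = timed(farthest_point_sampling, cloud, m, rng)
        print(f"N={n}: RS {t_rs:.4f} s, FPS {t_fps:.4f} s")
```
]]></preformat>
<p>With a fixed sampling ratio, the number of distance evaluations in this naive FPS grows quadratically with the number of points, whereas random sampling only draws the required indices. Absolute timings depend on the machine and the implementation, so the sketch is meant to show the trend rather than to reproduce the reported measurements.</p>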
</sec>
<sec id="s4_4_2">
<label>4.4.2</label>
<title>Network Structure Analysis</title>
<p>To validate the effectiveness of the proposed HPC and CFSA pooling modules, we tested each module in turn within the same network architecture and evaluated performance on the S3DIS dataset, as shown in <xref ref-type="table" rid="table-4">Table 4</xref>. Without either module, the mIoU was only 68.1%. When the HPC and CFSA pooling modules were employed individually, the mIoU improved by 1.1 and 2.0 percentage points, reaching 69.2% and 70.1%, respectively. When both modules were used together, the mIoU improved by 3.1 percentage points, reaching 71.2%. These ablation results demonstrate that both proposed modules contribute to feature extraction.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Analysis of experimental results of network structure</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>HPC</th>
<th>CFSA pooling</th>
<th>mIoU (S3DIS)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>68.1</td>
</tr>
<tr>
<td>&#x221A;</td>
<td></td>
<td>69.2</td>
</tr>
<tr>
<td></td>
<td>&#x221A;</td>
<td>70.1</td>
</tr>
<tr>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td>71.2</td>
</tr>
</tbody>
</table>
</table-wrap>
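<p>The gains attributed to each module follow directly from the mIoU values in <xref ref-type="table" rid="table-4">Table 4</xref>. The small Python sketch below recomputes them as differences from the baseline, in percentage points. The variable names are illustrative and the only inputs are the four values in the table.</p>
<preformat preformat-type="code"><![CDATA[
```python
# mIoU (%) on S3DIS, copied from Table 4.
baseline = 68.1  # neither HPC nor CFSA pooling
results = {
    "HPC only": 69.2,
    "CFSA pooling only": 70.1,
    "HPC + CFSA pooling": 71.2,
}

# Absolute gain over the baseline, in percentage points.
gains = {name: round(value - baseline, 1) for name, value in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points")
```
]]></preformat>
<p>The recomputed gains are 1.1, 2.0, and 3.1 percentage points. The combined gain equals the sum of the two individual gains, which suggests that, on this dataset, the two modules contribute largely independently.</p>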
</sec>
<sec id="s4_4_3">
<label>4.4.3</label>
<title>Selection of Self-Attention Mechanism</title>
<p><xref ref-type="table" rid="table-5">Table 5</xref> presents the results of ablation experiments on the S3DIS dataset, examining the impact of different self-attention mechanisms within the constructed local feature extraction module. The evaluated mechanisms include channel self-attention (CSA), spatial self-attention (SSA), dual-channel self-attention (DCSA) with parallel spatial and channel interactions, and our proposed CFSA mechanism. These experiments aim to assess the influence of these various self-attention mechanisms on the performance of point cloud semantic segmentation. The results in the table demonstrate that the CFSA mechanism achieves the most favorable outcomes, thus substantiating the effectiveness of this approach.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Experimental results of different self-attention mechanisms</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>CSA</th>
<th>SSA</th>
<th>DCSA</th>
<th>CFSA</th>
<th>mIoU (S3DIS)</th>
</tr>
</thead>
<tbody>
<tr>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td>69.6</td>
</tr>
<tr>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>70.0</td>
</tr>
<tr>
<td></td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>70.7</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>&#x221A;</td>
<td>71.2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusions</title>
<p>This paper presents CFSA-Net, a novel network designed for semantic segmentation of large-scale point clouds. The framework adopts a memory-efficient and computationally economical random sampling strategy. To mitigate the potential drawbacks of random sampling, it introduces a local feature extraction module based on cross-fusion self-attention, enabling more comprehensive modeling of geometric information. Comprehensive experiments on the public S3DIS, Semantic3D, and SemanticKITTI datasets show that the network performs competitively in large-scale point cloud semantic segmentation. The visualized predictions illustrate the network&#x0027;s ability to adapt to variations in the shape, structure, and appearance of the target, demonstrating its adaptability and generalization capabilities.</p>
<p>The primary limitation of this study is that the fully supervised learning paradigm requires point-wise class annotations, which are highly challenging to obtain for large-scale point clouds. Future research will concentrate on weakly supervised and semi-supervised segmentation methods tailored to large-scale point clouds, to alleviate the burden of manual annotation and reduce the associated costs. The algorithm proposed in this paper could also be combined with the multi-innovation theory and the hierarchical identification principle [<xref ref-type="bibr" rid="ref-39">39</xref>&#x2013;<xref ref-type="bibr" rid="ref-42">42</xref>] to enhance computational efficiency and accuracy.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to express their gratitude for the valuable feedback and suggestions provided by all the anonymous reviewers and the editorial team.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This study was funded by the National Natural Science Foundation of China Youth Project (61603127).</p>
</sec>
<sec><title>Author Contributions</title>
<p>Conceptualization, Jun Shu and Jie Zhang; Data curation, Jie Zhang; Formal analysis, Shiqi Yu and Jie Zhang; Investigation, Shiqi Yu; Methodology, Jun Shu, Shuai Wang and Jie Zhang; Software, Jun Shu and Shiqi Yu; Validation, Jun Shu and Shuai Wang; Visualization, Shuai Wang; Writing&#x2013;original draft, Shuai Wang and Jie Zhang. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The training data used in this paper were obtained from S3DIS, Semantic3D, and SemanticKITTI, which are available online at <ext-link ext-link-type="uri" xlink:href="http://buildingparser.stanford.edu/dataset.html">http://buildingparser.stanford.edu/dataset.html</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://semantic3d.net/">http://semantic3d.net/</ext-link>, and <ext-link ext-link-type="uri" xlink:href="http://semantic-kitti.org/">http://semantic-kitti.org/</ext-link>, respectively.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. R.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Mo</surname></string-name> and <string-name><given-names>L. J.</given-names> <surname>Guibas</surname></string-name></person-group>, &#x201C;<article-title>PointNet: Deep learning on point sets for 3D classification and segmentation</article-title>,&#x201D; in <conf-name>30th IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>77</fpage>&#x2013;<lpage>85</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. R.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yi</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Su</surname></string-name> and <string-name><given-names>L. J.</given-names> <surname>Guibas</surname></string-name></person-group>, &#x201C;<article-title>PointNet&#x002B;&#x002B;: Deep hierarchical feature learning on point sets in a metric space</article-title>,&#x201D; in <conf-name>31st Annu. Conf. on Neural Information Processing Systems</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>5100</fpage>&#x2013;<lpage>5109</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhao</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Lu</surname></string-name></person-group>, &#x201C;<article-title>PointSIFT: A SIFT-like network module for 3D point cloud semantic segmentation</article-title>,&#x201D; <year>2018</year>. <pub-id pub-id-type="doi">10.48550/arXiv.1807.00652</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>C. W.</given-names> <surname>Fu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Jia</surname></string-name></person-group>, &#x201C;<article-title>PointWeb: Enhancing local neighborhood features for point cloud processing</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>5560</fpage>&#x2013;<lpage>5568</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Lei</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Akhtar</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Mian</surname></string-name></person-group>, &#x201C;<article-title>Spherical convolutional neural network for 3D point clouds</article-title>,&#x201D; <year>2018</year>. <pub-id pub-id-type="doi">10.48550/arXiv.1805.07872</pub-id></mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Hou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Shan</surname></string-name></person-group>, &#x201C;<article-title>Graph attention convolution for point cloud semantic segmentation</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>10288</fpage>&#x2013;<lpage>10297</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>S. E.</given-names> <surname>Sarma</surname></string-name>, <string-name><given-names>M. M.</given-names> <surname>Bronstein</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Dynamic graph CNN for learning on point clouds</article-title>,&#x201D; <source>ACM Transactions on Graphics</source>, vol. <volume>38</volume>, no. <issue>5</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>12</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Bu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Di</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>PointCNN: Convolution on X-transformed points</article-title>,&#x201D; in <conf-name>32nd Conf. on Neural Information Processing Systems</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, pp. <fpage>820</fpage>&#x2013;<lpage>830</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Komarichev</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhong</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Hua</surname></string-name></person-group>, &#x201C;<article-title>A-CNN: Annularly convolutional neural networks on point clouds</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>7413</fpage>&#x2013;<lpage>7422</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Qi</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Fuxin</surname></string-name></person-group>, &#x201C;<article-title>PointConv: Deep convolutional networks on 3D point clouds</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>9613</fpage>&#x2013;<lpage>9622</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Landrieu</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Simonovsky</surname></string-name></person-group>, &#x201C;<article-title>Large-scale point cloud semantic segmentation with superpoint graphs</article-title>,&#x201D; in <conf-name>31st Meeting of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, pp. <fpage>4558</fpage>&#x2013;<lpage>4567</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Milioto</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Vizzo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Behley</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Stachniss</surname></string-name></person-group>, &#x201C;<article-title>RangeNet&#x002B;&#x002B;: Fast and accurate LiDAR semantic segmentation</article-title>,&#x201D; in <conf-name>2019 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems</conf-name>, <publisher-loc>Macau, China</publisher-loc>, pp. <fpage>4213</fpage>&#x2013;<lpage>4220</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. H.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>J. X.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>Z. N.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>T. J.</given-names> <surname>Mu</surname></string-name>, <string-name><given-names>R. R.</given-names> <surname>Martin</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>PCT: Point cloud transformer</article-title>,&#x201D; <source>Computational Visual Media</source>, vol. <volume>7</volume>, no. <issue>2</issue>, pp. <fpage>187</fpage>&#x2013;<lpage>199</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Groh</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Wieschollek</surname></string-name> and <string-name><given-names>H. P. A.</given-names> <surname>Lensch</surname></string-name></person-group>, &#x201C;<article-title>Flex-convolution: Million-scale point-cloud learning beyond grid-worlds</article-title>,&#x201D; in <conf-name>14th Asian Conf. on Computer Vision</conf-name>, <publisher-loc>Perth, WA, Australia</publisher-loc>, pp. <fpage>105</fpage>&#x2013;<lpage>122</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Abid</surname></string-name>, <string-name><given-names>M. F.</given-names> <surname>Balin</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Zou</surname></string-name></person-group>, &#x201C;<article-title>Concrete autoencoders: Differentiable feature selection and reconstruction</article-title>,&#x201D; in <conf-name>36th Int. Conf. on Machine Learning</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>694</fpage>&#x2013;<lpage>711</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Dovrat</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Lang</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Avidan</surname></string-name></person-group>, &#x201C;<article-title>Learning to sample</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>2755</fpage>&#x2013;<lpage>2764</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Cui</surname></string-name></person-group>, &#x201C;<article-title>PointASNL: Robust point clouds processing using nonlocal neural networks with adaptive sampling</article-title>,&#x201D; in <conf-name>2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Seattle, WA, USA</publisher-loc>, pp. <fpage>5588</fpage>&#x2013;<lpage>5597</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Yi</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Gao</surname></string-name></person-group>, &#x201C;<article-title>Review on offloading of vehicle edge computing</article-title>,&#x201D; <source>Journal of Artificial Intelligence and Technology</source>, vol. <volume>2</volume>, no. <issue>4</issue>, pp. <fpage>132</fpage>&#x2013;<lpage>143</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Armeni</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Sener</surname></string-name>, <string-name><given-names>A. R.</given-names> <surname>Zamir</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Brilakis</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>3D semantic parsing of large-scale indoor spaces</article-title>,&#x201D; in <conf-name>29th IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Las Vegas, NV, USA</publisher-loc>, pp. <fpage>1534</fpage>&#x2013;<lpage>1543</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Hackel</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Savinov</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Ladicky</surname></string-name>, <string-name><given-names>J. D.</given-names> <surname>Wegner</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Schindler</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>SEMANTIC3D.NET: A new large-scale point cloud classification benchmark</article-title>,&#x201D; in <conf-name>ISPRS Hannover Workshop 2017 on High-Resolution Earth Imaging for Geospatial Information</conf-name>, <publisher-loc>Hannover, Germany</publisher-loc>, pp. <fpage>91</fpage>&#x2013;<lpage>98</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Behley</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Garbade</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Milioto</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Quenzel</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Behnke</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences</article-title>,&#x201D; in <conf-name>17th IEEE/CVF Int. Conf. on Computer Vision</conf-name>, <publisher-loc>Seoul, Korea</publisher-loc>, pp. <fpage>9296</fpage>&#x2013;<lpage>9306</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Wan</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yue</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Keutzer</surname></string-name></person-group>, &#x201C;<article-title>SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Robotics and Automation</conf-name>, <publisher-loc>Brisbane, QLD, Australia</publisher-loc>, pp. <fpage>1887</fpage>&#x2013;<lpage>1893</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A. H.</given-names> <surname>Lang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Vora</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Caesar</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>PointPillars: Fast encoders for object detection from point clouds</article-title>,&#x201D; in <conf-name>32nd IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>12689</fpage>&#x2013;<lpage>12697</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yue</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Keutzer</surname></string-name></person-group>, &#x201C;<article-title>SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud</article-title>,&#x201D; in <conf-name>2019 Int. Conf. on Robotics and Automation</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, pp. <fpage>4376</fpage>&#x2013;<lpage>4382</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Ma</surname></string-name></person-group>, &#x201C;<article-title>End-to-end BEV perception via homography matrix</article-title>,&#x201D; in <conf-name>6th IEEE Information Technology, Networking, Electronic and Automation Control Conf.</conf-name>, <publisher-loc>Chongqing, China</publisher-loc>, pp. <fpage>1352</fpage>&#x2013;<lpage>1356</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. D.</given-names> <surname>Khan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Alarabi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Basalamah</surname></string-name></person-group>, &#x201C;<article-title>Deep hybrid network for land cover semantic segmentation in high-spatial resolution satellite images</article-title>,&#x201D; <source>Information</source>, vol. <volume>12</volume>, no. <issue>6</issue>, pp. <fpage>230</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L. P.</given-names> <surname>Tchapmi</surname></string-name>, <string-name><given-names>C. B.</given-names> <surname>Choy</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Armeni</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gwak</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Savarese</surname></string-name></person-group>, &#x201C;<article-title>SEGCloud: Semantic segmentation of 3D point clouds</article-title>,&#x201D; <year>2017</year>. <pub-id pub-id-type="doi">10.48550/arXiv.1710.07563</pub-id></mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H. Y.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>Y. K.</given-names> <surname>Lai</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Manocha</surname></string-name></person-group>, &#x201C;<article-title>VV-Net: Voxel VAE net with group convolutions for point cloud segmentation</article-title>,&#x201D; in <conf-name>17th IEEE/CVF Int. Conf. on Computer Vision</conf-name>, <publisher-loc>Seoul, Korea</publisher-loc>, pp. <fpage>8499</fpage>&#x2013;<lpage>8507</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Hao</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>Multi point-voxel convolution (MPVConv) for deep learning on point clouds</article-title>,&#x201D; <source>Computers &#x0026; Graphics</source>, vol. <volume>112</volume>, pp. <fpage>72</fpage>&#x2013;<lpage>80</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Thomas</surname></string-name>, <string-name><given-names>C. R.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>J. E.</given-names> <surname>Deschaud</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Marcotegui</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Goulette</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>KPConv: Flexible and deformable convolution for point clouds</article-title>,&#x201D; in <conf-name>17th IEEE/CVF Int. Conf. on Computer Vision</conf-name>, <publisher-loc>Seoul, Korea</publisher-loc>, pp. <fpage>6410</fpage>&#x2013;<lpage>6419</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>He</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Han</surname></string-name></person-group>, &#x201C;<article-title>PSNet: Fast data structuring for hierarchical deep learning on point cloud</article-title>,&#x201D; <source>IEEE Transactions on Circuits and Systems for Video Technology</source>, vol. <volume>32</volume>, no. <issue>10</issue>, pp. <fpage>6835</fpage>&#x2013;<lpage>6849</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Ibrahim</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Akhtar</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Anwar</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Mian</surname></string-name></person-group>, &#x201C;<article-title>SAT3D: Slot attention transformer for 3D point cloud semantic segmentation</article-title>,&#x201D; <source>IEEE Transactions on Intelligent Transportation Systems</source>, vol. <volume>24</volume>, no. <issue>5</issue>, pp. <fpage>5456</fpage>&#x2013;<lpage>5466</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Tatarchenko</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Park</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Koltun</surname></string-name> and <string-name><given-names>Q. Y.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Tangent convolutions for dense prediction in 3D</article-title>,&#x201D; in <conf-name>31st Meeting of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, pp. <fpage>3887</fpage>&#x2013;<lpage>3896</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Xia</surname></string-name></person-group>, &#x201C;<article-title>Cascaded contextual reasoning for large-scale point cloud semantic segmentation</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>11</volume>, pp. <fpage>20755</fpage>&#x2013;<lpage>20768</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Pu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tong</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Wu</surname></string-name></person-group>, &#x201C;<article-title>Image-denoising algorithm based on improved K-singular value decomposition and atom optimization</article-title>,&#x201D; <source>CAAI Transactions on Intelligence Technology</source>, vol. <volume>7</volume>, no. <issue>1</issue>, pp. <fpage>117</fpage>&#x2013;<lpage>127</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Fu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Bao</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Scene segmentation with dual relation-aware attention network</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks and Learning Systems</source>, vol. <volume>32</volume>, no. <issue>6</issue>, pp. <fpage>2547</fpage>&#x2013;<lpage>2560</lpage>, <year>2021</year>; <pub-id pub-id-type="pmid">32745005</pub-id></mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>L. Z.</given-names> <surname>Fragonara</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Tsourdos</surname></string-name></person-group>, &#x201C;<article-title>GAPointNet: Graph attention based point neural network for exploiting local feature of point cloud</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>438</volume>, pp. <fpage>122</fpage>&#x2013;<lpage>132</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Guo</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Point attention network for point cloud semantic segmentation</article-title>,&#x201D; <source>Science China Information Sciences</source>, vol. <volume>65</volume>, pp. <fpage>192104</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Ding</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>X. B.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Alsaedi</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Gradient estimation algorithms for the parameter identification of bilinear systems using the auxiliary model</article-title>,&#x201D; <source>Journal of Computational and Applied Mathematics</source>, vol. <volume>369</volume>, pp. <fpage>112575</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Ding</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Filtered auxiliary model recursive generalized extended parameter estimation methods for Box&#x2013;Jenkins systems by means of the filtering identification idea</article-title>,&#x201D; <source>International Journal of Robust and Nonlinear Control</source>, vol. <volume>33</volume>, no. <issue>10</issue>, pp. <fpage>5510</fpage>&#x2013;<lpage>5535</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Ding</surname></string-name></person-group>, &#x201C;<article-title>Separable synthesis gradient estimation methods and convergence analysis for multivariable systems</article-title>,&#x201D; <source>Journal of Computational and Applied Mathematics</source>, vol. <volume>427</volume>, pp. <fpage>115104</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Ding</surname></string-name></person-group>, &#x201C;<article-title>Least squares parameter estimation and multi-innovation least squares methods for linear fitting problems from noisy data</article-title>,&#x201D; <source>Journal of Computational and Applied Mathematics</source>, vol. <volume>426</volume>, pp. <fpage>115107</fpage>, <year>2023</year>.</mixed-citation></ref>
</ref-list>
</back></article>