<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CSSE</journal-id>
<journal-id journal-id-type="nlm-ta">CSSE</journal-id>
<journal-id journal-id-type="publisher-id">CSSE</journal-id>
<journal-title-group>
<journal-title>Computer Systems Science &#x0026; Engineering</journal-title>
</journal-title-group>
<issn pub-type="ppub">0267-6192</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">39272</article-id>
<article-id pub-id-type="doi">10.32604/csse.2023.039272</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>RO-SLAM: A Robust SLAM for Unmanned Aerial Vehicles in a Dynamic Environment</article-title>
<alt-title alt-title-type="left-running-head">RO-SLAM: A Robust SLAM for Unmanned Aerial Vehicles in a Dynamic Environment</alt-title>
<alt-title alt-title-type="right-running-head">RO-SLAM: A Robust SLAM for Unmanned Aerial Vehicles in a Dynamic Environment</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Peng</surname><given-names>Jingtong</given-names></name><email>jingtong_peng@163.com</email></contrib>
<aff id="aff-1"><institution>Shanghai Advanced Research Institute, Chinese Academy of Sciences</institution>, <addr-line>Shanghai, 200120</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jingtong Peng. Email: <email>jingtong_peng@163.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>28</day><month>7</month><year>2023</year></pub-date>
<volume>47</volume>
<issue>2</issue>
<fpage>2275</fpage>
<lpage>2291</lpage>
<history>
<date date-type="received"><day>19</day><month>1</month><year>2023</year></date>
<date date-type="accepted"><day>18</day><month>4</month><year>2023</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Peng</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Peng</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CSSE_39272.pdf"></self-uri>
<abstract>
<p>When applied to Unmanned Aerial Vehicles (UAVs), existing Simultaneous Localization and Mapping (SLAM) algorithms are constrained by several factors, notably the interference of dynamic outdoor objects, the limited computing performance of UAVs, and the holes left in the map by dynamic object removal. To address these problems, we propose a new SLAM system for UAVs in dynamic environments based on ORB-SLAM2. We improve the Pyramid Scene Parsing Network (PSPNet) with Depthwise Separable Convolution to reduce the number of model parameters, and we incorporate an auxiliary loss function that supervises the hidden layer to enhance accuracy. The improved PSPNet is then used to detect whether there is a movable object in the scene. If there is, its feature points are removed in the tracking thread and do not participate in camera pose estimation. In addition, we propose a filling method based on Generative Adversarial Networks (GANs) for the holes caused by dynamic object removal in the map; it employs a new auxiliary descriptor to help the GAN restore static scenes based on semantic information. The proposed system is evaluated on the TUM dataset, and the results indicate that it performs better than DynaSLAM and DS-SLAM. On the Cityscapes dataset, the improved PSPNet achieves an Intersection over Union (IoU) of 0.812.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>UAVs</kwd>
<kwd>SLAM</kwd>
<kwd>semantic segmentation</kwd>
<kwd>dynamic point removal</kwd>
<kwd>GANs</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>With the evolution of communication methods [<xref ref-type="bibr" rid="ref-1">1</xref>], UAVs are used in an increasingly broad range of applications, particularly map building. Visual SLAM (VSLAM) is a vision-based localization and mapping technique that has been extensively researched in recent years. Thanks to the development of deep learning, research on SLAM in dynamic environments has produced relatively good results, such as DynaSLAM [<xref ref-type="bibr" rid="ref-2">2</xref>] and DS-SLAM [<xref ref-type="bibr" rid="ref-3">3</xref>]. The semantic information of the scene can help a visual SLAM system resist the interference of dynamic objects in the environment and provides additional auxiliary information for camera pose estimation. However, most existing visual SLAM systems are designed for small indoor areas. In indoor scenes, SLAM is often used on wheeled robots or fixed equipment, where high-performance computers are available to run the algorithm, and the environment is relatively simple. Therefore, the map-building task for indoor SLAM is relatively easy.</p>
<p>When the SLAM algorithm is used on UAVs, the working scene changes from indoors to outdoors, and three problems need to be solved: (1) Outdoor scenes contain more dynamic distractions than indoor scenes. (2) UAVs are constrained by battery life and payload, so it is difficult for them to carry high-performance computers. The computing platform carried by a UAV is generally an embedded platform with relatively weak computing performance. (3) Removing dynamic objects leaves many holes in the constructed map. The traditional remedy is to fill the holes with a multi-view geometric filling method, but this method is strongly affected by the accuracy of pose estimation. Several scholars have proposed SLAM algorithms designed explicitly for UAVs. Aguilar et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] proposed a high-precision real-time SLAM system that runs on a UAV by using an RGB-D sensor, the Microsoft Kinect, and a small but powerful computer. Bu et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] proposed a method for real-time incremental stitching of large-scale aerial images that uses a monocular SLAM system to estimate camera position and pose while generating a 3D point cloud. However, neither of these methods solves the above problems, and satisfactory solutions for UAV SLAM in dynamic environments are still lacking.</p>
<p>In this paper, we propose a new SLAM system based on ORB-SLAM2 [<xref ref-type="bibr" rid="ref-6">6</xref>] for UAVs in a dynamic environment. We name the system RO-SLAM, short for &#x201C;A Robust Outdoor SLAM.&#x201D; First, we use Depthwise Separable Convolution (DSC) to reduce the model size of PSPNet [<xref ref-type="bibr" rid="ref-7">7</xref>] and an auxiliary loss function to improve its accuracy. Then, we use the improved PSPNet to detect whether there is a movable object in the scene. If there is, its feature points are removed in the tracking thread and do not participate in camera pose estimation. In addition, we propose a filling method based on GANs that uses an auxiliary descriptor to fill static scenes according to semantic information, which enhances the fault tolerance of the SLAM system. We use the TUM dataset and the Cityscapes dataset to verify our method. The results show that the proposed method performs better than DynaSLAM and DS-SLAM on the TUM dataset.</p>
<p>In summary, we highlight our contributions here:</p>
<p>1. We propose a SLAM system that works in dynamic environments and uses a semantic segmentation network to eliminate moving objects. The system estimates trajectories more precisely than DS-SLAM and DynaSLAM.</p>
<p>2. We also propose a static background restoration method based on GANs. We use auxiliary descriptors to compensate for the shortcomings of the multi-view geometry method and to improve the SLAM system&#x2019;s fault tolerance.</p>
<p>3. Because the computing platforms carried by UAVs generally have limited performance, we use DSC to reduce the model size of PSPNet and an auxiliary loss function to improve its accuracy. This increases the potential for using semantic-based SLAM algorithms on UAVs.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Works</title>
<p>VSLAM methods for dynamic environments have developed over the years and can be divided into two categories: (1) semantic-based methods and (2) geometry-based methods.</p>
<sec id="s2_1"><label>2.1</label><title>Semantic Methods Based</title>
<p>In recent years, Deep Neural Networks (DNNs) have made significant strides in data analysis, and many researchers have used DNNs to extract semantic information from data. For instance, Flint et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] proposed an indoor mapping system that uses photometric cues, pose information, and sparse point cloud data obtained from a metric SLAM system to create a semantically meaningful map of the indoor environment. Kundu et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] proposed a novel framework for simultaneously performing semantic segmentation and 3D reconstruction from monocular video sequences. The authors used a DNN to estimate per-pixel depth and semantic labels and then used a fusion network to incorporate the estimated labels into the 3D reconstruction process. This approach significantly improved the accuracy of both semantic segmentation and 3D reconstruction, especially in challenging scenarios such as poorly lit or cluttered scenes. Hermans et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] employed a probabilistic graphical model to jointly estimate the 3D geometry and semantically labeled objects in the scene. The model is trained on a large dataset of RGB-D images and object annotations and incorporates both depth and color information to achieve robustness against cluttered scenes and occlusions. Stuckler et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] introduced a method for generating dense maps of object-class semantics in real time from RGB-D videos captured by a depth sensor. The approach employs a hierarchical Bayesian framework to jointly estimate the 3D geometry and semantic labels of objects. It uses a deep neural network for object detection and a Gaussian process model for semantic segmentation, enabling the system to handle varying object appearances and dynamics. Masaya et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] proposed a new SLAM algorithm that leverages semantic segmentation to improve feature detection and tracking. The algorithm employs DeepLab v2 [<xref ref-type="bibr" rid="ref-13">13</xref>] to mask out dynamic objects, which enhances the precision of camera pose estimation and the stability of the system. The approach outperformed baseline algorithms in challenging environments. Bescos et al. [<xref ref-type="bibr" rid="ref-2">2</xref>] proposed a dynamic SLAM algorithm capable of handling fast-moving and deformable objects in the environment. The algorithm combines semantic segmentation with traditional visual odometry and mapping techniques, resulting in improved tracking and mapping of the environment. It also uses inpainting to fill in the areas left missing by dynamic objects. Riazuelo et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] presented a SLAM algorithm that incorporates semantic segmentation to overcome challenges in densely populated environments. The algorithm reduces the impact of dynamic obstacles by selectively integrating observations from the static background while using semantic segmentation to identify and track non-static objects. Finally, Yu et al. [<xref ref-type="bibr" rid="ref-3">3</xref>] proposed a dynamic SLAM algorithm that incorporates SegNet [<xref ref-type="bibr" rid="ref-15">15</xref>] to handle dynamic environments. The algorithm can detect dynamic objects, track their movements, and update the map accordingly. The approach achieves state-of-the-art performance on several datasets, demonstrating its effectiveness in dynamic environments.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>Geometry Methods Based</title>
<p>The BaMVO [<xref ref-type="bibr" rid="ref-16">16</xref>] algorithm was proposed by Kim et al. to handle RGB-D sensors in dynamic environments. The algorithm estimates non-parametric background models from depth scenes to reduce the residual weight of dynamic objects. Another motion removal method was presented by Sun et al. [<xref ref-type="bibr" rid="ref-17">17</xref>], who utilized particle filtering to improve motion detection and then applied a map to identify the foreground. Raluca [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed a novel method to address dynamic objects, whereas most current methods still deploy outlier filtering techniques. Their approach utilized segmentation information to assign weights for dense RGB-D fusion. Emanuele et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] employed a robust geometric approach to handling moving objects without relying on semantic interpretation of the scene. Yang et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] presented a visual SLAM algorithm based on meshing and geometric constraints that uses both sparse feature points and dense depth images. This algorithm divides the scene into small blocks using meshing techniques, matches the blocks using geometric constraints, and excludes the influence of moving objects through dynamic object detection. Sun et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] proposed a moving-object removal approach for dynamic scene modeling with an RGB-D camera. The method analyzes the depth image to detect dynamic objects and removes them to generate a 3D model of the static scene. Liu et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed a general visual SLAM system, named DMS-SLAM, for real-time localization and map construction in dynamic environments with multiple sensors. By using motion segmentation to distinguish static and dynamic objects, they improved the precision and robustness of the SLAM algorithm. Moreover, the system uses multi-view geometric constraints and depth consistency checks to optimize the quality of the generated map. Song et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] presented a robust Bundle Adjustment (BA) that can reject features from dynamic objects by leveraging the pose prior estimated by IMU pre-integration. They then proposed keyframe and constraint grouping, based on multiple assumptions, to decrease the impact of temporarily stationary objects on loop closure.</p>
<p>The methods mentioned above have shown promising results in pose estimation. However, a solution is still needed that runs in real time in dynamic environments while resisting interference from moving objects. Additionally, these methods may have limitations in restoring static scenes. <xref ref-type="table" rid="table-1">Table 1</xref> compares various SLAM systems in dynamic environments.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Different SLAM systems in dynamic environments</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Type</th>
<th align="left">Framework</th>
<th align="left">Speed</th>
<th align="left">Hardware</th>
<th align="left">Scenes</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Flint et al. [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td align="left">SMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">CPU</td>
<td align="left">M</td>
</tr>
<tr>
<td align="left">Kundu et al. [<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td align="left">SMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">M</td>
</tr>
<tr>
<td align="left">Hermans et al. [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td align="left">SMB</td>
<td align="left">-</td>
<td align="left">0.75</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Stuckler et al. [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td align="left">SMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">GTX 675M</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Masaya et al. [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td align="left">SMB</td>
<td align="left">ORB-SLAM</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Bescos et al. [<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td align="left">SMB</td>
<td align="left">ORB-SLAM2</td>
<td align="left">2</td>
<td align="left">Titan X</td>
<td align="left">D/M/S</td>
</tr>
<tr>
<td align="left">Riazuelo et al. [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td align="left">SMB</td>
<td align="left">ORB-SLAM2</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Yu et al. [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td align="left">SMB</td>
<td align="left">ORB-SLAM2</td>
<td align="left">17</td>
<td align="left">P4000</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Kim et al. [<xref ref-type="bibr" rid="ref-16">16</xref>]</td>
<td align="left">GMB</td>
<td align="left">DVO</td>
<td align="left">23</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Sun et al. [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td align="left">GMB</td>
<td align="left">DVO</td>
<td align="left">-</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Raluca [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td align="left">GMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Emanuele et al. [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td align="left">GMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Yang et al. [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td align="left">GMB</td>
<td align="left">ORB-SLAM2</td>
<td align="left">23</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Sun et al. [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td align="left">GMB</td>
<td align="left">DVO</td>
<td align="left">0.1</td>
<td align="left">CPU</td>
<td align="left">D</td>
</tr>
<tr>
<td align="left">Liu et al. [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td align="left">GMB</td>
<td align="left">ORB-SLAM2</td>
<td align="left">30</td>
<td align="left">CPU</td>
<td align="left">D/M/S</td>
</tr>
<tr>
<td align="left">Song et al. [<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td align="left">GMB</td>
<td align="left">-</td>
<td align="left">-</td>
<td align="left">CPU</td>
<td align="left">M/S</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn1_1"><p>Note: &#x201C;D&#x201D; denotes an RGB-D camera, &#x201C;M&#x201D; a monocular camera, and &#x201C;S&#x201D; a stereo camera; &#x201C;SMB&#x201D; stands for &#x201C;Semantic Methods Based&#x201D; and &#x201C;GMB&#x201D; for &#x201C;Geometry Methods Based.&#x201D;</p></fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s3"><label>3</label><title>Proposed Method</title>
<sec id="s3_1"><label>3.1</label><title>Framework of RO-SLAM</title>
<p>The SLAM system in this paper is based on the ORB-SLAM2 framework, which we improve to address its shortcomings in dynamic environments and to meet the requirements of deployment on UAVs. The improved SLAM system is named RO-SLAM.</p>
<p>Once the SLAM system has been activated, it initializes the Intel RealSense camera to capture images. The tracking and semantic segmentation threads process these images in parallel. The tracking thread first extracts ORB (Oriented FAST and Rotated BRIEF) feature points from the captured images and then waits for the output of the semantic segmentation thread, which provides a semantic label for each pixel. When the segmentation results are available, the tracking thread generates semantic descriptors from the labels, which are used to identify dynamic feature points and exclude them from map construction. Concurrently, an auxiliary descriptor is produced to ensure robustness against potential point misplacements. After the dynamic feature points are removed, only stable static feature points remain; these are used for feature matching and map construction, which enables the system to map the environment accurately and navigate through it. The semantic segmentation mask is then overlaid on the image to mark the regions left missing by the removal, and the pre-trained GAN, guided by the auxiliary descriptor, inpaints these regions. Finally, the system combines the inpainting results, semantic information, and stable static feature points for map construction. The system structure is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>RO-SLAM structure</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-1.tif"/></fig>
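<p>To make the removal step concrete, the following Python fragment gives a minimal sketch of how feature points might be filtered against a per-pixel label map. It is an illustration under stated assumptions, not the RO-SLAM implementation: the function name <monospace>filter_static_keypoints</monospace>, the label ids in <monospace>MOVABLE_LABELS</monospace>, and the list-of-lists label map are all hypothetical.</p>
<code language="python"><![CDATA[
```python
from typing import AbstractSet, List, Sequence, Tuple

# Hypothetical ids of movable classes (placeholder values, not from the paper).
MOVABLE_LABELS = frozenset({11, 12})

Keypoint = Tuple[float, float]  # (x, y) in pixel coordinates


def filter_static_keypoints(
    keypoints: Sequence[Keypoint],
    label_map: Sequence[Sequence[int]],
    movable_labels: AbstractSet[int] = MOVABLE_LABELS,
) -> List[Keypoint]:
    """Keep only keypoints whose pixel is not labelled as a movable class."""
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    static = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        # Drop points that fall outside the segmentation map.
        if not (0 <= row < height and 0 <= col < width):
            continue
        # Drop points that lie on a movable object.
        if label_map[row][col] in movable_labels:
            continue
        static.append((x, y))
    return static


if __name__ == "__main__":
    # Toy 4 x 4 label map: 11 marks a movable object, 0 is static background.
    labels = [
        [0, 0, 0, 0],
        [0, 11, 11, 0],
        [0, 11, 11, 0],
        [0, 0, 0, 0],
    ]
    points = [(0.0, 0.0), (1.0, 1.0), (2.2, 1.9), (3.0, 3.0)]
    print(filter_static_keypoints(points, labels))
```
]]></code>
<p>In this toy example, the two points that fall on the labelled object are discarded and the two background points are kept. Only the surviving points would be passed on to pose estimation and feature matching.</p>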
</sec>
<sec id="s3_2"><label>3.2</label><title>Semantic Segmentation and Dynamic Point Removal</title>
<p>In order for this system to be practical in real-world scenarios, there needs to be a balance between accuracy and real-time performance. To reduce the model size, we use Depthwise Separable Convolution [<xref ref-type="bibr" rid="ref-24">24</xref>] to improve the backbone of PSPNet. The structure is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>Improved PSPNet structure</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-2.tif"/></fig>
<p>The improved backbone of PSPNet consists of five parts. The parameters of each part are listed in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Parameters of the new backbone</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Stage</th>
<th align="left">Components</th>
<th align="left">Output stride</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>B</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mn>1</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:mo>[</mml:mo><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>8</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mn>2</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>B</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mn>2</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:mo>[</mml:mo><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>8</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mn>4</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>B</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mn>3</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>64</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>16</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>16</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>64</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mn>8</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>B</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mn>4</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>128</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>32</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>32</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>128</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mn>8</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>B</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x00A0;</mml:mtext><mml:mn>5</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>256</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>64</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>d</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>64</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" 
/><mml:mrow><mml:mi>p</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mo>.</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mn>256</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula></td>
<td align="left"><inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mn>32</mml:mn></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><italic>Block</italic> 1 is a single standard convolution layer with a 3&#x2009;&#x00D7;&#x2009;3 kernel, which extracts shallow features of the input image, mostly contours and corners. <italic>Block</italic> 2 is a max-pooling layer. Semantic segmentation focuses on distinguishing the boundary between each instance and the scene, and a lightweight network must keep its parameter count low; we therefore use max-pooling to reduce the dimensionality of interior features. Although this approach may decrease accuracy, the balance between accuracy and inference speed is the key for lightweight real-time networks. <italic>Block</italic> 3 is built from Depthwise Separable Convolutions: it has four convolution layers with kernel sizes alternating between 3&#x2009;&#x00D7;&#x2009;3 and 1&#x2009;&#x00D7;&#x2009;1 and channel counts of (64, 16, 16, 64). Together with <italic>Block</italic> 4 and <italic>Block</italic> 5, it performs deep feature extraction. The entire backbone network uses 33 layers of Depthwise Separable Convolutions.</p>
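<p>To illustrate why the depthwise/pointwise bottleneck is lightweight, the following minimal sketch counts the weights of one <italic>Block</italic> 5 unit (channels 256, 64, 64, 256, as listed in the table) and compares it with the same unit built from standard 3&#x2009;&#x00D7;&#x2009;3 convolutions. The function names are hypothetical, biases and normalization layers are ignored, and the standard-convolution baseline is an assumption introduced only for comparison.</p>
<code language="python"><![CDATA[
```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # the 1 x 1 convolution mixes channels
    return depthwise + pointwise


def bottleneck_params(wide, narrow, use_separable=True):
    """One bottleneck unit: wide -> narrow -> wide channels."""
    count = depthwise_separable_params if use_separable else standard_conv_params
    return count(wide, narrow) + count(narrow, wide)


# Channel widths of one Block 5 unit as listed in the table: 256 -> 64 -> 256.
separable = bottleneck_params(256, 64, use_separable=True)
standard = bottleneck_params(256, 64, use_separable=False)
print(separable, standard, round(standard / separable, 2))
```
]]></code>
<p>Under these assumptions, the pointwise layers dominate the weight count of the separable unit, which is why narrowing the middle channels reduces the model size.</p>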
<p>The improved backbone network significantly reduces the number of parameters by alternating two sets of Depthwise Separable Convolutions. However, this bottleneck structure may suffer from vanishing gradients. To address this problem, we integrate deep supervision into the loss function used for training and use an auxiliary loss to supervise <italic>Block</italic> 4 and <italic>Block</italic> 5. The loss function is described as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the main loss function and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the auxiliary loss function. <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:math></inline-formula> represents the number of auxiliary loss branches, with <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:math></inline-formula> equal to the number of categories, and <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the proportion of each category in the total sample. Both <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> are computed by averaging the cross-entropy over all pixels according to the following equation:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup><mml:mrow><mml:msubsup><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2061;</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><italic>H</italic> and <italic>W</italic> are the height and width of the input image, so <italic>HW</italic> is its number of pixels. <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msubsup><mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the predicted score, which the softmax in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> converts into a probability, that pixel <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:math></inline-formula> belongs to category <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> represents the ground-truth category in the dataset, and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:math></inline-formula> represents the total number of categories in the dataset. We define the objective function as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the main objective function, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the auxiliary objective function for the hidden layers, and <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:math></inline-formula> represents the convolution kernel weights. <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the optimization objective of the output layer and is described as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is described as follows:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> represents the output layer weights. <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is defined as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>A</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> represents the input layer weight, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow></mml:math></inline-formula> is an artificial bias, and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is defined as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>We combine <xref ref-type="disp-formula" rid="eqn-4">Eqs. (4)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-7">(7)</xref>:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>F</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msup><mml:mo fence="false" 
stretchy="false">|</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Among them, <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are defined as follows:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo>&#x003C;</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>K</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>K</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003E;</mml:mo><mml:msubsup><mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2260;</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo>&#x003C;</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003E;</mml:mo><mml:msubsup><mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mrow><mml:mo>+</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x22C5;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> represents hidden layer variables, <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:math></inline-formula> represents ground truth, and <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the prediction of each branch.</p>
<p>In addition to learning the weight of the convolution kernel, the model can also learn sensitive features in the hidden layer.</p>
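<p>As a minimal sketch of how <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref> and <xref ref-type="disp-formula" rid="eqn-2">(2)</xref> could be evaluated, the following code averages the pixel-wise softmax cross-entropy and adds weighted auxiliary terms. It assumes that each auxiliary branch produces its own score map and therefore its own auxiliary loss; the function names, the toy scores, and the weights of 0.4 are hypothetical and not taken from the experiments.</p>
<code language="python"><![CDATA[
```python
import math


def pixel_cross_entropy(scores, labels):
    """Eq. (2): mean over pixels of -log softmax(scores)[ground-truth class].

    scores: list of per-pixel score lists (C values each), flattened over H*W.
    labels: list of ground-truth class indices, one per pixel.
    """
    total = 0.0
    for s, c_hat in zip(scores, labels):
        m = max(s)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in s))
        total += log_sum_exp - s[c_hat]
    return total / len(scores)


def total_loss(main_scores, aux_scores_list, labels, alphas):
    """Eq. (1): main loss plus the alpha-weighted loss of each auxiliary branch."""
    loss = pixel_cross_entropy(main_scores, labels)
    for alpha, aux_scores in zip(alphas, aux_scores_list):
        loss += alpha * pixel_cross_entropy(aux_scores, labels)
    return loss


# Toy example: two pixels, three classes, two auxiliary branches.
labels = [0, 2]
main = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
aux_a = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical branch output
aux_b = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]  # hypothetical branch output
print(total_loss(main, [aux_a, aux_b], labels, alphas=[0.4, 0.4]))
```
]]></code>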
<p>After obtaining the semantic information from the semantic segmentation model, we combine it with the ORB feature points to mark the feature points that belong to movable objects. The ORB feature points marked as dynamic are then eliminated from the original set:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">static</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">dynamic</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> refers to the original set of ORB feature points, while <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>dynamic</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the set of feature points that belong to objects that can potentially move within the surrounding environment. To ensure accurate feature point matching and pose estimation, we use only the subset <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>static</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, which contains the feature points with static positions in the environment.
</p>
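<p>The set difference in <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref> can be sketched as a simple lookup of each feature point in the segmentation mask. The code below is an illustrative sketch only: the function name, the string labels, and the nearest-pixel lookup are assumptions, and the actual system may represent keypoints and masks differently.</p>
<code language="python"><![CDATA[
```python
def split_static_dynamic(keypoints, mask, dynamic_labels):
    """Eq. (11): P_static = P_all - P_dynamic.

    keypoints: list of (u, v) pixel coordinates (column, row).
    mask: 2D list of per-pixel class labels from the segmentation network.
    dynamic_labels: set of labels treated as movable objects.
    """
    static, dynamic = [], []
    for u, v in keypoints:
        label = mask[int(round(v))][int(round(u))]  # nearest-pixel lookup
        if label in dynamic_labels:
            dynamic.append((u, v))
        else:
            static.append((u, v))
    return static, dynamic


# Toy 3 x 4 mask with one "person" pixel and one "car" pixel.
mask = [
    ["bg", "bg", "bg", "bg"],
    ["bg", "bg", "person", "bg"],
    ["bg", "bg", "bg", "car"],
]
p_all = [(0, 0), (2, 1), (3, 2)]
p_static, p_dynamic = split_static_dynamic(p_all, mask, {"person", "car", "bicycle"})
print(p_static, p_dynamic)
```
]]></code>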
<fig id="fig-9">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-9.tif"/>
</fig>
</sec>
<sec id="s3_3"><label>3.3</label><title>Auxiliary Descriptor and Static Scene Restoration</title>
<p>SIFT [<xref ref-type="bibr" rid="ref-25">25</xref>], SURF [<xref ref-type="bibr" rid="ref-26">26</xref>], and BRIEF [<xref ref-type="bibr" rid="ref-6">6</xref>] are common descriptors that supply photometric information about key points but carry no semantic information. To restore static scenes after removing dynamic points, we propose an auxiliary descriptor that records the semantic and location information of dynamic objects. An example of the auxiliary descriptor is as follows:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msup><mml:mi>D</mml:mi><mml:mrow><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mtable columnalign="left left left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi>p</mml:mi></mml:mtd><mml:mtd><mml:mi>c</mml:mi></mml:mtd><mml:mtd><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi>b</mml:mi></mml:mtd><mml:mtd><mml:mi>n</mml:mi></mml:mtd><mml:mtd><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>, the block <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mrow><mml:mo>(</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> encodes the class of the feature point (<inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:math></inline-formula> represents a person, <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:math></inline-formula> represents a car, <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:math></inline-formula> represents a bicycle, and <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:math></inline-formula> represents an unknown object). The class of the current area is the one whose entry is set to 1. <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the pixel coordinates of the center of the area, and <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula> represents the larger of the height and width of the area.</p>
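<p>The layout of <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref> can be illustrated with a small sketch that fills a 4&#x2009;&#x00D7;&#x2009;4 matrix. The slot positions follow the equation, but the text does not state how the four <italic>l</italic>, <italic>u</italic>, and <italic>v</italic> slots differ from one another, so this sketch assumes, purely for illustration, that each group repeats the same value. The function and constant names are hypothetical.</p>
<code language="python"><![CDATA[
```python
# (row, column) positions taken from the layout of Eq. (12).
CLASS_SLOTS = {"person": (1, 1), "car": (1, 2), "bicycle": (2, 1), "unknown": (2, 2)}
L_SLOTS = [(0, 0), (0, 3), (3, 0), (3, 3)]  # l1..l4 at the corners
U_SLOTS = [(0, 1), (0, 2), (3, 1), (3, 2)]  # u1..u4 on the top and bottom rows
V_SLOTS = [(1, 0), (1, 3), (2, 0), (2, 3)]  # v1..v4 on the left and right columns


def build_descriptor(cls, u, v, height, width):
    """Fill a 4 x 4 auxiliary descriptor for one dynamic area.

    Assumption: every l slot holds max(height, width), every u slot holds
    the center column u, and every v slot holds the center row v.
    """
    d = [[0] * 4 for _ in range(4)]
    row, col = CLASS_SLOTS[cls]
    d[row][col] = 1  # one-hot class entry in the central 2 x 2 block
    extent = max(height, width)
    for r, c in L_SLOTS:
        d[r][c] = extent
    for r, c in U_SLOTS:
        d[r][c] = u
    for r, c in V_SLOTS:
        d[r][c] = v
    return d


descriptor = build_descriptor("car", u=120, v=80, height=40, width=60)
for line in descriptor:
    print(line)
```
]]></code>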
<p>The removed feature points no longer participate in the construction of the point cloud map, which leaves many holes in the map and hampers downstream tasks. Therefore, the SLAM system must restore the static scene of each frame. The traditional approach is multi-view geometric filling, which projects each pixel of the keyframe onto the dynamic area of the current frame. However, this method cannot guarantee that every removed area is filled: when the feature points of the current frame do not appear in the keyframe database, the corresponding pixels remain unfilled. Moreover, if the estimate of the camera pose is inaccurate, the filling becomes highly ineffective and may even lead to ghosting artifacts.</p>
<p>We use a GAN to restore the static background after removing the dynamic objects. This method has the advantage of not depending on camera pose estimation. Feature Normalization (FN) is a widely used technique in training neural networks, which normalizes the features of input data across spatial dimensions. However, in the context of image inpainting, previous methods that apply FN have neglected the impact of damaged areas in the input image on normalization. The shifts in mean and variance caused by full-space FN can limit the effectiveness of training image inpainting networks. To overcome this issue, the Region Normalization (RN) approach partitions the pixels of the input image into distinct regions according to the input mask and computes the mean and variance of each region separately for normalization. This technique facilitates improved training of image inpainting networks. RN can be described as follows:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mrow><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msubsup><mml:mrow><mml:mi>&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>, <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msubsup><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is the mean value and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is the standard deviation of region <italic>k</italic>. Let <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2208;</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> be the full feature of the input, where <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>C</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>S</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represent the batch size, the number of channels, the length, and the width, respectively.
Using <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mi>C</mml:mi></mml:math></inline-formula> as the index, we divide <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> into <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:math></inline-formula> subregions:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x222A;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x222A;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>&#x222A;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
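<p>As a rough illustration of <xref ref-type="disp-formula" rid="eqn-13">Eqs. (13)</xref> and <xref ref-type="disp-formula" rid="eqn-14">(14)</xref>, the sketch below normalizes a single feature map region by region, using a mask to assign each pixel to a region. It is a simplified sketch rather than the network layer itself: the function name is hypothetical, the small constant <italic>eps</italic> added to the variance is an assumption for numerical stability, and learnable scale and shift parameters are omitted.</p>
<code language="python"><![CDATA[
```python
import math


def region_normalize(features, mask, eps=1e-5):
    """Normalize one (b, c) feature map separately inside each region.

    features: 2D list of floats (one channel of one sample).
    mask: 2D list of region indices (e.g. 0 = intact, 1 = removed area).
    eps: small constant added to the variance (assumed, for stability).
    """
    # Group pixel positions by region index: X_{b,c} = R^1 U R^2 U ... U R^K.
    regions = {}
    for r, mask_row in enumerate(mask):
        for col, k in enumerate(mask_row):
            regions.setdefault(k, []).append((r, col))

    out = [[0.0] * len(row) for row in features]
    for coords in regions.values():
        values = [features[r][col] for r, col in coords]
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        std = math.sqrt(var + eps)
        for r, col in coords:
            out[r][col] = (features[r][col] - mean) / std
    return out


# Two regions with very different statistics end up on the same scale.
features = [[1.0, 3.0], [10.0, 30.0]]
mask = [[0, 0], [1, 1]]
normalized = region_normalize(features, mask)
print(normalized)
```
]]></code>
<p>In this toy case, normalizing over the whole map would let the large values of the second region dominate the statistics, whereas the region-wise version brings both regions to zero mean and roughly unit variance.</p>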
<p><inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is a pixel in the input feature while <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>&#x2208;</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>&#x2208;</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>c</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>s</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the index of <inline-formula id="ieqn-66"><mml:math 
id="mml-ieqn-66"><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> on <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>C</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>S</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msubsup><mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> are described as follows:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msubsup><mml:mrow><mml:mi>&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msqrt><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msub></mml:mrow></mml:munder><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:msqrt></mml:math></disp-formula></p>
<p>Finally, we merge all subregions:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x222A;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x222A;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>&#x222A;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p>Using a GAN to repair images in SLAM has its own particularities. General image repair only needs the restored image to be plausible, whereas SLAM requires the restored image to be both plausible and consistent with the real scene. For example, the semantic segmentation network labels a car as a moving instance, and its feature points are removed. If the resulting hole is repaired with a general generative adversarial network, it may be filled with a pedestrian or another instance, whereas the correct fill is an empty road. When the hole is filled with other instances, the semantic information of the scene becomes confused. This is where the auxiliary descriptor is useful. When we input an image from which the dynamic points have been removed, the semantic segmentation mask covers the image to simulate the missing region. The image is then sent to the encoder of the corresponding attribute according to the information in the auxiliary descriptor. Region segmentation is performed according to the location information of the auxiliary descriptor:
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="italic">Dynamic</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">object</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">class</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">Dynamic</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">object</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="italic">class</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-18">Eq. (18)</xref>, the size of <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msubsup><mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mi>l</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. We use <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref> to conduct local normalization, merge A and B, and then get the repaired image through the decoder.</p>
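<p>The region-wise normalization of Eqs. (14)&#x2013;(18) can be illustrated with a minimal Python sketch. It is not the implementation used in RO-SLAM: the function name region_normalize, the label-map input, and the value of epsilon are hypothetical, and Eq. (13) is assumed here to be the plain standardization (x &#x2212; u)/&#x03C3; without a learned scale or shift. The sketch processes a single channel of a single image.</p>

```python
import math


def region_normalize(feature, labels, eps=1e-5):
    """Normalize each region of one feature channel with its own statistics.

    feature: 2-D list of floats, one channel X_{b,c} of size S x W.
    labels:  2-D list of ints with the same shape; labels[s][w] == k means
             that the pixel belongs to region R^k (for Eq. 18, for example,
             1 = dynamic-object mask and 2 = everything else).
    eps:     small constant inside the square root (assumed value).
    """
    # Group the pixel values by region label (the partition of Eq. 14).
    regions = {}
    for row_vals, row_labels in zip(feature, labels):
        for value, k in zip(row_vals, row_labels):
            regions.setdefault(k, []).append(value)

    # Per-region mean (Eq. 15) and standard deviation with epsilon (Eq. 16).
    stats = {}
    for k, values in regions.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[k] = (mean, math.sqrt(var + eps))

    # Standardize every pixel with the statistics of its own region. Writing
    # the results back into one array corresponds to the merge of Eq. (17).
    return [
        [(value - stats[k][0]) / stats[k][1]
         for value, k in zip(row_vals, row_labels)]
        for row_vals, row_labels in zip(feature, labels)
    ]
```

<p>With a binary dynamic-object mask as the label map, region 1 is the masked hole and region 2 is the remaining background, so, under this reading, the statistics of the hole do not distort those of the valid pixels.</p>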
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experimental Results and Discussion</title>
<sec id="s4_1"><label>4.1</label><title>Pose Estimation Precision</title>
<p>The TUM dataset [<xref ref-type="bibr" rid="ref-27">27</xref>] comprises 39 indoor scenes captured as image sequences. Each sequence includes 8-bit RGB images of size 640&#x2009;&#x00D7;&#x2009;480 with the corresponding 16-bit depth images and timestamps, and the dataset provides ground-truth camera trajectories. We used six dynamic-scene sequences from the TUM dataset to verify the proposed method. The sequences are as follows:</p>
<p>&#x2460; fr3/walking_halfsphere sequence; &#x2461; fr3/sitting_halfsphere sequence; &#x2462; fr3/walking_xyz sequence; &#x2463; fr3/sitting_xyz sequence; &#x2464; fr3/walking_rpy sequence; &#x2465; fr3/walking_static sequence.</p>
<p>To evaluate the proposed method, we performed comprehensive experiments and compared our results against those of ORB-SLAM2 and other dynamic SLAM systems. The fr3/walking_xyz sequence lasts 27&#x2005;s and contains 2884 frames. The absolute trajectory errors on fr3/walking_xyz are shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>Absolute trajectory errors of ORB-SLAM2 (left) and RO-SLAM (right) on the fr3/walking_xyz sequence</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-3.tif"/></fig>
<p>We also calculated the relative trajectory errors, which are shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Relative trajectory errors of ORB-SLAM2 (left) and RO-SLAM (right) on the fr3/walking_xyz sequence</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-4.tif"/></fig>
<p><xref ref-type="fig" rid="fig-3">Figs. 3</xref> and <xref ref-type="fig" rid="fig-4">4</xref> show that RO-SLAM estimates the camera pose considerably more accurately than ORB-SLAM2 in a dynamic environment.</p>

<p>We also compared the performance with DynaSLAM and DS-SLAM. To ensure objective and accurate results, we ran each experimental sequence ten times and took the median value. The detailed results can be found in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>The experiment results on the TUM dataset (RMSE of ATE) (unit: m)</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Sequence</th>
<th align="left">DynaSLAM</th>
<th align="left">DS-SLAM</th>
<th align="left">ORB-SLAM2</th>
<th align="left">RO-SLAM</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">fr3/walking_halfsphere</td>
<td align="left">0.0250</td>
<td align="left">0.0222</td>
<td align="left">0.3510</td>
<td align="left"><bold>0.0210</bold></td>
</tr>
<tr>
<td align="left">fr3/walking_xyz</td>
<td align="left">0.0150</td>
<td align="left">0.0151</td>
<td align="left">0.4590</td>
<td align="left"><bold>0.0146</bold></td>
</tr>
<tr>
<td align="left">fr3/walking_rpy</td>
<td align="left"><bold>0.0400</bold></td>
<td align="left">0.2835</td>
<td align="left">0.6620</td>
<td align="left">0.1300</td>
</tr>
<tr>
<td align="left">fr3/walking_static</td>
<td align="left">0.0090</td>
<td align="left"><bold>0.0067</bold></td>
<td align="left">0.0900</td>
<td align="left">0.0090</td>
</tr>
<tr>
<td align="left">fr3/sitting_halfsphere</td>
<td align="left">0.0170</td>
<td align="left">-</td>
<td align="left">0.0200</td>
<td align="left"><bold>0.0160</bold></td>
</tr>
<tr>
<td align="left">fr3/sitting_xyz</td>
<td align="left">0.0140</td>
<td align="left">-</td>
<td align="left"><bold>0.0090</bold></td>
<td align="left"><bold>0.0090</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-3">Table 3</xref> shows that RO-SLAM achieves the lowest absolute trajectory errors on the fr3/walking_halfsphere and fr3/walking_xyz sequences, 0.0210 m and 0.0146 m, respectively, ahead of both DynaSLAM and DS-SLAM. On the fr3/walking_rpy sequence, however, RO-SLAM (0.1300 m) is less precise than DynaSLAM (0.0400 m). A possible reason is that the method in this paper is not sensitive to low-angle rotation.</p>
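<p>The error metric reported in Table 3 can be sketched as follows. This Python example assumes that the estimated and ground-truth positions are already time-associated and rigidly aligned, a step that the TUM evaluation performs first and that is omitted here. The function names ate_rmse and median_ate are hypothetical and are not taken from the RO-SLAM code.</p>

```python
import math
import statistics


def ate_rmse(estimated, ground_truth):
    """RMSE of the translational error between two aligned trajectories.

    Both arguments are equally long lists of (x, y, z) positions in metres.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between corresponding positions.
    squared = [math.dist(p, q) ** 2 for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


def median_ate(runs, ground_truth):
    """Median ATE RMSE over repeated runs of the same sequence."""
    return statistics.median(ate_rmse(run, ground_truth) for run in runs)
```

<p>Taking the median over the ten runs, rather than the mean, keeps a single failed run from dominating the reported value.</p>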

</sec>
<sec id="s4_2"><label>4.2</label><title>Semantic Segmentation Accuracy</title>
<p>The improved PSPNet is built with the PyTorch 1.5.1 framework, and the network is randomly initialized under the default settings of PyTorch. Model training is performed on an NVIDIA Tesla V100-FHHL-16 GB. The improved PSPNet is trained on the standard Cityscapes dataset [<xref ref-type="bibr" rid="ref-28">28</xref>], which is divided into 2975:500:1525 images (training:validation:testing). The optimizer is Adam, the learning rate is 0.01, and training runs for 100 epochs. The convergence of the model training is shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Model training convergence</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-5.tif"/></fig>
<p>We also compared the proposed method on the Cityscapes dataset with classic segmentation methods that are often used in SLAM. <xref ref-type="table" rid="table-4">Table 4</xref> shows the results.</p>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Semantic segmentation on the Cityscapes dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">IoU</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">SegNet [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td align="left">0.570</td>
</tr>
<tr>
<td align="left">FCN [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td align="left">0.653</td>
</tr>
<tr>
<td align="left">DPN [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td align="left">0.668</td>
</tr>
<tr>
<td align="left">DeepLab [<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
<td align="left">0.704</td>
</tr>
<tr>
<td align="left">PSPNet [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td align="left">0.784</td>
</tr>
<tr>
<td align="left">Improved PSPNet</td>
<td align="left"><bold>0.812</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The proposed method outperforms the other classic methods. Compared with PSPNet, it raises the IoU from 0.784 to 0.812, an improvement of 2.8 percentage points. The segmentation results of PSPNet and the proposed method in an actual scene are presented in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>The results of semantic segmentation in an actual scene</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-6.tif"/></fig>
<p>PSPNet produces false detections on the upper part of the person and misses the tree trunk on the left and the electric lights. The proposed method segments the scene more regularly, with no false or missed detections.</p>
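<p>For reference, the IoU metric of Table 4 can be computed as in the following Python sketch. It assumes that the reported IoU is averaged over classes, which the text does not state explicitly, and the function names class_iou and mean_iou are hypothetical.</p>

```python
def class_iou(pred, target, cls):
    """IoU of one class between two label maps given as 2-D lists of ints."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == cls, t == cls
            inter += in_pred and in_target   # pixel is in both maps
            union += in_pred or in_target    # pixel is in at least one map
    # A class that is absent from both maps has no defined IoU.
    return inter / union if union else None


def mean_iou(pred, target, classes):
    """Mean IoU over the classes that occur in the prediction or the target."""
    scores = [class_iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("none of the classes occurs in either label map")
    return sum(scores) / len(scores)
```

<p>A false detection enlarges the union and a missed detection shrinks the intersection, so both kinds of error lower the score.</p>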
</sec>
<sec id="s4_3"><label>4.3</label><title>Map Building in Real Environment and Efficiency Analysis</title>
<p>We experimented in a real environment; the test scene is a straight road with pedestrians and cars. The DJI Matrice 600, a heavy-payload hexacopter UAV, was chosen as the aerial platform of this study. The UAV flew at a height of 3.5 meters over a distance of 50 meters, along a straight line from south to north at a speed of 1 meter per second. If the UAV flies too fast or along too irregular a path, the images may be blurry or distorted, making it difficult to accurately detect and classify objects or features in the scene. We implemented our system in C&#x002B;&#x002B; and ran it on a computer with the Ubuntu 16.04 operating system, an Intel i7-8750H@2.20&#x2005;GHz CPU, 16 GB of memory, and an NVIDIA GTX1060-6 GB graphics card. We used an Intel RealSense D455 depth camera to capture the video. The videos comprise RGB and depth components, and each frame is 640 pixels wide by 480 pixels high. The camera and the computer communicate over a USB Type-C data cable. <xref ref-type="fig" rid="fig-7">Fig. 7</xref> presents the system diagram of the software and hardware configuration in this experiment.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>System diagram of the experiment</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-7.tif"/></fig>
<p>The point cloud map constructed in the real environment is shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. <xref ref-type="fig" rid="fig-8">Fig. 8B</xref> shows that dynamic objects are retained in the map when they are not removed. <xref ref-type="fig" rid="fig-8">Fig. 8C</xref> shows the holes caused by dynamic object removal. <xref ref-type="fig" rid="fig-8">Figs. 8D</xref> and <xref ref-type="fig" rid="fig-8">8E</xref> show that the point cloud map constructed by the proposed system is clear and not affected by dynamic objects. The static scene restore module fills the map holes caused by dynamic object removal, and the result is almost consistent with the actual scene.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>Point cloud map of a real environment</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-8a.tif"/><graphic mimetype="image" mime-subtype="tif" xlink:href="CSSE_39272-fig-8b.tif"/></fig>
<p>To evaluate the computation time, we compared the proposed method with DynaSLAM. We conducted ten trials and computed the average time per frame, which is presented in <xref ref-type="table" rid="table-5">Table 5</xref>.</p>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Average time expense (unit: ms)</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Time</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">DynaSLAM with Mask R-CNN</td>
<td align="left">235.98</td>
</tr>
<tr>
<td align="left">DynaSLAM with PSPNet</td>
<td align="left">198.42</td>
</tr>
<tr>
<td align="left">RO-SLAM with improved PSPNet</td>
<td align="left">137.96</td>
</tr>
</tbody>
</table>
</table-wrap>
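<p>The results show that RO-SLAM is faster than DynaSLAM: at 137.96 ms per frame, it takes about 30% less time than DynaSLAM with PSPNet (198.42 ms) and about 42% less than DynaSLAM with Mask R-CNN (235.98 ms).</p>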
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>In this paper, we proposed RO-SLAM, a visual SLAM algorithm designed for UAVs operating in dynamic outdoor environments and implemented on the ORB-SLAM2 framework. The algorithm uses the semantic segmentation results of the improved PSPNet to eliminate dynamic feature points from the original feature points and to construct an auxiliary descriptor. Eliminating the feature points of dynamic objects improves the precision of camera pose estimation. Additionally, we proposed a new static scene restore method based on GANs, which addresses the shortcomings of traditional geometry-based methods. We performed experiments on the TUM dataset to validate the algorithm in highly dynamic environments and compared it with existing algorithms. RO-SLAM achieved a much lower absolute trajectory error than ORB-SLAM2 on the walking sequences and results comparable to DynaSLAM and DS-SLAM on most sequences. We also collected data with a UAV in a real outdoor environment to verify RO-SLAM&#x2019;s map construction ability. The static scene restore module filled the map holes caused by dynamic object removal, which allowed a precise map to be constructed. In future work, we aim to investigate how machine learning-based schemes can be leveraged for data communication in SLAM.</p>
</sec>
</body>
<back>
<ack>
<p>The author wishes to express gratitude to the entire staff of the CIS lab at the Shanghai Advanced Research Institute for their valuable technical discussions, helpful suggestions, and provision of hardware resources.</p>
</ack>
<sec><title>Funding Statement</title>
<p>The author received no funding for this study.</p></sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The author declares that he has no conflicts of interest to report regarding the present study.</p></sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yihan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jinhui</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zhou</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Learning-empowered resource allocation for air slicing in UAV-assisted cellular V2X communications</article-title>,&#x201D; <source>IEEE Systems Journal</source>, vol. <volume>17</volume>, no. <issue>1</issue>, pp. <fpage>1008</fpage>&#x2013;<lpage>1011</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Bescos</surname></string-name>, <string-name><given-names>J. M.</given-names> <surname>F&#x00E1;cil</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Civera</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Neira</surname></string-name></person-group>, &#x201C;<article-title>DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes</article-title>,&#x201D; <source>IEEE Robotics and Automation Letters</source>, vol. <volume>3</volume>, no. <issue>4</issue>, pp. <fpage>4076</fpage>&#x2013;<lpage>4083</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X. J.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Xie</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>DS-SLAM: A semantic visual SLAM towards dynamic environments</article-title>,&#x201D; in <conf-name>Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems</conf-name>, <conf-loc>Madrid, Spain</conf-loc>, pp. <fpage>1168</fpage>&#x2013;<lpage>1174</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W. G.</given-names> <surname>Aguilar</surname></string-name>, <string-name><given-names>G. A.</given-names> <surname>Rodr&#x00ED;guez</surname></string-name>, <string-name><given-names>L.</given-names> <surname>&#x00C1;lvarez</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Sandoval</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Quisaguano</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Visual SLAM with a RGB-D camera on a quadrotor UAV using on-board processing</article-title>,&#x201D; in <conf-name>Proc. of Int. Work-Conf. on Artificial Neural Networks</conf-name>, <conf-loc>Alghero, Sardinia, Italy</conf-loc>, pp. <fpage>596</fpage>&#x2013;<lpage>606</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Bu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Wan</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Map2DFusion: Real-time incremental UAV image mosaicing based on monocular SLAM</article-title>,&#x201D; in <conf-name>Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems</conf-name>, <conf-loc>Daejeon, Korea</conf-loc>, pp. <fpage>4564</fpage>&#x2013;<lpage>4571</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Mur-Artal</surname></string-name></person-group> and <person-group person-group-type="author"><string-name><given-names>J. D.</given-names> <surname>Tard&#x00F3;s</surname></string-name></person-group>, &#x201C;<article-title>ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras</article-title>,&#x201D; <source>IEEE Transactions on Robotics</source>, vol. <volume>33</volume>, no. <issue>5</issue>, pp. <fpage>1255</fpage>&#x2013;<lpage>1262</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Jia</surname></string-name></person-group>, &#x201C;<article-title>Pyramid scene parsing network</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, Hawaii, USA</conf-loc>, pp. <fpage>2881</fpage>&#x2013;<lpage>2890</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Flint</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Mei</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Murray</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Reid</surname></string-name></person-group>, &#x201C;<article-title>Growing semantically meaningful models for visual slam</article-title>,&#x201D; in <conf-name>Proc. of IEEE Computer Vision and Pattern Recognition</conf-name>, <conf-loc>San Francisco, California, USA</conf-loc>, pp. <fpage>467</fpage>&#x2013;<lpage>474</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Kundu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Dellaert</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>J. M.</given-names> <surname>Rehg</surname></string-name></person-group>, &#x201C;<article-title>Joint semantic segmentation and 3D reconstruction from monocular video</article-title>,&#x201D; in <conf-name>Proc. of European Conf. on Computer Vision</conf-name>, <conf-loc>Zurich, Switzerland</conf-loc>, pp. <fpage>703</fpage>&#x2013;<lpage>718</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Hermans</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Floros</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Leibe</surname></string-name></person-group>, &#x201C;<article-title>Dense 3D semantic mapping of indoor scenes from RGB-D images</article-title>,&#x201D; in <conf-name>Proc. of IEEE Int. Conf. on Robotics and Automation</conf-name>, <conf-loc>Miami, Florida, USA</conf-loc>, pp. <fpage>2631</fpage>&#x2013;<lpage>2638</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Stuckler</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Waldvogel</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Schulz</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Behnke</surname></string-name></person-group>, &#x201C;<article-title>Dense real-time mapping of object-class semantics from RGB-D video</article-title>,&#x201D; <source>Journal of Real-Time Image Processing</source>, vol. <volume>10</volume>, no. <issue>4</issue>, pp. <fpage>599</fpage>&#x2013;<lpage>609</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Kaneko</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Iwami</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ogawa</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Yamasaki</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Aizawa</surname></string-name></person-group>, &#x201C;<article-title>Mask-SLAM: Robust feature-based monocular SLAM by masking using semantic segmentation</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition Workshops</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, pp. <fpage>258</fpage>&#x2013;<lpage>266</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L. C.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Papandreou</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Kokkinos</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Murphy</surname></string-name> and <string-name><given-names>A. L.</given-names> <surname>Yuille</surname></string-name></person-group>, &#x201C;<article-title>DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>40</volume>, no. <issue>4</issue>, pp. <fpage>834</fpage>&#x2013;<lpage>848</lpage>, <year>2017</year>; <pub-id pub-id-type="pmid">28463186</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Riazuelo</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Montano</surname></string-name> and <string-name><given-names>J. M. M.</given-names> <surname>Montiel</surname></string-name></person-group>, &#x201C;<article-title>Semantic visual SLAM in populated environments</article-title>,&#x201D; in <conf-name>Proc. of European Conf. on Mobile Robots</conf-name>, <conf-loc>Paris, France</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>7</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Badrinarayanan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kendall</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Cipolla</surname></string-name></person-group>, &#x201C;<article-title>SegNet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>39</volume>, no. <issue>12</issue>, pp. <fpage>2481</fpage>&#x2013;<lpage>2495</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. H.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>J. H.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Effective background model-based RGB-D dense visual odometry in a dynamic environment</article-title>,&#x201D; <source>IEEE Transactions on Robotics</source>, vol. <volume>32</volume>, no. <issue>6</issue>, pp. <fpage>1565</fpage>&#x2013;<lpage>1573</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>M. Q. H.</given-names> <surname>Meng</surname></string-name></person-group>, &#x201C;<article-title>Improving RGB-D SLAM in dynamic environments: A motion removal approach</article-title>,&#x201D; <source>Robotics and Autonomous Systems</source>, vol. <volume>89</volume>, no. <issue>4</issue>, pp. <fpage>110</fpage>&#x2013;<lpage>122</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Scona</surname></string-name></person-group>, &#x201C;<article-title>StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments</article-title>,&#x201D; in <conf-name>Proc. of IEEE Int. Conf. on Robotics and Automation</conf-name>, <conf-loc>Brisbane, Queensland, Australia</conf-loc>, pp. <fpage>67</fpage>&#x2013;<lpage>79</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Palazzolo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Behley</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Lottes</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Giguere</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Stachniss</surname></string-name></person-group>, &#x201C;<article-title>ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Intelligent Robots and Systems</conf-name>, <conf-loc>Macao, China</conf-loc>, pp. <fpage>7855</fpage>&#x2013;<lpage>7862</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>MGC-VSLAM: A meshing-based and geometric constraint VSLAM for dynamic indoor environments</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>8</volume>, no. <issue>3</issue>, pp. <fpage>81007</fpage>&#x2013;<lpage>81021</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>M. Q. H.</given-names> <surname>Meng</surname></string-name></person-group>, &#x201C;<article-title>Invisibility: A moving-object removal approach for dynamic scene modelling using RGB-D camera</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Int. Conf. on Robotics and Biomimetics</conf-name>, <conf-loc>Macao, China</conf-loc>, pp. <fpage>50</fpage>&#x2013;<lpage>55</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zeng</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Feng</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Xu</surname></string-name></person-group>, &#x201C;<article-title>DMS-SLAM: A general visual SLAM system for dynamic scenes with multiple sensors</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>19</volume>, no. <issue>17</issue>, p. <fpage>3714</fpage>, <year>2019</year>; <pub-id pub-id-type="pmid">31461943</pub-id></mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Lim</surname></string-name>, <string-name><given-names>A. J.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Myung</surname></string-name></person-group>, &#x201C;<article-title>DynaVINS: A visual-inertial SLAM for dynamic environments</article-title>,&#x201D; <source>IEEE Robotics and Automation Letters</source>, vol. <volume>7</volume>, no. <issue>4</issue>, pp. <fpage>11523</fpage>&#x2013;<lpage>11530</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Chollet</surname></string-name></person-group>, &#x201C;<article-title>Xception: Deep learning with depthwise separable convolutions</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>1251</fpage>&#x2013;<lpage>1258</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P. C.</given-names> <surname>Ng</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Henikoff</surname></string-name></person-group>, &#x201C;<article-title>SIFT: Predicting amino acid changes that affect protein function</article-title>,&#x201D; <source>Nucleic Acids Research</source>, vol. <volume>31</volume>, no. <issue>13</issue>, pp. <fpage>3812</fpage>&#x2013;<lpage>3814</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Bay</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Tuytelaars</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Van Gool</surname></string-name></person-group>, &#x201C;<article-title>SURF: Speeded up robust features</article-title>,&#x201D; in <conf-name>Proc. of European Conf. on Computer Vision</conf-name>, <conf-loc>Graz, Austria</conf-loc>, pp. <fpage>404</fpage>&#x2013;<lpage>417</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Sturm</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Magnenat</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Engelhard</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Pomerleau</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Colas</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Towards a benchmark for RGB-D SLAM evaluation</article-title>,&#x201D; in <conf-name>Proc. of RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf.</conf-name>, <conf-loc>New York City, NY, USA</conf-loc>, pp. <fpage>231</fpage>&#x2013;<lpage>236</lpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Cordts</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Omran</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ramos</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Scharw&#x00E4;chter</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Enzweiler</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>The Cityscapes dataset</article-title>,&#x201D; in <conf-name>Proc. of CVPR Workshop on the Future of Datasets in Vision</conf-name>, <conf-loc>Boston, MA, USA</conf-loc>, pp. <fpage>23</fpage>&#x2013;<lpage>41</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Long</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Shelhamer</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Darrell</surname></string-name></person-group>, &#x201C;<article-title>Fully convolutional networks for semantic segmentation</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Boston, MA, USA</conf-loc>, pp. <fpage>3431</fpage>&#x2013;<lpage>3440</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Yan</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Dual path networks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>30</volume>, no. <issue>4</issue>, pp. <fpage>184</fpage>&#x2013;<lpage>199</lpage>, <year>2017</year>.</mixed-citation></ref>
</ref-list>
</back></article>