<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">65205</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.065205</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Research on Adaptive Reward Optimization Method for Robot Navigation in Complex Dynamic Environment</article-title>
<alt-title alt-title-type="left-running-head">Research on Adaptive Reward Optimization Method for Robot Navigation in Complex Dynamic Environment</alt-title>
<alt-title alt-title-type="right-running-head">Research on Adaptive Reward Optimization Method for Robot Navigation in Complex Dynamic Environment</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>He</surname><given-names>Jie</given-names></name></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Dongmei</given-names></name></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Liu</surname><given-names>Tao</given-names></name><email>swust_lt@sina.com</email></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zou</surname><given-names>Qingfeng</given-names></name></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Xie</surname><given-names>Jian&#x2019;an</given-names></name></contrib>
<aff id="aff-1"><institution>School of Computer Science and Technology, Southwest University of Science and Technology</institution>, <addr-line>Mianyang, 621010</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Tao Liu. Email: <email>swust_lt@sina.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>03</day><month>07</month><year>2025</year>
</pub-date>
<volume>84</volume>
<issue>2</issue>
<fpage>2733</fpage>
<lpage>2749</lpage>
<history>
<date date-type="received">
<day>06</day>
<month>3</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>5</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_65205.pdf"></self-uri>
<abstract>
<p>Robot navigation in complex crowd service scenarios, such as medical logistics and commercial guidance, requires a dynamic balance between safety and efficiency, while the traditional fixed reward mechanism lacks environmental adaptability and struggles to adapt to the variability of crowd density and pedestrian motion patterns. This paper proposes a navigation method that integrates spatiotemporal risk field modeling and adaptive reward optimization, aiming to improve the robot&#x2019;s decision-making ability in diverse crowd scenarios through dynamic risk assessment and nonlinear weight adjustment. We construct a spatiotemporal risk field model based on a Gaussian kernel function by combining crowd density, relative distance, and motion speed to quantify environmental complexity and realize crowd-density-sensitive risk assessment dynamically. We apply an exponential decay function to reward design to address the linear conflict problem of fixed weights in multi-objective optimization. We adaptively adjust weight allocation between safety constraints and navigation efficiency based on real-time risk values, prioritizing safety in highly dense areas and navigation efficiency in sparse areas. Experimental results show that our method improves the navigation success rate by 9.0% over state-of-the-art models in high-density scenarios, with a 10.7% reduction in intrusion time ratio. Simulation comparisons validate the risk field model&#x2019;s ability to capture risk superposition effects in dense scenarios and the suppression of near-field dangerous behaviors by the exponential decay mechanism. Our parametric optimization paradigm establishes an explicit mapping between navigation objectives and risk parameters through rigorous mathematical formalization, providing an interpretable approach for safe deployment of service robots in dynamic environments.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Machine learning</kwd>
<kwd>reinforcement learning</kwd>
<kwd>robots</kwd>
<kwd>autonomous navigation</kwd>
<kwd>reward shaping</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Sichuan Science and Technology Program</funding-source>
<award-id>2025ZNSFSC0005</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The proliferation of service robots in public spaces&#x2014;from hospital logistics to commercial guidance systems&#x2014;has created unprecedented demands for safe and efficient navigation in human-dominated environments [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. While traditional navigation algorithms achieve satisfactory performance in structured industrial settings [<xref ref-type="bibr" rid="ref-3">3</xref>], their effectiveness diminishes significantly in dynamic crowd scenarios characterized by rapidly evolving pedestrian movements, heterogeneous motion patterns, and time-varying social constraints [<xref ref-type="bibr" rid="ref-4">4</xref>]. This limitation becomes particularly critical in safety-sensitive domains like medical delivery, where collision risks could lead to catastrophic consequences, and mall environments requiring socially compliant navigation to ensure user acceptance [<xref ref-type="bibr" rid="ref-5">5</xref>].</p>
<p>Deep reinforcement learning (DRL) approaches have demonstrated remarkable progress in handling environmental uncertainties through end-to-end policy learning [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-9">9</xref>]. Still, their reliance on fixed reward mechanisms creates fundamental limitations in real-world crowd navigation. A primary constraint lies in the static safety-efficiency trade-offs of conventional multi-objective reward functions, which assign fixed weights to collision avoidance and navigation efficiency [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>] while ignoring the context-dependent nature of human-robot interaction. This rigidity becomes particularly problematic when considering scenario variations: safety constraints should dominate in high-density environments like hospital corridors during peak hours to prevent collisions. In contrast, efficiency should take priority in sparse settings such as late-night commercial spaces to optimize energy consumption and task completion time. Additionally, existing Gaussian-based reward formulations [<xref ref-type="bibr" rid="ref-11">11</xref>] exhibit inadequate risk quantification by failing to capture emergent risks from collective crowd dynamics, including pedestrian group movements and velocity-dependent collision probabilities. This deficiency intensifies when handling risk superposition effects in dense crowds [<xref ref-type="bibr" rid="ref-3">3</xref>]. Another critical limitation stems from computational inefficiency, where the quadratic computational complexity of pairwise distance evaluations in dense environments severely degrades real-time performance. 
This bottleneck often forces robots to adopt overly conservative navigation strategies that compromise operational fluency [<xref ref-type="bibr" rid="ref-12">12</xref>], highlighting the need for adaptive algorithmic frameworks in practical deployments.</p>
<p>These limitations stem from a critical gap in current research: the absence of dynamic reward mechanisms that explicitly couple environmental complexity with navigation objectives. While recent works attempt to enhance adaptability through different approaches, they exhibit distinct limitations. For instance, GST &#x002B; HH Attn [<xref ref-type="bibr" rid="ref-9">9</xref>] introduces attention-based interaction modeling and multi-step trajectory prediction to improve intention awareness. Yet its reward function relies on fixed penalties for predicted collisions without dynamically reweighting safety-efficiency tradeoffs based on real-time crowd density. In contrast, TGRF [<xref ref-type="bibr" rid="ref-11">11</xref>] proposes a flexible Gaussian-shaped reward structure to reduce hyperparameter tuning, but its adaptability primarily targets static object characteristics rather than explicitly addressing dynamic crowd motion patterns. Both methods lack mechanisms to dynamically reweight safety-efficiency objectives based on real-time crowd density and motion characteristics.</p>
<p>To address these challenges, this work makes three primary contributions: (1) A Gaussian kernel-based spatiotemporal risk field that quantifies environmental complexity by integrating crowd density, relative distance, and pedestrian velocity into a unified risk metric, enabling real-time assessment of emergent crowd behaviors. (2) An exponential decay reward mechanism that nonlinearly adjusts safety constraints based on instantaneous risk levels, automatically prioritizing collision avoidance in dense regions while permitting efficient navigation in sparse areas. (3) A parametric optimization framework establishes explicit mappings between risk parameters and navigation performance, providing interpretable guidelines for deploying service robots across diverse operational scenarios.</p>
<p>Our experimental validation demonstrates that this approach fundamentally transforms the safety-efficiency trade-off paradigm. In high-density environments (0.21 persons/m<sup>2</sup>), the proposed method achieves a 9.0% higher success rate than state-of-the-art baselines while reducing human space intrusion time by 10.7%. These advancements hold significant implications for deploying service robots in real-world applications where adaptive behavior is paramount, from hospital logistics to crowded urban service platforms.</p>
<p>The remainder of this paper is organized as follows: <xref ref-type="sec" rid="s2">Section 2</xref> reviews related works in navigation algorithms and reward shaping. <xref ref-type="sec" rid="s3">Section 3</xref> details our risk field modeling and adaptive reward framework. <xref ref-type="sec" rid="s4">Sections 4</xref> and <xref ref-type="sec" rid="s5">5</xref> present experimental results and discussions, respectively. Finally, <xref ref-type="sec" rid="s6">Section 6</xref> concludes with future research directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<sec id="s2_1">
<label>2.1</label>
<title> Research on Robot Navigation Methods</title>
<p>There has been a notable transition in research methodologies in robot navigation, shifting from conventional deterministic algorithms to learning-based approaches. Early navigation algorithms primarily relied on search-based methods, such as the A&#x002A; algorithm [<xref ref-type="bibr" rid="ref-13">13</xref>]. These methods guarantee completeness and optimality in discrete spaces; however, their computational complexity grows exponentially with increasing dimensions, leading to the &#x201C;curse of dimensionality&#x201D; [<xref ref-type="bibr" rid="ref-14">14</xref>]. Subsequently, reactive local planning methods, including artificial potential fields, the Dynamic Window Approach (DWA), and the Timed Elastic Band (TEB), garnered significant attention, but they often become trapped in local optima in complex dynamic environments. As research advanced, Optimal Reciprocal Collision Avoidance (ORCA) [<xref ref-type="bibr" rid="ref-15">15</xref>] was proposed, which computes collision-free optimal velocities through linear programming in velocity space, mitigating potential deadlocks and oscillations in dense environments and thereby alleviating local-optima issues in complex dynamic scenarios.</p>
<p>Recent advancements in deep reinforcement learning (DRL) and graph neural networks (GNNs) have enabled novel solutions for robot navigation in socially complex environments. DRL trains agents through trial-and-error interactions to maximize cumulative rewards, allowing robots to adapt to human behaviors and environmental uncertainties dynamically. GNNs, meanwhile, excel at modeling relational dynamics in scenarios with multiple interacting agents, such as human-robot coexistence. For example, Chen et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] designed an attention-based DRL framework to improve navigation by explicitly encoding human-robot and human-human interactions. At the same time, Liu et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] proposed a Decentralized Structured Recurrent Neural Network (DS-RNN) capable of operating in dense crowds and partially observable settings. Furthermore, GNNs are increasingly being incorporated into navigation frameworks: Chen et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] leveraged Graph Convolutional Networks (GCNs) to optimize navigation by learning human attention weights, and Zhou and Garcke [<xref ref-type="bibr" rid="ref-17">17</xref>] developed a spatiotemporal graph architecture with attention mechanisms to capture human intentions and social norms, thereby enhancing navigation performance. Nevertheless, challenges persist in ensuring decision stability and real-time responsiveness in highly dynamic, densely populated environments.</p>
<p>Despite these advancements, existing methods still face critical limitations in highly dynamic crowd environments. Traditional search-based algorithms (e.g., A&#x002A;) suffer from the curse of dimensionality and lack adaptability to dynamic obstacles. While ORCA improves obstacle avoidance through motion prediction, it struggles in high-density scenarios due to limited predictive accuracy for collective crowd behaviors. Although DRL and GNN-based approaches enable end-to-end learning and social interaction modeling, their reliance on fixed reward weights often leads to suboptimal trade-offs between safety and efficiency across varying crowd densities. These limitations highlight the need for adaptive mechanisms that dynamically adjust risk assessment and reward allocation based on real-time environmental complexity.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title> Design of Reward Functions</title>
<p>The design of the reward functions represents the primary challenge in reinforcement learning-based robot navigation [<xref ref-type="bibr" rid="ref-18">18</xref>], as their mathematical formulation directly influences strategy convergence and operational safety [<xref ref-type="bibr" rid="ref-19">19</xref>].</p>
<p>Existing research [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>&#x2013;<xref ref-type="bibr" rid="ref-22">22</xref>] primarily utilizes multi-objective weighted fusion to optimize navigation, incorporating reward components for target approach, collision avoidance, social distance maintenance, and path efficiency. Social distance and path efficiency rewards typically utilize distance-based penalty functions such as L2 norms [<xref ref-type="bibr" rid="ref-9">9</xref>] or Gaussian distributions [<xref ref-type="bibr" rid="ref-11">11</xref>], which quantify discomfort through human-robot distance metrics while integrating prior knowledge of socially acceptable spacing [<xref ref-type="bibr" rid="ref-23">23</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>]. For efficiency quantification, researchers commonly adopt L2-based metrics: although the weighting coefficients remain fixed, the reward values themselves adjust dynamically through the time-varying distance between the robot and the target.</p>
<p>However, suboptimal reward design may cause policy learning to diverge from intended objectives, while the inherent conflict between sparse safety rewards and dense efficiency rewards can induce robot behavior freezing [<xref ref-type="bibr" rid="ref-26">26</xref>]. Furthermore, the hyperparameter combinatorial explosion in multi-objective systems significantly increases policy search dimensionality [<xref ref-type="bibr" rid="ref-27">27</xref>].</p>
<p>To address these challenges, Kim et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] introduced the Transformable Gaussian Reward Function (TGRF), which leverages a Gaussian distribution with three tunable hyperparameters&#x2014;weight, mean, and standard deviation&#x2014;to adjust penalties based on proximity to humans dynamically. The TGRF incorporates normalization to stabilize reward magnitudes across varying standard deviations, enabling adaptable risk-sensitive navigation while reducing hyperparameter redundancy. Despite these advancements, relying on Gaussian-derived exponential operations for distance-based penalties introduces computational overhead, particularly when evaluating dense crowds in real-time scenarios.</p>
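As a rough illustration of this Gaussian-shaped penalty idea, the following is a minimal sketch under our own assumptions, not the authors' TGRF implementation; the function name and hyperparameter values are hypothetical:

```python
import math

def gaussian_penalty(d, weight=1.0, mean=0.0, std=0.5):
    """Gaussian-shaped discomfort penalty over the human-robot distance d.

    Omitting the usual 1/(std*sqrt(2*pi)) prefactor pins the peak penalty
    to `weight` for any std, mirroring the normalization idea of keeping
    reward magnitudes stable across different standard deviations.
    """
    return -weight * math.exp(-((d - mean) ** 2) / (2 * std ** 2))

# Penalty peaks at the mean distance and fades as the robot moves away.
print(gaussian_penalty(0.0))  # -1.0 at the peak
print(gaussian_penalty(2.0))  # near zero two meters away
```

Because the peak is pinned to `weight`, changing `std` only widens or narrows the penalized zone rather than rescaling it, which is the property that reduces hyperparameter coupling.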
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>In the context of autonomous navigation tasks, the navigation problem is typically modeled as a Markov decision process (MDP). This modeling approach enables the utilization of reinforcement learning techniques for path planning and obstacle avoidance. An MDP is typically defined as a quintuple <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mo>&#x27E8;</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>&#x27E9;</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>S</mml:mi></mml:math></inline-formula> denotes the state space, which encompasses information such as the robot&#x2019;s position, speed, and the presence of surrounding pedestrians, and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>A</mml:mi></mml:math></inline-formula> represents the action space, specifying the navigation decisions (e.g., speed and direction adjustments) that the robot can execute at each time step. 
The state transition probability <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> signifies the likelihood of the robot transitioning from state <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> following the execution of an action <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Developing a reward function, denoted by <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, is essential to ensure safety and efficiency in path planning. This function guides the robot&#x2019;s behavior, providing incentives for approaching the goal, penalizing collisions with obstacles and pedestrians, and ensuring smooth navigation through a comfort reward. 
Within the framework of this MDP, the objective of robot navigation is to identify a strategy, denoted by <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, that maximizes the robot&#x2019;s cumulative discounted reward throughout the task. This strategy is the probability distribution of selecting an action <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in a state <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The cumulative discounted reward can be expressed as <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where the discount factor <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03B3;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> determines how strongly future rewards are discounted, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the discount weight at step <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the immediate reward received after performing action <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in state <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
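For concreteness, the discounted return in Eq. (1) can be computed over a sampled episode as follows (a minimal sketch; the reward sequence and discount factor below are illustrative only, not values used in the paper):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R(s_{t+k}, a_{t+k}) for a finite episode."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = R_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative episode: two small step penalties, then a goal reward.
print(discounted_return([-0.1, -0.1, 1.0], gamma=0.9))  # -0.1 + 0.9*(-0.1) + 0.81*1.0 = 0.62
```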
<p>The present paper utilizes a risk field to quantify scene complexity in the environment and to adjust the reward function accordingly. The overall structure of the proposed method, which follows the MDP paradigm, is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. First, the crowd density and pedestrian speeds within the robot&#x2019;s current field of view are evaluated to obtain a risk score for the scene complexity, and the robot&#x2019;s collision reward is scaled according to this score. In the subsequent phase, for a robot entering a dense area, the reward is attenuated according to the action taken by applying an exponential decay function: actions that bring the robot closer to pedestrians receive exponentially increasing negative rewards, guiding the robot to reduce the intrusion time ratio.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Structural diagram of the proposed method, following the MDP paradigm. Gray individuals denote pedestrians outside the robot&#x2019;s field of view; yellow individuals denote pedestrians within the field of view, which are used to calculate the scene&#x2019;s complexity; red individuals denote pedestrians too close to the robot. The blue dashed circle centered on the robot represents the robot&#x2019;s field of view</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-1.tif"/>
</fig>
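The two-stage adjustment described above might be sketched as follows. This is an illustrative sketch of the mechanism only; the function name, the linear risk scaling, the decay constant, and the base penalty are our assumptions, not the paper's exact formulation:

```python
import math

def adaptive_collision_penalty(base_penalty, risk_score, d_min, decay_rate=2.0):
    """Scale a collision penalty by the scene-complexity risk score, then
    amplify it exponentially as the robot's distance to the nearest
    pedestrian shrinks, so approaching actions in dense areas are
    punished much harder than the same actions in sparse areas."""
    # Stage 1: higher scene risk -> proportionally larger penalty magnitude.
    scaled = base_penalty * (1.0 + risk_score)
    # Stage 2: exp(-decay_rate * d_min) grows toward 1 as d_min -> 0,
    # so the penalty magnitude increases exponentially on approach.
    return scaled * math.exp(-decay_rate * d_min)

# Same base penalty, sparse vs. dense scene, robot 0.3 m from the nearest pedestrian.
sparse = adaptive_collision_penalty(-1.0, risk_score=0.2, d_min=0.3)
dense = adaptive_collision_penalty(-1.0, risk_score=2.5, d_min=0.3)
print(sparse, dense)  # the dense-scene penalty is roughly 3x larger in magnitude
```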
<sec id="s3_1">
<label>3.1</label>
<title> Scenario Complexity Modeling Based on Risk Fields</title>
<p>In a dynamic crowd environment, the risk impact of pedestrians at a specific location on a robot is not discrete; rather, it gradually decreases with increasing distance and decreasing speed. Inspired by this, this paper proposes a modeling method based on the risk field, comprehensively considering the three key factors of spatial distance, pedestrian speed, and crowd density. First, the spatial scope of risk propagation can be flexibly adjusted by introducing the parameter <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> to control the attenuation rate in the exponential term. Concurrently, the speed component <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of each pedestrian can be used as a weighting factor to integrate dynamic characteristics into the risk assessment effectively. Pedestrians with higher speeds will generate higher risk values, consistent with the risk distribution characteristics in real scenarios. The risk field modeling method based on the Gaussian kernel has good mathematical continuity and differentiability, which facilitates subsequent path planning optimization and intuitively reflects the risk distribution law in human-robot interaction scenarios.</p>
<p>The present study utilizes a risk field function to model the potential risk of each pedestrian within the robot&#x2019;s field of view. This function incorporates spatial distance and motion characteristics (speed) into the evaluation model (<xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>). The combination of the distance from the pedestrian to the robot <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the pedestrian&#x2019;s speed <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> enables the dynamic evaluation of the relative risk between the robot and the pedestrian. Furthermore, the range of the risk impact can be controlled by adjusting only one parameter, to suit the complexity requirements of different scenarios.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the distance from the robot to the <italic>i</italic>-th pedestrian, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the speed of the <italic>i</italic>-th pedestrian, and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the range factor of the risk field, which controls the decay rate of the risk field intensity with distance. When <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is small, the risk field exhibits a rapid spatial attenuation characteristic. This parameter configuration is suitable for accurately assessing close-range risks in open spaces. 
When <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is large, the risk field has stronger spatial extensibility and can effectively assess potential risks at medium and long distances. This characteristic is fundamental in crowded scenes. As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, different <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> correspond to differentiated risk assessment models.</p>
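Eq. (2) translates directly into code, as in the following minimal sketch (the pedestrian distances and speeds are illustrative only; dividing by exp(d²/2σ²) is written equivalently as multiplying by exp(−d²/2σ²)):

```python
import math

def risk_field(distances, speeds, sigma=1.0):
    """Scene-complexity risk C(sigma) from Eq. (2): each pedestrian i
    contributes v_i * exp(-d_i^2 / (2 * sigma^2)), so the risk grows with
    pedestrian speed and decays with distance at a rate set by sigma."""
    return sum(v * math.exp(-(d ** 2) / (2 * sigma ** 2))
               for d, v in zip(distances, speeds))

# Two pedestrians at 1 m and 3 m, both walking at 1 m/s.
narrow = risk_field([1.0, 3.0], [1.0, 1.0], sigma=0.5)  # dominated by the near pedestrian
wide = risk_field([1.0, 3.0], [1.0, 1.0], sigma=3.0)    # the far pedestrian still contributes
print(narrow, wide)
```

With a small sigma the far pedestrian is effectively invisible to the risk field, while a large sigma lets medium- and long-range interactions raise the score, matching the parameter regimes discussed around Fig. 2.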
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The influence of different &#x03C3; values on complexity calculation. From left to right, the panels show the distribution of scene complexity as a function of pedestrian speed and pedestrian&#x2013;robot distance for <italic>&#x03C3;</italic> &#x003D; 0.5, <italic>&#x03C3;</italic> &#x003D; 1.0, and <italic>&#x03C3;</italic> &#x003D; 3.0</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-2.tif"/>
</fig>
<p>The three-dimensional surface in <xref ref-type="fig" rid="fig-2">Fig. 2</xref> reveals the regulatory mechanism of the parameter &#x03C3; on the complexity of the risk field. In this example, when <italic>&#x03C3;</italic> &#x003D; 0.5, the effective action radius of the risk field shrinks to within 1 m, and its intensity gradient shows a steep attenuation characteristic. This parameter setting is particularly suitable for modeling the close-range risks in high-density scenarios such as subway stations and commercial centers. When <italic>&#x03C3;</italic> &#x003D; 1.0, the risk gradient curve exhibits a smooth transition characteristic, maintaining significant risk perception ability at a moderate distance of 1 to 3 m. This balanced characteristic suits medium-density scenarios such as shopping malls and office areas. When <italic>&#x03C3;</italic> &#x003D; 3.0, the range of action of the risk field extends to more than 3 m, and its slow decay characteristic can accurately capture the potential long-distance interaction risks in low-density scenarios such as open squares and stadiums.</p>
<p><xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the risk field distribution under different crowd densities: in low-density scenarios, the risk field presents discrete, independent peaks and the risk values are generally low (base speed 0.5 m/s), providing the robot with flexible navigation space; in high-density scenarios, crowd behaviors such as walking side by side cause risk superposition that forms significant localized high-risk areas, forcing the robot to adopt conservative strategies such as decelerating and increasing the avoidance distance. This dynamic risk field drives the robot&#x2019;s navigation strategy to shift from proactive to conservative.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>A schematic diagram of the risk field under different population densities. The left side of the diagram shows a sparse pedestrian scene, and the right side shows a dense pedestrian scene. The red dots are robots, and the blue dots are pedestrians. The darker the red around the pedestrians, the higher the risk</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-3.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title> Design of Reward Functions</title>
<p>Unlike previous research [<xref ref-type="bibr" rid="ref-11">11</xref>], this paper explicitly designs a scenario complexity score to adjust the collision penalty in different pedestrian density scenarios, prompting the robot to take more cautious actions to maintain appropriate social distance in high-density scenarios. When the robot acts according to the learned strategy, the reward or penalty obtained will be adjusted according to the complexity value <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, that is, the penalty <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> for the robot colliding in a crowded scene will be increased, as in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>10</mml:mn><mml:mo>&#x22C5;</mml:mo><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>In addition, this paper also designs an exponential decay reward mechanism to modulate the reward for dangerous areas, as shown in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>. When the robot enters a dangerous area (<inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> within <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>) determined by the nearest human distance (denoted as <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula>), it will be punished by <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>. To make the reward function adaptive and able to reflect environmental changes dynamically, this section combines the scene complexity in <xref ref-type="sec" rid="s3_1">Section 3.1</xref> to design a reward for dangerous areas that measures the risk of pedestrian distribution in the current scene.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>disc</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>disc</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is designed to prevent the robot from colliding with humans in dense scenes and follows an exponential decay law. Since the sensitivity to distance should be greater than that to scene complexity, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is down-weighted here.</p>
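<p>A minimal sketch of the two penalty terms in Eqs. (3) and (4); the function names and the argument values are illustrative, not taken from the paper&#x2019;s implementation.</p>

```python
import math

def collision_penalty(c_sigma):
    # Eq. (3): collision penalty amplified by scene complexity C(sigma)
    return -10.0 * c_sigma

def danger_zone_reward(c_sigma, d_min, lam):
    # Eq. (4): C(sigma)/2 * exp(1 - d_min * lambda); C(sigma) is halved so
    # the nearest-human distance d_min dominates the scene complexity.
    return 0.5 * c_sigma * math.exp(1.0 - d_min * lam)

# The danger-zone term decays exponentially as the robot moves away from people.
near = danger_zone_reward(c_sigma=2.0, d_min=0.5, lam=0.1)
far = danger_zone_reward(c_sigma=2.0, d_min=3.0, lam=0.1)
```

Note how the same complexity score feeds both terms: crowded scenes simultaneously raise the collision penalty and the danger-zone magnitude.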
<p>The exponential decay mechanism in this paper only uses one hyperparameter to adjust the reward effect. The researchers can control the sensitivity of the reward decay to the distance by adjusting the value of the decay rate <italic>&#x03BB;</italic>. In addition, this paper follows the definitions of the punishment for future trajectory conflicts between robots and pedestrians and the potential field reward from previous work [<xref ref-type="bibr" rid="ref-9">9</xref>].
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1,</mml:mtext></mml:mrow><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mfrac><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="0pt" /><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext 
mathvariant="bold">1,</mml:mtext></mml:mrow><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:munder><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>When using the trajectory prediction model, <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is calculated as a penalty term, as in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>. <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the potential risk of a collision between the robot and the pedestrian&#x2019;s future trajectory. The indicator <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msubsup><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> indicates whether the robot will enter the predicted position of the <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>i</mml:mi></mml:math></inline-formula>-th pedestrian at time <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>.
The value of <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the minimum (most negative) penalty over all potential conflicts, representing the most severe collision risk faced by the robot.</p>
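<p>Eq. (5) can be sketched as follows, assuming the trajectory-prediction model has already produced conflict indicators: <monospace>conflict_flags[i][k-1]</monospace> plays the role of the indicator for pedestrian <italic>i</italic> at step <italic>t</italic> &#x002B; <italic>k</italic>. Names are illustrative.</p>

```python
def predicted_trajectory_penalty(conflict_flags, r_col, horizon):
    """Eq. (5): per-pedestrian worst conflict over the prediction horizon,
    then the worst pedestrian overall. Because r_col is negative, min()
    selects the most severe penalty, discounted by 2^k for later steps."""
    per_pedestrian = [
        min(flags[k - 1] * r_col / (2 ** k) for k in range(1, horizon + 1))
        for flags in conflict_flags
    ]
    return min(per_pedestrian)

# Two pedestrians, two predicted steps: only pedestrian 0 conflicts, at step k=2,
# so the penalty is r_col / 2^2 = -2.5.
penalty = predicted_trajectory_penalty([[0, 1], [0, 0]], r_col=-10.0, horizon=2)
```

The 1/2^k discount means a conflict predicted one step ahead weighs twice as much as one predicted two steps ahead, so imminent conflicts dominate the penalty.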
<p>The potential field reward <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pot</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> rewards the robot for approaching the target, as in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>, where <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>o</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the L2 distance between the robot position and the target position at a given time <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>t</mml:mi></mml:math></inline-formula>.
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pot</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>o</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>o</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p>Finally, the reward function defined in this paper is as in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>r</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>10</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mtext>goal</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mtext>collision</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>disc</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi
>S</mml:mi><mml:mrow><mml:mrow><mml:mtext>confined zone.</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pred</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>pot</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
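<p>Putting the terms together, the piecewise reward of Eq. (7) reads as follows; the state labels are illustrative stand-ins for membership in the goal, collision, and confined-zone state sets.</p>

```python
def total_reward(state, r_col, r_pred, r_disc, r_pot):
    # Eq. (7): the reward received at state s_t, given the precomputed terms.
    if state == "goal":            # s_t in S_goal
        return 10.0
    if state == "collision":       # s_t in S_collision
        return r_col
    if state == "confined_zone":   # s_t in S_confined zone
        return r_pred + r_disc
    return r_pred + r_pot          # otherwise: free-space progress

# Hypothetical values: inside the confined zone the predicted-conflict penalty
# and the danger-zone term are summed.
r = total_reward("confined_zone", r_col=-20.0, r_pred=-2.5, r_disc=1.5, r_pot=0.3)
```
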
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments and Results</title>
<p>This section describes this paper&#x2019;s simulation environment, experimental setup, and results. We tested baseline models without the proposed method and compared them with the latest research and with the same models augmented by our method. We also compared navigation performance at different crowd densities and varied the two hyperparameters of our method to explore their impact on the navigation strategy.</p>
<sec id="s4_1">
<label>4.1</label>
<title> Experimental Environment</title>
<p>As in the previous work [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-27">27</xref>], we used the CrowdSim framework for all simulation experiments. CrowdSim is an open-source 2D crowd navigation simulator based on OpenAI Gym, obtained from the GitHub repository released with Liu et al.&#x2019;s work [<xref ref-type="bibr" rid="ref-9">9</xref>]. This environment comprises a 12 m &#x00D7; 12 m planar workspace, where the robot and pedestrians are modeled as circular agents with collision radii. The robot perceives its surroundings through a 360&#x00B0; field of view (FOV) and a lidar sensor with a detection range of 5 m. Pedestrians follow the ORCA (Optimal Reciprocal Collision Avoidance) algorithm for collision avoidance, while the robot is invisible to pedestrians to simulate unidirectional interaction.</p>
<p>The robots&#x2019; and pedestrians&#x2019; starting and target positions are randomly generated in the 2D plane. Upon reaching their destinations, pedestrians are dynamically reassigned to new random targets, ensuring continuous movement patterns. A medium-density scenario with 20 pedestrians (density <italic>&#x03C1;</italic> &#x003D; 0.15 persons/m<sup>2</sup>) is adopted for model training.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Regarding the kinematic model, this paper uses the overall kinematic equation (<xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>) to update the robot&#x2019;s and pedestrians&#x2019; positions. At each time step <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>t</mml:mi></mml:math></inline-formula>, the movement of each agent is represented by the desired velocity <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> in the <italic>x</italic>-axis and <italic>y</italic>-axis, and both the robot and the human can reach the desired velocity immediately within the time frame &#x0394;<italic>t</italic>. The robot employs a continuous action space with a maximum speed of 1 m/s, consistent with real-world service robots. A collision radius of 0.3 m constrains the robot&#x2019;s motion, while pedestrians have radii ranging from 0.3 to 0.5 m and speeds between 0.5 and 1.5 m/s.</p>
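<p>The holonomic update of Eq. (8) is a one-liner per axis; the sketch below also clamps the commanded velocity to the 1 m/s limit described above, which is our illustrative reading rather than the simulator&#x2019;s exact code.</p>

```python
import math

def step_position(px, py, vx, vy, dt, v_max=1.0):
    # Eq. (8): agents reach the commanded velocity instantly within dt.
    speed = math.hypot(vx, vy)
    if speed > v_max:              # clamp to the robot's 1 m/s speed limit
        vx, vy = vx * v_max / speed, vy * v_max / speed
    return px + vx * dt, py + vy * dt

# A commanded 2 m/s along x is clamped to 1 m/s before integrating.
px, py = step_position(0.0, 0.0, 2.0, 0.0, dt=0.25)
```
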
<p>During training, the robot&#x2019;s and pedestrians&#x2019; initial positions are regenerated at the start of each new episode by invoking the environment&#x2019;s reset method. This ensures diverse training scenarios through randomized configurations, where each episode begins with a unique layout determined by a fixed random seed and predefined parameters. The visibility between agents is determined solely by the two-dimensional field of view (FOV) and distance thresholds. Specifically, a pedestrian or robot is considered visible if it lies within another agent&#x2019;s FOV cone and a maximum detection range (5 m), regardless of potential occlusions by other agents along the line of sight. This simplified perception model resembles a third-person perspective rather than simulating physical volume-based occlusions in three-dimensional space.</p>
<p>The Proximal Policy Optimization (PPO) algorithm was implemented with a discount factor <italic>&#x03B3;</italic> &#x003D; 0.99, a learning rate of 4e&#x2212;5, and a clip parameter of 0.2 across 16 parallel environments, while the risk field used a spatial decay <italic>&#x03C3;</italic> &#x003D; 8 and an exponential reward decay <italic>&#x03BB;</italic> &#x003D; 0.1. The experiment was conducted on a workstation with a GeForce GTX TITAN GPU and an AMD Ryzen 3990X CPU. A total of 20,820 training iterations were performed, and the model achieving the highest average reward was selected for testing.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title> Relevant Evaluation Indicators</title>
<p>In terms of evaluation methodology, this study assesses all approaches using 500 randomized test cases and evaluates their performance through navigation and social awareness metrics, consistent with prior research [<xref ref-type="bibr" rid="ref-9">9</xref>]. Navigation metrics quantify pathfinding quality through three key indicators: success rate (SR), average navigation time (NT, in seconds), and mean path length (PL, in meters) across successful cases. Social metrics analyze robotic social compliance through two primary measures: the intrusion-to-time ratio (ITR) and mean social distance (SD, in meters) at intrusion instances. ITR represents the temporal proportion during which the robot violates pedestrian spaces across all test scenarios. During intrusion events, SD is computed as the average minimum distance between the robot and surrounding pedestrians. All intrusion determinations utilize ground-truth pedestrian trajectory data from subsequent timesteps to maintain comparative validity.</p>
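<p>The two social metrics can be sketched from a per-step log of the robot&#x2019;s distance to its nearest pedestrian; the intrusion radius and the logged distances below are hypothetical values for illustration.</p>

```python
def social_metrics(nearest_distances, intrusion_radius):
    """ITR: percentage of timesteps whose nearest-pedestrian distance falls
    inside the intrusion radius. SD: mean of those distances during intrusions."""
    intrusions = [d for d in nearest_distances if intrusion_radius > d]
    itr = 100.0 * len(intrusions) / len(nearest_distances)
    sd = sum(intrusions) / len(intrusions) if intrusions else float("nan")
    return itr, sd

# Four timesteps, two of which intrude within a 0.45 m radius.
itr, sd = social_metrics([0.2, 0.6, 0.3, 1.0], intrusion_radius=0.45)
```
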
</sec>
<sec id="s4_3">
<label>4.3</label>
<title> Results</title>
<p>Experiments were conducted with a fixed random seed for environment initialization and policy training to mitigate training stochasticity. The reported results are averaged across 500 test cases to ensure statistical reliability.</p>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Experimental Results of Reward Mechanism Comparison</title>
<p>To comprehensively evaluate the performance advantages of the proposed method, we conducted systematic comparative experiments with five robot navigation strategies. The baseline methods include DS-RNN [<xref ref-type="bibr" rid="ref-6">6</xref>], three attention-based variants (Const vel &#x002B; HH Attn, Truth &#x002B; HH Attn, GST &#x002B; HH Attn [<xref ref-type="bibr" rid="ref-9">9</xref>]), and TGRF [<xref ref-type="bibr" rid="ref-11">11</xref>]. DS-RNN is a model that uses an RNN but includes neither pedestrian trajectory prediction nor a self-attention mechanism. The baselines with both pedestrian trajectory prediction and self-attention are Const vel &#x002B; HH Attn (which assumes that pedestrians move at a constant speed for trajectory prediction), Truth &#x002B; HH Attn (which assumes that the robot can obtain the true future trajectory of the pedestrian), GST &#x002B; HH Attn (which uses the GST model for nonlinear trajectory prediction), and TGRF (which performs reward adjustment based on the transformable Gaussian reward function). In contrast, our method introduces dynamic risk field modeling and adaptive exponential decay rewards. This design enables real-time prioritization of safety in dense crowds (via risk score amplification) and efficiency in sparse regions (via exponential decay suppression), addressing the rigidity of fixed-weight approaches.</p>
<p><xref ref-type="table" rid="table-1">Table 1</xref> compares various models&#x2019; performance when implementing our proposed risk field modeling and exponential decay reward method under ORCA-governed pedestrian dynamics. The hyperparameters were configured with <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>8.0</mml:mn></mml:math></inline-formula> (risk diffusion coefficient) and <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> (decay factor). Both quantitative analysis from <xref ref-type="table" rid="table-1">Table 1</xref> and qualitative visualization in <xref ref-type="fig" rid="fig-4">Fig. 4</xref> reveal three significant improvements attributable to our risk field model implementation.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Performance comparison of navigation methods with different reward mechanisms (pedestrians follow ORCA policy, red data represents the best results)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th>SR (%)&#x2191;</th>
<th>NT (s)&#x2193;</th>
<th>PL (m)&#x2193;</th>
<th>ITR (%)&#x2193;</th>
<th>SD&#x2191;</th>
</tr>
</thead>
<tbody>
<tr>
<td>DS-RNN</td>
<td>70.0</td>
<td>17.66</td>
<td>21.81</td>
<td>11.52</td>
<td>0.38</td>
</tr>
<tr>
<td>Const vel &#x002B; HH Attn</td>
<td>79.0</td>
<td>23.48</td>
<td>28.70</td>
<td>3.74</td>
<td>0.43</td>
</tr>
<tr>
<td>Truth &#x002B; HH Attn</td>
<td>93.0</td>
<td>19.68</td>
<td>25.40</td>
<td>2.45</td>
<td>0.44</td>
</tr>
<tr>
<td>GST &#x002B; HH Attn</td>
<td>93.0</td>
<td>16.33</td>
<td>22.31</td>
<td>4.67</td>
<td>0.44</td>
</tr>
<tr>
<td>TGRF</td>
<td>95.0</td>
<td>18.49</td>
<td>24.25</td>
<td>4.36</td>
<td>0.43</td>
</tr>
<tr>
<td>DS-RNN With Ours</td>
<td>71.0</td>
<td>20.76</td>
<td>22.71</td>
<td>9.54</td>
<td>0.38</td>
</tr>
<tr>
<td>Const vel &#x002B; HH Attn With Ours</td>
<td>92.0</td>
<td>16.98</td>
<td>22.66</td>
<td>5.82</td>
<td>0.41</td>
</tr>
<tr>
<td>Truth &#x002B; HH Attn With Ours</td>
<td>96.0</td>
<td>19.89</td>
<td>26.15</td>
<td>1.99</td>
<td>0.45</td>
</tr>
<tr>
<td>GST &#x002B; HH Attn With Ours</td>
<td>97.0</td>
<td>18.44</td>
<td>24.38</td>
<td>2.94</td>
<td>0.45</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Comparison of robot strategies in a simulated environment. The yellow circles represent the robot, the blue circles represent humans within the sensor range, the red circles represent humans outside the sensor range, and the orange circles in front of the blue circles indicate the predicted trajectory of the GST &#x002B; HH Attn model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-4.tif"/>
</fig>
<p>Firstly, the proposed methodology demonstrates substantial improvements in navigation safety metrics. Regarding success rate (SR), the GST With Ours method attains a success rate of 97.0%, 4% higher than the benchmark GST &#x002B; HH Attn method (93%) and 2% higher than the TGRF method&#x2019;s 95.0%. This finding signifies that the GST With Ours method demonstrates enhanced reliability in accomplishing navigation tasks. Concurrently, the intrusion-to-time ratio (ITR) has undergone a substantial reduction. The ITR of the GST &#x002B; HH Attn With Ours method is 2.94%, considerably lower than the 4.36% of the TGRF method, and the time of intruding into the crowd has been reduced by 32.56%. This superiority stems from the adaptive reward mechanism: the exponential decay function amplifies collision penalties in dense crowds while suppressing inefficiency penalties in sparse regions. Unlike fixed-weight methods, which rigidly balance safety and efficiency, our approach adapts weights to real-time risk levels. For instance, in high-density scenarios (<xref ref-type="fig" rid="fig-4">Fig. 4b</xref>), the exponential decay mechanism imposes exponentially increasing penalties as the robot approaches pedestrians, forcing proactive detours. Conversely, in low-density scenarios, reduced penalties allow faster navigation without compromising safety.</p>

<p>Secondly, while prioritizing safety, the method maintains competitive navigation efficiency despite inherent trade-offs. Given the need to navigate congested areas cautiously, this approach increases navigation time (NT) and path length (PL), but the increase remains within an acceptable range. Compared with the baseline GST &#x002B; HH Attn method, navigation time (NT) increased from 16.33 to 18.44 s (an increase of 12.9%), and path length (PL) increased from 22.31 to 24.38 m (an increase of 9.3%). Notably, both indicators still exhibit superiority over conventional methodologies such as the DS-RNN approach, which recorded 20.76 s and 22.71 m, respectively.</p>
<p>Finally, the approach delineated in this paper enhances the robot&#x2019;s comprehension of crowd density, thereby reducing the incidence of collisions. As illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4a</xref>, robots that do not employ this method frequently exhibit aggressive navigation, characterized by sudden movements into crowds and subsequent collisions with pedestrians. This behavior signifies an inability to comprehend pedestrian intentions and to balance reward functions. In contrast, <xref ref-type="fig" rid="fig-4">Fig. 4b</xref> demonstrates the robot&#x2019;s enhanced performance when utilizing the proposed method, which anticipates pedestrian congregation and proactively avoids dense areas. This enhanced navigation facilitates safer and more socially acceptable movement while ensuring efficient progress toward the destination.</p>

</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Results of the Crowd Density Adaptation Experiment</title>
<p>This paper proposes a crowd density gradient test to compare the models&#x2019; generalization ability. The basic model, trained with <italic>N</italic> &#x003D; 20 pedestrians (corresponding to a density of <italic>&#x03C1;</italic> &#x003D; 0.15 persons/m<sup>2</sup>), is used as the test object. Two extreme scenario groups are constructed: low density (<italic>N</italic> &#x003D; 10/15, <italic>&#x03C1;</italic> &#x003D; 0.07/0.11 persons/m<sup>2</sup>) and high density (<italic>N</italic> &#x003D; 25/30, <italic>&#x03C1;</italic> &#x003D; 0.21/0.25 persons/m<sup>2</sup>). These scenarios are then compared with the GST &#x002B; HH Attn and TGRF models in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>

<p>The risk field and exponential decay mechanisms show clear advantages in environmental adaptation under high-density scenarios. As shown in <xref ref-type="table" rid="table-2">Table 2</xref>, in the extreme scenario of <italic>&#x03C1;</italic> &#x003D; 0.21, our method improves the success rate (SR) by 7.0 percentage points over the GST &#x002B; HH Attn model (82.0% vs. 75.0%) and by 9.0 points over the TGRF model (82.0% vs. 73.0%). Meanwhile, the intrusion-to-time ratio (ITR) decreases by 7.2% relative to the GST &#x002B; HH Attn model (6.74% vs. 7.26%) and by 10.7% relative to the TGRF model (6.74% vs. 7.55%). This shows that in complex high-density environments the proposed model more effectively identifies and avoids potential collision risks, improving the safety and reliability of navigation. In contrast, TGRF employs a transformable Gaussian reward function but relies on fixed weights, which fail to prioritize safety in dense scenarios (ITR &#x003D; 7.55% at <italic>&#x03C1;</italic> &#x003D; 0.21).</p>
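The figures quoted above mix two conventions: success-rate gains are absolute differences in percentage points, while ITR reductions are relative to the baseline's value. A short sketch making the arithmetic explicit, using the Table 2 numbers at ρ = 0.21:

```python
def points_gain(ours: float, baseline: float) -> float:
    """Absolute success-rate gain in percentage points."""
    return round(ours - baseline, 1)

def relative_drop(ours: float, baseline: float) -> float:
    """Relative reduction (%) of a lower-is-better metric vs. a baseline."""
    return round(100.0 * (baseline - ours) / baseline, 1)

print(points_gain(82.0, 75.0))    # 7.0  (SR vs. GST + HH Attn)
print(points_gain(82.0, 73.0))    # 9.0  (SR vs. TGRF)
print(relative_drop(6.74, 7.26))  # 7.2  (ITR vs. GST + HH Attn)
print(relative_drop(6.74, 7.55))  # 10.7 (ITR vs. TGRF)
```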
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison results of model generalization under different population densities (Red data represents the best results)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th><italic>&#x03C1;</italic> (persons/m<sup><bold>2</bold></sup>)&#x002A;</th>
<th>Methods</th>
<th>SR (%)&#x2191;</th>
<th>NT (s)&#x2193;</th>
<th>PL (m)&#x2193;</th>
<th>ITR (%)&#x2193;</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">0.21 (30 pedestrians)</td>
<td>GST &#x002B; HH Attn</td>
<td>75.0</td>
<td>18.99</td>
<td>22.44</td>
<td>7.26</td>
</tr>
<tr>
<td>TGRF</td>
<td>73.0</td>
<td>21.65</td>
<td>24.69</td>
<td>7.55</td>
</tr>
<tr>
<td>Ours</td>
<td>82.0</td>
<td>20.83</td>
<td>25.48</td>
<td>6.74</td>
</tr>
<tr>
<td rowspan="3">0.17 (25 pedestrians)</td>
<td>GST &#x002B; HH Attn</td>
<td>84.0</td>
<td>17.89</td>
<td>22.50</td>
<td>5.54</td>
</tr>
<tr>
<td>TGRF</td>
<td>86.0</td>
<td>20.29</td>
<td>25.29</td>
<td>5.12</td>
</tr>
<tr>
<td>Ours</td>
<td>95.0</td>
<td>19.70</td>
<td>25.45</td>
<td>3.51</td>
</tr>
<tr>
<td rowspan="3">0.14 (20 pedestrians)</td>
<td>GST &#x002B; HH Attn</td>
<td>93.0</td>
<td>16.33</td>
<td>22.31</td>
<td>4.67</td>
</tr>
<tr>
<td>TGRF</td>
<td>95.0</td>
<td>18.49</td>
<td>24.25</td>
<td>4.36</td>
</tr>
<tr>
<td>Ours</td>
<td>97.0</td>
<td>18.44</td>
<td>24.38</td>
<td>2.94</td>
</tr>
<tr>
<td rowspan="3">0.10 (15 pedestrians)</td>
<td>GST &#x002B; HH Attn</td>
<td>96.0</td>
<td>14.92</td>
<td>21.37</td>
<td>3.26</td>
</tr>
<tr>
<td>TGRF</td>
<td>97.0</td>
<td>17.50</td>
<td>23.75</td>
<td>3.23</td>
</tr>
<tr>
<td>Ours</td>
<td>98.0</td>
<td>16.67</td>
<td>22.89</td>
<td>1.98</td>
</tr>
<tr>
<td rowspan="3">0.07 (10 pedestrians)</td>
<td>GST &#x002B; HH Attn</td>
<td>98.0</td>
<td>13.78</td>
<td>20.29</td>
<td>2.13</td>
</tr>
<tr>
<td>TGRF</td>
<td>98.0</td>
<td>15.73</td>
<td>22.24</td>
<td>1.84</td>
</tr>
<tr>
<td>Ours</td>
<td>100.0</td>
<td>15.22</td>
<td>21.43</td>
<td>1.11</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-2fn1" fn-type="other">
<p>Note: &#x002A;The different <italic>&#x03C1;</italic> values are obtained by adjusting the number of pedestrians while keeping the area of the simulation environment constant; each density is computed as the number of pedestrians divided by the area of the simulation environment.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The superiority of our method stems from its adaptive reward mechanism. Unlike fixed-weight approaches, the exponential decay function imposes nonlinearly increasing penalties as the robot approaches pedestrians (<xref ref-type="fig" rid="fig-4">Fig. 4b</xref>). This forces proactive detours in high-density scenarios while allowing efficient navigation in sparse regions. Mathematically, the penalty term <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>col</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> scales with real-time risk scores <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, dynamically amplifying safety constraints when crowd density increases. This contrasts with TGRF&#x2019;s static Gaussian formulation, which cannot adjust penalty intensity based on spatiotemporal risk levels.</p>
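As a hedged illustration (not the paper's exact equation), the mechanism described above can be sketched as a penalty whose magnitude grows exponentially as the robot-pedestrian distance shrinks, scaled by the current risk score C(σ). The base weight r0, safe distance d_safe, and the specific functional form below are assumptions for illustration:

```python
import math

def collision_penalty(distance: float, risk_score: float,
                      r0: float = 1.0, lam: float = 0.1,
                      d_safe: float = 0.5) -> float:
    """Illustrative exponential-decay penalty (assumed form): magnitude
    increases nonlinearly as the robot approaches a pedestrian and
    scales linearly with the real-time risk score C(sigma)."""
    gap = max(distance - d_safe, 0.0)  # distance beyond the safety margin
    return -r0 * risk_score * math.exp(-lam * gap)

# Closer pedestrians and higher risk scores both strengthen the penalty,
# which is what drives proactive detours in dense crowds.
print(collision_penalty(1.0, 1.0), collision_penalty(5.0, 1.0))
```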

<p>The method shows a more substantial performance advantage as the density of the environment decreases. In the low-density scenario of <italic>&#x03C1;</italic> &#x003D; 0.07, the navigation success rate rises to 100%, exceeding both benchmark methods, GST &#x002B; HH Attn (98%) and TGRF (98%). At the same time, the intrusion-to-time ratio (ITR) decreases to 1.11%, a 47.9% reduction compared to the benchmark method GST &#x002B; HH Attn (2.13%) and a 39.7% reduction compared to the TGRF model (1.84%).</p>
<p>This result shows that the method in this paper performs well in low-density environments and has better generalization ability than the GST &#x002B; HH Attn and TGRF models in higher-density environments. It can efficiently complete navigation tasks and maintain a low intrusion rate when interacting with pedestrians.</p>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Model Hyperparameter Analysis</title>
<p>Hyperparameter selection affects both the training process and the final performance of the model. To analyze this influence in depth, we quantitatively evaluate the synergistic effect of the risk field range coefficient (<inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula>) and the exponential decay rate (<italic>&#x03BB;</italic>) on navigation performance through a grid experiment. The results are shown in <xref ref-type="table" rid="table-3">Table 3</xref>. The experimental design covers all 16 parameter combinations of <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> &#x2208; {2, 6, 8, 10} and <italic>&#x03BB;</italic> &#x2208; {0.005, 0.05, 0.1, 0.2}, and the mechanism of hyperparameter action is analyzed along three dimensions: success rate, navigation efficiency (navigation time and path length), and safety (intrusion time ratio and social distance).</p>
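The 16 configurations form a straightforward Cartesian product of the two value sets; this sketch only reproduces the experimental grid, not the training runs themselves:

```python
from itertools import product

SIGMAS = (2, 6, 8, 10)             # risk field range coefficient sigma
LAMBDAS = (0.005, 0.05, 0.1, 0.2)  # exponential decay rate lambda

# Every (sigma, lambda) pair evaluated in Table 3.
grid = list(product(SIGMAS, LAMBDAS))
print(len(grid))  # 16
print(grid[0])    # (2, 0.005)
```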
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Navigation performance of models under different hyperparameter configurations. (Red data represents the best results)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th><italic><bold>&#x03C3;</bold></italic></th>
<th><italic><bold>&#x03BB;</bold></italic></th>
<th>SR (%)<bold>&#x2191;</bold></th>
<th>NT (s)<bold>&#x2193;</bold></th>
<th>PL (m)<bold>&#x2193;</bold></th>
<th>ITR (%)<bold>&#x2193;</bold></th>
<th>SD<bold>&#x2191;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">2</td>
<td>0.005</td>
<td>81</td>
<td>15.08</td>
<td>20.36</td>
<td>6.52</td>
<td>0.42</td>
</tr>
<tr>
<td>0.05</td>
<td>91</td>
<td>16.78</td>
<td>22.56</td>
<td>5.82</td>
<td>0.41</td>
</tr>
<tr>
<td>0.1</td>
<td>84</td>
<td>21.00</td>
<td>25.62</td>
<td>5.10</td>
<td>0.43</td>
</tr>
<tr>
<td>0.2</td>
<td>86</td>
<td>22.42</td>
<td>27.46</td>
<td>3.15</td>
<td>0.45</td>
</tr>
<tr>
<td rowspan="4">6</td>
<td>0.005</td>
<td>92</td>
<td>15.45</td>
<td>21.54</td>
<td>6.88</td>
<td>0.43</td>
</tr>
<tr>
<td>0.05</td>
<td>79</td>
<td>16.40</td>
<td>21.10</td>
<td>8.83</td>
<td>0.40</td>
</tr>
<tr>
<td>0.1</td>
<td>94</td>
<td>17.12</td>
<td>23.25</td>
<td>5.25</td>
<td>0.43</td>
</tr>
<tr>
<td>0.2</td>
<td>4</td>
<td>30.30</td>
<td>22.21</td>
<td>10.93</td>
<td>0.39</td>
</tr>
<tr>
<td rowspan="4">8</td>
<td>0.005</td>
<td>87</td>
<td>15.75</td>
<td>21.16</td>
<td>8.25</td>
<td>0.41</td>
</tr>
<tr>
<td>0.05</td>
<td>24</td>
<td>25.86</td>
<td>29.62</td>
<td>9.83</td>
<td>0.40</td>
</tr>
<tr>
<td>0.1</td>
<td>93</td>
<td>18.58</td>
<td>24.83</td>
<td>3.74</td>
<td>0.44</td>
</tr>
<tr>
<td>0.2</td>
<td>68</td>
<td>27.44</td>
<td>31.00</td>
<td>4.48</td>
<td>0.40</td>
</tr>
<tr>
<td rowspan="4">10</td>
<td>0.005</td>
<td>84</td>
<td>14.79</td>
<td>20.72</td>
<td>8.60</td>
<td>0.40</td>
</tr>
<tr>
<td>0.05</td>
<td>86</td>
<td>16.55</td>
<td>21.74</td>
<td>9.61</td>
<td>0.40</td>
</tr>
<tr>
<td>0.1</td>
<td>14</td>
<td>26.54</td>
<td>26.88</td>
<td>8.63</td>
<td>0.39</td>
</tr>
<tr>
<td>0.2</td>
<td>7</td>
<td>20.60</td>
<td>23.60</td>
<td>11.86</td>
<td>0.38</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> illustrates the impact of <italic>&#x03BB;</italic> (0.005, 0.05, 0.1, 0.2) on loss and reward curves with <italic>&#x03C3;</italic> &#x003D; 8. <italic>&#x03BB;</italic> &#x003D; 0.005 accelerates early optimization through fine-grained perception, leading to rapid loss reduction (<xref ref-type="fig" rid="fig-5">Fig. 5a</xref>) and stable reward convergence (<xref ref-type="fig" rid="fig-5">Fig. 5b</xref>), effectively suppressing policy oscillations. However, <italic>&#x03BB;</italic> &#x003D; 0.05 shows a non-monotonic reward decline, suggesting a suboptimal attractor. <italic>&#x03BB;</italic> &#x003D; 0.2 excessively smooths rewards, slowing early convergence and delaying reward growth until 15,000 iterations. <italic>&#x03BB;</italic> &#x003D; 0.1 achieves the best balance, ensuring smooth loss convergence and a stable reward near 20, moderating the exploration-exploitation trade-off.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Training loss curve and average reward curve corresponding to different values of <italic>&#x03BB;</italic> when <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-5.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> illustrates the effect of <italic>&#x03C3;</italic> (2, 6, 8, 10) on loss and reward curves with <italic>&#x03BB;</italic> &#x003D; 0.1. A narrow perception range (<italic>&#x03C3;</italic> &#x003D; 2) quickly attenuates nearby risks, leading to rapid loss reduction (<xref ref-type="fig" rid="fig-6">Fig. 6a</xref>), but makes the reward highly sensitive to disturbances, causing a sharp drop from 25 to &#x2212;45 after 8000 iterations (<xref ref-type="fig" rid="fig-6">Fig. 6b</xref>). Increasing <italic>&#x03C3;</italic> to 6 balances local and global risks, ensuring gradual loss convergence and stable rewards around 20. At <italic>&#x03C3;</italic> &#x003D; 8, the model maintains stability while optimizing path length and invasion time. However, <italic>&#x03C3;</italic> &#x003D; 10 causes state space explosion, blurring risk boundaries, and stagnating rewards at &#x2212;8 (<xref ref-type="fig" rid="fig-6">Fig. 6b</xref>). <xref ref-type="table" rid="table-3">Table 3</xref> confirms that <italic>&#x03C3;</italic> &#x003D; 6 and <italic>&#x03BB;</italic> &#x003D; 0.1 achieve optimal success (94%), while <italic>&#x03C3;</italic> &#x003D; 10 leads to decision confusion and a drop in success rate to 14%.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Training loss curve and average reward curve corresponding to different <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> values when <italic>&#x03BB;</italic> &#x003D; 0.1</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65205-fig-6.tif"/>
</fig>
<p>Hyperparameter tuning must align with environmental dynamics. As shown in <xref ref-type="table" rid="table-3">Table 3</xref>, <italic>&#x03C3;</italic> &#x003D; 6 and <italic>&#x03BB;</italic> &#x003D; 0.1 achieve optimal balance (SR &#x003D; 94%, ITR &#x003D; 5.25%) in medium-to-high densities (<italic>&#x03C1;</italic> &#x2265; 0.15). Here, <italic>&#x03C3;</italic> &#x003D; 6 ensures moderate risk coverage, while <italic>&#x03BB;</italic> &#x003D; 0.1 stabilizes reward convergence (<xref ref-type="fig" rid="fig-5">Fig. 5</xref>). In contrast, extreme parameters (e.g., <italic>&#x03C3;</italic> &#x003D; 10, <italic>&#x03BB;</italic> &#x003D; 0.2) cause decision confusion (SR &#x003D; 7%), as excessive risk field ranges blur critical boundaries.</p>

<p>The training curves in <xref ref-type="fig" rid="fig-5">Figs. 5</xref> and <xref ref-type="fig" rid="fig-6">6</xref> further illustrate behavioral implications. For <italic>&#x03BB;</italic> &#x003D; 0.1, smooth loss reduction (<xref ref-type="fig" rid="fig-5">Fig. 5a</xref>) correlates with gradual learning of socially compliant paths (<xref ref-type="fig" rid="fig-4">Fig. 4b</xref>), whereas <italic>&#x03BB;</italic> &#x003D; 0.005&#x2019;s rapid convergence may lead to overly conservative strategies. Similarly, <italic>&#x03C3;</italic> &#x003D; 8&#x2019;s stable reward curve (<xref ref-type="fig" rid="fig-6">Fig. 6b</xref>) reflects the balanced perception of local and global risks, enabling proactive detours in crowded zones.</p>

<p>In light of the results above, the following hyperparameter tuning principles are recommended. In low-density scenarios (<inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0.10</mml:mn></mml:math></inline-formula> people/m<sup>2</sup>), employ <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.05</mml:mn></mml:math></inline-formula> to enhance efficiency through local perception, achieving a success rate of 91% and a path length of 22.56 m; the somewhat higher intrusion time ratio (ITR &#x003D; 5.82%) is acceptable given the sparse pedestrian population. In medium-to-high-density scenarios (<italic>&#x03C1;</italic> &#x2265; 0.15 people/m<sup>2</sup>), select <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>6</mml:mn></mml:math></inline-formula> (or <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>&#x03C3;</mml:mi><mml:mo>=</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula>) and <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> to balance risk coverage against the reward decay rate, trading off safety and efficiency (success rate 93%&#x2013;94%, ITR &#x2264; 5.25%). Extreme parameter combinations must be avoided, as they lead to policy instability or convergence failure.</p>
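The tuning principles above can be condensed into a simple selection rule. Note that the fallback for the intermediate band 0.10 ≤ ρ &lt; 0.15, which the text does not prescribe, is an assumed default here (reusing the medium-to-high-density setting):

```python
def choose_hyperparams(rho: float) -> tuple:
    """Map crowd density (people/m^2) to (sigma, lambda) following the
    recommendations in Section 4.3.3. The intermediate-density branch
    is an assumed default, not prescribed by the paper."""
    if rho < 0.10:
        return (2, 0.05)  # sparse crowds: local perception, favor efficiency
    return (6, 0.1)       # medium-to-high density: balance safety/efficiency

print(choose_hyperparams(0.07))  # (2, 0.05)
print(choose_hyperparams(0.21))  # (6, 0.1)
```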
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>The experimental results demonstrate that the proposed adaptive reward optimization method effectively addresses the safety-efficiency trade-off in complex dynamic environments through spatiotemporal risk field modeling and exponential decay mechanisms. Compared to conventional fixed-weight approaches [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>], our method achieves superior generalization across varying crowd densities by dynamically amplifying collision penalties in high-risk scenarios while relaxing constraints in sparse regions. This adaptability stems from the Gaussian kernel-based risk field, which quantifies scene complexity through integrated analysis of pedestrian speed, density, and distance&#x2014;an advancement over static Gaussian formulations that lack spatiotemporal awareness [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. The exponential decay reward further enhances responsiveness by assigning nonlinearly increasing penalties as the robot approaches pedestrians, forcing proactive detours without sacrificing navigation efficiency. These innovations explain the 9.0% improvement in success rates and 10.7% reduction in intrusion time observed in high-density scenarios, outperforming state-of-the-art methods like TGRF [<xref ref-type="bibr" rid="ref-11">11</xref>] and GST &#x002B; HH Attn [<xref ref-type="bibr" rid="ref-9">9</xref>].</p>
<p>The proposed method aligns with recent advancements in adaptive perception and decision-making systems for robotic navigation. For instance, Yi and Guan [<xref ref-type="bibr" rid="ref-28">28</xref>] emphasize the integration of hybrid deliberative-reactive architectures in reinforcement learning to balance strategic planning and real-time responses, a principle echoed in our adaptive reward mechanism. Their work highlights the scalability of DRL across diverse robotic platforms, while our method extends this by introducing interpretable hyperparameter tuning mechanisms (e.g., <italic>&#x03C3;</italic> and <italic>&#x03BB;</italic>) to handle dynamic crowd behavior. Similarly, Zhou and Garcke [<xref ref-type="bibr" rid="ref-17">17</xref>] leverage spatiotemporal graphs with attention mechanisms to model crowd interactions, demonstrating the critical role of temporal reasoning in proactive navigation. While their approach focuses on graph-based intention prediction, our work complements it by dynamically adjusting reward weights based on real-time risk assessments, bridging the gap between crowd behavior understanding and adaptive decision-making.</p>
<p>A critical distinction lies in the interpretability of our method. While deep reinforcement learning (DRL) methods often operate as &#x201C;black boxes&#x201D; [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>], our risk field explicitly links environmental dynamics to reward adjustments, enabling systematic hyperparameter tuning. For example, the correlation between &#x03C3; values and risk coverage (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>) provides actionable insights for adapting to specific scenarios&#x2014;a feature absent in end-to-end DRL approaches [<xref ref-type="bibr" rid="ref-28">28</xref>]. This interpretability complements vision-based semantic navigation systems that rely on transparent object detection metrics (e.g., mAP (mean Average Precision) and ODR (object detection rate) [<xref ref-type="bibr" rid="ref-29">29</xref>]), collectively advancing trustworthy robotic decision-making. Furthermore, our exponential decay mechanism addresses computational inefficiencies in dense crowds, resonating with Zhou and Garcke&#x2019;s [<xref ref-type="bibr" rid="ref-17">17</xref>] emphasis on efficient spatiotemporal aggregation but extending it through reward shaping rather than trajectory prediction.</p>
<p>However, limitations persist. The 2D simulation environment simplifies occlusion modeling and sensor noise, potentially overestimating performance in real-world settings. Future integration with multimodal perception systems, such as YOLO v8-based semantic navigation frameworks [<xref ref-type="bibr" rid="ref-29">29</xref>], could enhance environmental understanding by combining risk field dynamics with real-time object detection. Additionally, while our method reduces hyperparameter sensitivity compared to fixed-weight approaches, optimal &#x03C3; and &#x03BB; selection remains scenario-dependent. Automated parameter adaptation, inspired by the self-tuning mechanisms in graph-based navigation [<xref ref-type="bibr" rid="ref-17">17</xref>] and hybrid DRL architectures [<xref ref-type="bibr" rid="ref-28">28</xref>], could improve robustness across diverse environments. These extensions would bridge the gap between reward optimization and perception, fostering holistic navigation systems that operate in both structured and unstructured dynamic spaces.</p>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusions</title>
<p>This paper proposes a navigation method based on spatiotemporal risk field modeling and adaptive reward optimization to address the safety-efficiency trade-off in robotic navigation through complex dynamic environments. By constructing a risk field model that integrates crowd density distribution and pedestrian motion patterns, our approach enables real-time quantification of environmental complexity. Coupled with an exponential decay reward mechanism, this methodology addresses the adaptability limitations of conventional fixed-weight reward functions in varying crowd density scenarios. Experimental results demonstrate that, in comparison with the baseline method, the proposed method enhances the navigation success rate by 9% in high-density scenes and reduces intrusion time by 10.7%. This outcome substantiates the efficacy of balancing safety and efficiency through nonlinear safety constraint enhancement and dynamic adjustment of efficiency weight. Future work will construct a real-world environment testbed containing multimodal sensor data to verify the transferability of our methods from simulation to reality.</p>
</sec>
</body>
<back>
<ack>
<p>We sincerely acknowledge the financial support from the Sichuan Science and Technology Program. We also thank our laboratory members for their invaluable collaboration in experimental execution and data validation.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the Sichuan Science and Technology Program (2025ZNSFSC0005).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization, Jie He; methodology, Jie He; data collection and experimental design, Dongmei Zhao and Qingfeng Zou; formal analysis and writing&#x2014;original draft preparation, Jie He and Jian&#x2019;an Xie; supervision and project administration, Tao Liu. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data used in this paper are available from the corresponding author upon reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fragapane</surname> <given-names>G</given-names></string-name>, <string-name><surname>De Koster</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sgarbossa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Strandhagen</surname> <given-names>JO</given-names></string-name></person-group>. <article-title>Planning and control of autonomous mobile robots for intralogistics: literature review and research agenda</article-title>. <source>Eur J Oper Res</source>. <year>2021</year>;<volume>294</volume>(<issue>2</issue>):<fpage>405</fpage>&#x2013;<lpage>26</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ejor.2021.01.019</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Badu-Baiden</surname> <given-names>F</given-names></string-name>, <string-name><surname>Giroux</surname> <given-names>M</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Preference for robot service or human service in hotels? impacts of the COVID-19 pandemic</article-title>. <source>Int J Hosp Manag</source>. <year>2021</year>;<volume>93</volume>(<issue>2</issue>):<fpage>102795</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ijhm.2020.102795</pub-id>; <pub-id pub-id-type="pmid">36919174</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Le</surname> <given-names>H</given-names></string-name>, <string-name><surname>Saeedvand</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hsu</surname> <given-names>CC</given-names></string-name></person-group>. <article-title>A comprehensive review of mobile robot navigation using deep reinforcement learning algorithms in crowded environments</article-title>. <source>J Intell Robot Syst</source>. <year>2024</year>;<volume>110</volume>(<issue>4</issue>):<fpage>1</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10846-024-02198-w</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A review of brain-inspired cognition and navigation technology for mobile robots</article-title>. <source>Cyborg Bionic Syst</source>. <year>2024</year>;<volume>5</volume>(<issue>1</issue>):<fpage>0128</fpage>. doi:<pub-id pub-id-type="doi">10.34133/cbsystems.0128</pub-id>; <pub-id pub-id-type="pmid">38938902</pub-id></mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xue</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Safe and socially compliant robot navigation in crowds with fast-moving pedestrians via deep reinforcement learning</article-title>. <source>Robotica</source>. <year>2024</year>;<volume>42</volume>(<issue>4</issue>):<fpage>1212</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.1017/S0263574724000183</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Chakraborty</surname> <given-names>N</given-names></string-name>, <string-name><surname>Driggs-Campbell</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Decentralized structural-RNN for robot crowd navigation with deep reinforcement learning</article-title>. In: <conf-name>Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2021 May 30&#x2013;Jun 5</year>; <publisher-loc>Xi&#x2019;an, China</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA48506.2021.9561595</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Robot navigation in a crowd by integrating deep reinforcement learning and online planning</article-title>. <source>Appl Intell</source>. <year>2022</year>;<volume>52</volume>(<issue>13</issue>):<fpage>15600</fpage>&#x2013;<lpage>16</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10489-022-03191-2</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Risk-aware deep reinforcement learning for robot crowd navigation</article-title>. <source>Electronics</source>. <year>2023</year>;<volume>12</volume>(<issue>23</issue>):<fpage>4744</fpage>. doi:<pub-id pub-id-type="doi">10.3390/electronics12234744</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chakraborty</surname> <given-names>N</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>W</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Intention aware robot crowd navigation with attention-based interaction graph</article-title>. In: <conf-name>Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2023 May 29&#x2013;Jun 2</year>; <publisher-loc>London, UK</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA48891.2023.10160660</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kwak</surname> <given-names>D</given-names></string-name>, <string-name><surname>Rim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Belief aided navigation using Bayesian reinforcement learning for avoiding humans in blind spots</article-title>. In: <conf-name>Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>; <year>2024 Oct 14&#x2013;18</year>; <publisher-loc>Abu Dhabi, United Arab Emirates</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/IROS58592.2024.10802765</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>B</given-names></string-name>, <string-name><surname>Yura</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Transformable Gaussian reward function for socially aware navigation using deep reinforcement learning</article-title>. <source>Sensors</source>. <year>2024</year>;<volume>24</volume>(<issue>14</issue>):<fpage>4540</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s24144540</pub-id>; <pub-id pub-id-type="pmid">39065937</pub-id></mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kreiss</surname> <given-names>S</given-names></string-name>, <string-name><surname>Alahi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning</article-title>. In: <conf-name>Proceedings of the 2019 International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2019 May 20&#x2013;24</year>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA.2019.8794134</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hart</surname> <given-names>PE</given-names></string-name>, <string-name><surname>Nilsson</surname> <given-names>NJ</given-names></string-name>, <string-name><surname>Raphael</surname> <given-names>B</given-names></string-name></person-group>. <article-title>A formal basis for the heuristic determination of minimum cost paths</article-title>. <source>IEEE Trans Syst Sci Cybern</source>. <year>1968</year>;<volume>4</volume>(<issue>2</issue>):<fpage>100</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TSSC.1968.300136</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sotirchos</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ajanovic</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Search-based versus sampling-based robot motion planning: a comparative study</article-title>. <comment>arXiv:2406.09623. 2024</comment>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Van Den Berg</surname> <given-names>J</given-names></string-name>, <string-name><surname>Guy</surname> <given-names>SJ</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>M</given-names></string-name>, <string-name><surname>Manocha</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Reciprocal <italic>n</italic>-body collision avoidance</article-title>. In: <conf-name>Proceedings of the 14th International Symposium ISRR</conf-name>; <year>2009 Aug 31&#x2013;Sep 3</year>; <publisher-loc>Lucerne, Switzerland</publisher-loc>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>BE</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Robot navigation in crowds by graph convolutional networks with attention learned from human gaze</article-title>. <source>IEEE Robot Autom Lett</source>. <year>2020</year>;<volume>5</volume>(<issue>2</issue>):<fpage>2754</fpage>&#x2013;<lpage>61</lpage>. doi:<pub-id pub-id-type="doi">10.1109/LRA.2020.2972868</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Garcke</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Learning crowd behaviors in navigation with attention-based spatial-temporal graphs</article-title>. In: <conf-name>Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2024 May 13&#x2013;17</year>; <publisher-loc>Yokohama, Japan</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA57147.2024.10610279</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Deep reinforcement learning based mobile robot navigation: a review</article-title>. <source>Tsinghua Sci Technol</source>. <year>2021</year>;<volume>26</volume>(<issue>5</issue>):<fpage>674</fpage>&#x2013;<lpage>91</lpage>. doi:<pub-id pub-id-type="doi">10.26599/TST.2021.9010012</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ibrahim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mostafa</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jnadi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Salloum</surname> <given-names>H</given-names></string-name>, <string-name><surname>Osinenko</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>175473</fpage>&#x2013;<lpage>500</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2024.3504735</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Montero</surname> <given-names>EE</given-names></string-name>, <string-name><surname>Mutahira</surname> <given-names>H</given-names></string-name>, <string-name><surname>Pico</surname> <given-names>N</given-names></string-name>, <string-name><surname>Muhammad</surname> <given-names>MS</given-names></string-name></person-group>. <article-title>Dynamic warning zone and a short-distance goal for autonomous robot navigation using deep reinforcement learning</article-title>. <source>Complex Intell Syst</source>. <year>2024</year>;<volume>10</volume>(<issue>1</issue>):<fpage>1149</fpage>&#x2013;<lpage>66</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s40747-023-01216-y</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Patel</surname> <given-names>U</given-names></string-name>, <string-name><surname>Kumar</surname> <given-names>NKS</given-names></string-name>, <string-name><surname>Sathyamoorthy</surname> <given-names>AJ</given-names></string-name>, <string-name><surname>Manocha</surname> <given-names>D</given-names></string-name></person-group>. <article-title>DWA-RL: dynamically feasible deep reinforcement learning policy for robot navigation among mobile obstacles</article-title>. In: <conf-name>Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2021 May 30&#x2013;Jun 5</year>; <publisher-loc>Xi&#x2019;an, China</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA48506.2021.9561462</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Oh</surname> <given-names>J</given-names></string-name>, <string-name><surname>Heo</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>G</given-names></string-name>, <string-name><surname>Kang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Park</surname> <given-names>J</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>SCAN: socially-aware navigation using Monte Carlo tree search</article-title>. In: <conf-name>Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2023 May 29&#x2013;Jun 2</year>; <publisher-loc>London, UK</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA48891.2023.10160270</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jeong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hassani</surname> <given-names>H</given-names></string-name>, <string-name><surname>Morari</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>DD</given-names></string-name>, <string-name><surname>Pappas</surname> <given-names>GJ</given-names></string-name></person-group>. <article-title>Deep reinforcement learning for active target tracking</article-title>. In: <conf-name>Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>; <year>2021 May 30&#x2013;Jun 5</year>; <publisher-loc>Xi&#x2019;an, China</publisher-loc>. doi:<pub-id pub-id-type="doi">10.1109/ICRA48506.2021.9561258</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Samsani</surname> <given-names>SS</given-names></string-name>, <string-name><surname>Mutahira</surname> <given-names>H</given-names></string-name>, <string-name><surname>Muhammad</surname> <given-names>MS</given-names></string-name></person-group>. <article-title>Memory-based crowd-aware robot navigation using deep reinforcement learning</article-title>. <source>Complex Intell Syst</source>. <year>2023</year>;<volume>9</volume>(<issue>2</issue>):<fpage>2147</fpage>&#x2013;<lpage>58</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s40747-022-00906-3</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rios-Martinez</surname> <given-names>J</given-names></string-name>, <string-name><surname>Spalanzani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Laugier</surname> <given-names>C</given-names></string-name></person-group>. <article-title>From proxemics theory to socially-aware navigation: a survey</article-title>. <source>Int J Soc Robot</source>. <year>2015</year>;<volume>7</volume>(<issue>2</issue>):<fpage>137</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s12369-014-0251-1</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Goyal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Niekum</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mooney</surname> <given-names>RJ</given-names></string-name></person-group>. <article-title>Using natural language for reward shaping in reinforcement learning</article-title>. <comment>arXiv:190302020. 2019</comment>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>H</given-names></string-name></person-group>. <article-title>SafeCrowdNav: safety evaluation of robot crowd navigation in complex scenes</article-title>. <source>Front Neurorobot</source>. <year>2023</year>;<volume>17</volume>:<fpage>1276519</fpage>. doi:<pub-id pub-id-type="doi">10.3389/fnbot.2023.1276519</pub-id>; <pub-id pub-id-type="pmid">37904892</pub-id></mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guan</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Research on autonomous navigation and control algorithm of intelligent robot based on reinforcement learning</article-title>. <source>Scalable Comput Pract Exp</source>. <year>2025</year>;<volume>26</volume>(<issue>1</issue>):<fpage>423</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.12694/scpe.v26i1.3841</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alotaibi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Alatawi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Binnouh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Duwayriat</surname> <given-names>L</given-names></string-name>, <string-name><surname>Alhmiedat</surname> <given-names>T</given-names></string-name>, <string-name><surname>Alia</surname> <given-names>OM</given-names></string-name></person-group>. <article-title>Deep learning-based vision systems for robot semantic navigation: an experimental study</article-title>. <source>Technologies</source>. <year>2024</year>;<volume>12</volume>(<issue>9</issue>):<fpage>157</fpage>. doi:<pub-id pub-id-type="doi">10.3390/technologies12090157</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>