<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">67117</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.067117</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Dynamic Decoupling-Driven Cooperative Pursuit for Multi-UAV Systems: A Multi-Agent Reinforcement Learning Policy Optimization Approach</article-title>
<alt-title alt-title-type="left-running-head">Dynamic Decoupling-Driven Cooperative Pursuit for Multi-UAV Systems: A Multi-Agent Reinforcement Learning Policy Optimization Approach</alt-title>
<alt-title alt-title-type="right-running-head">Dynamic Decoupling-Driven Cooperative Pursuit for Multi-UAV Systems: A Multi-Agent Reinforcement Learning Policy Optimization Approach</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Lei</surname><given-names>Lei</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wu</surname><given-names>Chengfu</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>chiefwu@nwpu.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Huaimin</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Automation, Northwestern Polytechnical University</institution>, <addr-line>Xi&#x2019;an, 710072</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>National Key Laboratory of Unmanned Aerial Vehicle Technology, Northwestern Polytechnical University</institution>, <addr-line>Xi&#x2019;an, 710072</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Chengfu Wu. Email: <email>chiefwu@nwpu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>29</day><month>08</month><year>2025</year>
</pub-date>
<volume>85</volume>
<issue>1</issue>
<fpage>1339</fpage>
<lpage>1363</lpage>
<history>
<date date-type="received">
<day>25</day>
<month>4</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>6</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_67117.pdf"></self-uri>
<abstract>
<p>This paper proposes a Multi-Agent Attention Proximal Policy Optimization (MA2PPO) algorithm to address the problems of credit assignment, low collaboration efficiency, and weak policy generalization in cooperative pursuit tasks involving multiple unmanned aerial vehicles (UAVs). Traditional algorithms often fail to identify the critical cooperative relationships in such tasks, leading to low capture efficiency and a significant decline in performance as the system scales up. To tackle these issues, MA2PPO builds on the proximal policy optimization (PPO) algorithm, adopts the centralized training with decentralized execution (CTDE) framework, and introduces a dynamic decoupling mechanism: a multi-head attention (MHA) module shared among the critics during centralized training to solve the credit assignment problem. This mechanism enables the pursuers to identify highly correlated interactions with their teammates, effectively eliminate irrelevant and weakly relevant interactions, and decompose large-scale cooperation problems into decoupled sub-problems, thereby enhancing collaborative efficiency and policy stability among multiple agents. Furthermore, a reward function combining a formation reward with a distance reward is devised to guide the pursuers to encircle the escapee, which incentivizes the UAVs to develop sophisticated cooperative pursuit strategies. Experimental results demonstrate that the proposed algorithm achieves multi-UAV cooperative pursuit and induces diverse cooperative pursuit behaviors among the UAVs. Moreover, scalability experiments demonstrate that the algorithm is suitable for large-scale multi-UAV systems.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Multi-agent reinforcement learning</kwd>
<kwd>multi-UAV systems</kwd>
<kwd>pursuit-evasion games</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Research and Development Program of China</funding-source>
<award-id>JCKY2018607C019</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Key Laboratory Fund of UAV of Northwestern Polytechnical University</funding-source>
<award-id>2021JCJQLB07101</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Unmanned aerial vehicles (UAVs) play a vital role in a variety of fields given their affordability, ease of deployment, high maneuverability, and capability to operate in harsh environments [<xref ref-type="bibr" rid="ref-1">1</xref>]. However, the limited detection range and payload capacity of a single UAV often hinder its ability to complete complex missions alone. Multi-UAV systems, with their flexibility, intelligence, autonomy, and robustness, have garnered significant research attention. Applications of multi-UAV systems include search and rescue (SAR) [<xref ref-type="bibr" rid="ref-2">2</xref>], security patrol systems [<xref ref-type="bibr" rid="ref-3">3</xref>], military reconnaissance [<xref ref-type="bibr" rid="ref-4">4</xref>], wireless communication support [<xref ref-type="bibr" rid="ref-5">5</xref>], medical delivery [<xref ref-type="bibr" rid="ref-6">6</xref>], and civilian agriculture [<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<p>In the emerging battlefield of the intelligent swarm era, multi-UAV cooperative pursuit-evasion games represent one of the core applications of UAV cluster systems. Consider a scenario in which multiple UAVs are tasked with intercepting a rogue UAV that has entered restricted airspace. The evading UAV exhibits agile, non-cooperative behavior and dynamically adjusts its trajectory based on the positions of the pursuing UAVs. To achieve successful interception, the pursuers must coordinate their actions in real time under limited communication constraints and capture the intruder before it escapes the designated area. These games involve players with conflicting goals: the pursuers intend to catch the evader as quickly as possible, while the evader attempts to avoid capture [<xref ref-type="bibr" rid="ref-8">8</xref>]. We study a pursuit-evasion game involving cooperative pursuit by UAVs against an escapee. Nonetheless, the high degree of dynamism and uncertainty inherent in adversarial games presents a significant challenge for decision-makers seeking an optimal solution. Furthermore, the kinematics of UAVs must satisfy speed and acceleration limits in realistic environments [<xref ref-type="bibr" rid="ref-9">9</xref>], which adds complexity to decision-making and control for UAVs in pursuit-evasion games and requires further exploration.</p>
<p>In the field of differential games, several studies have been conducted on pursuit-evasion games. The first of these was carried out by Isaacs, who addressed the one-to-one robot pursuit-evasion problem by formulating and solving partial differential equations [<xref ref-type="bibr" rid="ref-10">10</xref>]. Pursuit-evasion games are typically modeled using the Hamilton-Jacobi-Isaacs (HJI) equation, which is solved to determine the optimal policy of the pursuer [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>]. In [<xref ref-type="bibr" rid="ref-13">13</xref>], the multi-player pursuit-evasion (MPE) differential game is discussed, and an N-player nonzero-sum game framework is established to study the relationships among players. Nevertheless, these approaches struggle to characterize cooperation among multiple pursuers or evaders and suffer from the computational complexity of solving HJI equations.</p>
<p>Intelligent optimization algorithms have also been employed to solve pursuit-evasion problems, offering novel ideas. Some researchers have proposed decision-making methods for the pursuers using a Voronoi diagram [<xref ref-type="bibr" rid="ref-14">14</xref>] and a fuzzy tree [<xref ref-type="bibr" rid="ref-15">15</xref>]. In [<xref ref-type="bibr" rid="ref-16">16</xref>], inspired by the hunting and foraging behaviors of group predators, the authors suggest a cooperative scheme in which the pursuers regulate the encirclement formed around the evader and continually shrink it until the evader is captured. However, these methods rely on environmental models or expert knowledge and do not guarantee the optimality of decision-making.</p>
<p>Multi-agent reinforcement learning (MARL) has proven to be an effective approach for solving multi-UAV pursuit games [<xref ref-type="bibr" rid="ref-17">17</xref>]. These games are represented as a decentralized partially observable Markov decision process (Dec-POMDP) in MARL [<xref ref-type="bibr" rid="ref-18">18</xref>], with training leveraging the widely used centralized training and decentralized execution (CTDE) framework. CTDE facilitates centralized training with access to global information, while agents make decisions based on local observations during execution, resulting in reduced training complexity and more efficient execution [<xref ref-type="bibr" rid="ref-19">19</xref>]. However, when agents learn independently from a global reward, &#x201C;lazy agents&#x201D; may arise: an agent&#x2019;s poor actions can be rewarded because of the favorable actions of its teammates. This is the most common form of the credit assignment problem in MARL [<xref ref-type="bibr" rid="ref-20">20</xref>]. Once agents fall into this trap, they may take locally profitable actions that yield poor overall performance. Given the objective of training agents to maximize the global reward, addressing this issue is of utmost importance.</p>
<p>The methods for addressing credit assignment in MARL can be classified into two categories: implicit and explicit. Implicit credit assignment is a value-based approach; popular algorithms in this category include VDN [<xref ref-type="bibr" rid="ref-21">21</xref>], QMIX [<xref ref-type="bibr" rid="ref-22">22</xref>], and QTRAN [<xref ref-type="bibr" rid="ref-23">23</xref>]. These algorithms are designed to decompose the centralized value function into decentralized value functions in a reasonable manner. Explicit assignment algorithms, such as COMA [<xref ref-type="bibr" rid="ref-24">24</xref>], are based on policy gradients and employ a counterfactual baseline to compare the global reward with the reward obtained when an agent&#x2019;s action is replaced by a default action. In [<xref ref-type="bibr" rid="ref-25">25</xref>], a meta-gradient algorithm is proposed to achieve credit assignment in fully cooperative environments by calculating the marginal contribution of each agent. Reference [<xref ref-type="bibr" rid="ref-26">26</xref>] improves the pursuit-evasion policies of UAVs by incorporating imitation learning. To tackle the multi-UAV pursuit-evasion problem, reference [<xref ref-type="bibr" rid="ref-27">27</xref>] proposes a spatiotemporally efficient detection network, while reference [<xref ref-type="bibr" rid="ref-28">28</xref>] introduces a crown-shaped bidirectional cooperative target prediction network tailored for multi-agent collaboration. However, these methods fail to consider the heterogeneity of information provided by different agents and treat all agents equally, leading to inefficient collaboration.</p>
<p>Recent studies have introduced the attention mechanism as a new idea for solving credit assignment problems in MARL. The multi-actor-attention-critic (MAAC) approach [<xref ref-type="bibr" rid="ref-29">29</xref>] shares an attention mechanism among centrally computed critics, leading to more efficient and scalable learning. Another study [<xref ref-type="bibr" rid="ref-30">30</xref>] introduces a complete graph and a two-stage attention network to study multi-agent games. In addition, the authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] propose a hierarchical graph attention network that models the hierarchical relationships among agents and uses two attention networks to represent interactions at the individual and group levels. Reference [<xref ref-type="bibr" rid="ref-32">32</xref>] introduces a graph attention-based evaluation framework integrated with factorized normalizing flow algorithms. However, these methods use the attention mechanism only to extract information and do not exploit the attention weights further. We apply the attention weights so that each agent can temporarily decouple from teammates receiving low attention and cooperate more closely with teammates exhibiting highly correlated interactions. In this fashion, large-scale cooperative tasks can be dynamically split into decoupled sub-problems, enabling agents to collaborate more effectively.</p>
<p>The design of the reward function is key to the success of real-time pursuit games [<xref ref-type="bibr" rid="ref-33">33</xref>]. In the worst case, pursuers that learn only inefficient capture strategies may fail to catch a faster escapee. Previous work has proposed a decentralized multi-target tracking algorithm based on the maximum reciprocal reward to learn collaborative strategies [<xref ref-type="bibr" rid="ref-34">34</xref>], and constructed a potential-based individual reward function to accelerate policy learning [<xref ref-type="bibr" rid="ref-35">35</xref>]. Reference [<xref ref-type="bibr" rid="ref-36">36</xref>] develops a unique reward mechanism designed to enhance the generation of action sequences in UAV operations. Nevertheless, these works do not consider a reward function that directs the pursuers to form an encircling formation. To address this issue, we propose a novel reward function that includes a formation reward. This enables the pursuers to acquire intricate pursuit tactics such as interception and encirclement, thus effectively completing the pursuit assignment. Our approach enhances the effectiveness and reliability of multi-UAV cooperation in catching a faster escapee. Our main contributions include:
<list list-type="bullet">
<list-item>
<p>We propose a Multi-Agent Attention Proximal Policy Optimization (MA2PPO) algorithm, which builds upon the foundations of proximal policy optimization (PPO) and CTDE. We design a centralized critic network and a decentralized actor network for each UAV to leverage global information during policy training while enabling each UAV to execute its decentralized policy using only its local observation after training. The MA2PPO algorithm enables efficient pursuit strategies for multi-UAV systems in highly dynamic pursuit-escape scenarios.</p></list-item>
<list-item>
<p>We propose a dynamic decoupling mechanism to solve the cooperative multi-UAV credit assignment problem. It shares a multi-head attention (MHA) mechanism among the critics during centralized training, which enables the pursuers to select relevant information from their teammates in real time. Dynamic decoupling identifies highly correlated interactions among the pursuers and eliminates unnecessary and weakly correlated ones. This approach decomposes the large-scale cooperative problem into smaller, decoupled sub-problems, improving collaboration among agents.</p></list-item>
<list-item>
<p>We design a new reward function that combines a formation reward and a distance reward with appropriately selected weights. The distance reward dominates the cooperative reward function when the pursuers are far from the escapee, while the formation reward dominates as the pursuers approach the escapee. The new reward encourages the pursuers to spread out around the escapee and penalizes angular proximity, promoting the learning of complex cooperative pursuit strategies and improving the capture probability of the multi-UAV system by guiding the pursuers to the optimal policy quickly (see the illustrative sketch following this list).</p></list-item>
</list></p>
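<p>To make the combined reward concrete, the following minimal Python sketch illustrates one way the distance and formation terms could be blended. The blending schedule, the variance-based spacing penalty, and all numeric thresholds here are our own illustrative assumptions, not the exact formulation developed later in this paper.</p>
<preformat>
import numpy as np

def pursuit_reward(dists, angles, d_capture=1.0, d_far=20.0):
    """Illustrative blend of a distance reward and a formation reward.

    dists:  pursuer-to-escapee distances
    angles: pursuer bearing angles around the escapee (radians)
    """
    d_mean = np.mean(dists)
    # Distance term: closer to the escapee is better.
    r_dist = -d_mean
    # Formation term: reward even angular spacing around the escapee by
    # penalizing the variance of the gaps between neighboring bearings.
    ang = np.sort(np.mod(angles, 2.0 * np.pi))
    gaps = np.diff(np.append(ang, ang[0] + 2.0 * np.pi))  # includes wrap-around gap
    r_form = -np.var(gaps)
    # Assumed schedule: w rises toward 1 as the pursuers close in, so the
    # distance term dominates far away and the formation term up close.
    w = np.clip((d_far - d_mean) / (d_far - d_capture), 0.0, 1.0)
    return (1.0 - w) * r_dist + w * r_form
</preformat>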
<p>The rest of this article is organized as follows. We briefly present the basic theoretical knowledge of MARL and MHA in <xref ref-type="sec" rid="s2">Section 2</xref>. <xref ref-type="sec" rid="s3">Section 3</xref> describes the pursuit-evasion environment, the dynamics equation for UAVs, and Dec-POMDP. In <xref ref-type="sec" rid="s4">Section 4</xref>, the MA2PPO method is developed in detail. Experimental results and discussion are presented in <xref ref-type="sec" rid="s5">Section 5</xref>. Finally, <xref ref-type="sec" rid="s6">Section 6</xref> gives conclusions and looks ahead to future work.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Preliminaries</title>
<sec id="s2_1">
<label>2.1</label>
<title>Multi-Agent Reinforcement Learning</title>
<p>MARL is mathematically modeled by a Dec-POMDP, a Markov game that extends the Markov decision process (MDP) to the multi-agent setting, defined as a tuple <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mo>&#x003C;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>&#x1D49C;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>&#x1D4AB;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo>&#x003E;</mml:mo></mml:math></inline-formula>, where</p>
<p><inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:math></inline-formula> is the set of agents, and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, where <italic>N</italic> is the number of agents,</p>
<p><inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow></mml:math></inline-formula> is the global state space and the state <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow></mml:math></inline-formula>,</p>
<p><inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mrow><mml:mi>&#x1D49C;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is the joint action space of all agents, <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x1D49C;</mml:mi></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>,</p>
<p><inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:mrow><mml:mi>&#x1D4AB;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is the transition probability, which denotes the probability of the state <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> to the next state <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> after performing the joint action <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>,</p>
<p><inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the reward function, which gives the reward value after performing the action <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> under a given state <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>,</p>
<p><inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is the joint individual observation space of agents under a given state <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, the joint individual observation <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>.</p>
<p>Each agent performs the corresponding action according to its local observation based on the policy <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>:</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">&#x2192;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. The goal of each agent is to maximize the discounted return <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>R</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mi>&#x03B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is the discount factor and <italic>T</italic> is the time horizon.</p>
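<p>As a concrete illustration, the discounted return defined above can be computed with a single backward pass over an episode. The short Python sketch below is ours and assumes scalar per-step rewards collected over a finite horizon.</p>
<preformat>
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # R_t = sum_k gamma**k * r_{t+k}, accumulated backwards in one pass.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps with reward 1 each and gamma = 0.9
# gives returns [2.71, 1.9, 1.0].
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
</preformat>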
<p>The joint policy of all agents is <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The standard definitions of the action-value function <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, state-value function <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and advantage function <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msup><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:mrow></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> denotes the mathematical expectation.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Proximal Policy Optimization</title>
<p>PPO [<xref ref-type="bibr" rid="ref-37">37</xref>] is an algorithm that combines the policy gradient method with the actor-critic framework. The actor represents a stochastic policy, which generates a probability distribution over actions given a particular state. The critic is a value function estimator that helps train the actor by guiding the policy towards actions that lead to high-value states. One distinctive feature of PPO is its use of importance sampling, which distinguishes between the old and new policies and allows sampled data to be reused efficiently, improving data utilization. The PPO optimization function uses a policy clipping mechanism to prevent large changes in the action distribution during updates, and employs the advantage function as the gradient weight. It can be expressed as follows:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>I</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo>&#x003F5;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mo>&#x003F5;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the ratio of the action probability under the <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to the previous policy <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>; <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the trainable hyperparameters of the new policy and the old policy, respectively; <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo 
stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the advantage function estimate, calculated by generalized advantage estimator (GAE) [<xref ref-type="bibr" rid="ref-38">38</xref>]; <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>p</mml:mi></mml:math></inline-formula>(<inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>) is the clipping function that limits <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to a range <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo>&#x003F5;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mo>&#x003F5;</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>; <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mo>&#x003F5;</mml:mo><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> is the cropping parameter.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Attention Mechanism</title>
<p>When we encounter something in our daily lives, we quickly identify and attend to different parts of our environment [<xref ref-type="bibr" rid="ref-39">39</xref>]. This process can be expressed as
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the generation of attention, which is the process of focusing on differentiating regions; <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents processing input <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>x</mml:mi></mml:math></inline-formula> based on attention <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which is consistent with processing key areas and acquiring information.</p>
<p>The attention mechanism operates in a manner akin to the differentiable key-value memory model [<xref ref-type="bibr" rid="ref-40">40</xref>]. By leveraging the correlation between a given key and query, the attention mechanism enables the model to dynamically focus on relevant information. It maps the pairs of a query <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>q</mml:mi></mml:math></inline-formula>, key <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>k</mml:mi></mml:math></inline-formula>, and value <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>v</mml:mi></mml:math></inline-formula> to an output that is a weighted sum of the value vector <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>v</mml:mi></mml:math></inline-formula>, where the weights are determined by the key and query vectors. We describe it as
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mfrac><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mi>v</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:math></inline-formula>(<inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mrow><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>) is a normalized exponential function; <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the dimension of the query, key, and value vector; <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>q</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>v</mml:mi></mml:math></inline-formula> are embedded vectors. Structured as a linear combination of <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>v</mml:mi></mml:math></inline-formula>, attention can learn this embedding computing by the dot product <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>q</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> which measures how compatible they are. Computing the dot product <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>q</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is used to measure the compatibility of embedded information.</p>
<p>MHA uses multiple query vectors <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> to select sets of information from the input information <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mo stretchy="false">[</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>V</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> in parallel, where <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>m</mml:mi></mml:math></inline-formula> is the number of attention heads. During the query process, each query vector <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> will focus on different parts of the input information, that is, analyze the current input information from different perspectives. It allows the model to jointly focus on information from different representation subspaces
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>Q</mml:mi><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:math></inline-formula>(<inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mrow><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>) is the concatenation function; <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are learnable parameter matrices. In MHA, <italic>Q</italic>, <italic>K</italic>, and <italic>V</italic> are linearly projected <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>m</mml:mi></mml:math></inline-formula> times by different and learnable linear projection matrices.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>System Model and Problem Formulation</title>
<sec id="s3_1">
<label>3.1</label>
<title>System Model</title>
<p>The pursuit-evasion environment is a continuous, two-dimensional finite area of dimension <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>L</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula>. Our scenario contains <italic>N</italic> homogeneous pursuers and one escapee. The objective of the pursuers is to complete the capture mission quickly using striking or expelling tactics, while the escapee aims to evade the pursuers. The mission succeeds when the distance between the escapee and at least one pursuer falls below the capture radius, defined as the payload range or weapon attack range of the pursuers. Failure to apprehend the escapee within a specified time frame results in mission failure.</p>
<p>For simplicity and generality, we model all UAVs as particles in the environment and assume that they fly at the same altitude. The two-dimensional kinematic model of the UAVs [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-41">41</xref>] is as follows:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd /></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd /></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:mover><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x02D9;</mml:mo></mml:mover></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd /></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the position of UAV <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>i</mml:mi></mml:math></inline-formula>; <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the linear velocity; <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the heading angle; <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the angular velocity; <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>i</mml:mi></mml:math></inline-formula> refers to the pursuer or the escapee.</p>
<p>We assume that all UAVs fly at a constant speed; a similar assumption is used in [<xref ref-type="bibr" rid="ref-34">34</xref>]. The control variable is <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, constrained to the range <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. The speed of the escapee is set higher than that of the pursuers to encourage cooperation among the pursuers and to increase the difficulty of the pursuit-evasion game. Each UAV obtains position, attitude, distance, and angle information through GPS and onboard sensors.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Problem Formulation</title>
<p>In this subsection, the system is modeled as a Dec-POMDP. We then formulate the multi-UAV cooperative pursuit-evasion game.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Agent</title>
<p>Each pursuer can be viewed as an agent. During the interaction, each pursuer gathers information about teammates within its communication range, about the escapee within its detection range, and about its own state from its sensors. From this information, the pursuers form observations of the environment and output actions according to their policies. The environment provides feedback in the form of rewards for the actions taken. The policies are updated dynamically throughout the pursuit to promote optimal actions and ensure coordinated pursuit by the UAV swarm.</p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Observation</title>
<p>Due to the inherent limitations of their sensing capabilities, each UAV is only able to observe a limited portion of the real environment. As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the observation of UAV <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>i</mml:mi></mml:math></inline-formula> is <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mtext>1</mml:mtext><mml:mo>,</mml:mo><mml:mtext>2</mml:mtext><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>j</mml:mi></mml:math></inline-formula>, <italic>N</italic> is the number of pursuers,</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The observation of pursuer <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi>i</mml:mi></mml:math></inline-formula></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-1.tif"/>
</fig>
<p><inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi>e</mml:mi></mml:math></inline-formula> is the escapee,</p>
<p><inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the heading angles of pursuer <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>i</mml:mi></mml:math></inline-formula>, pursuer <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mi>j</mml:mi></mml:math></inline-formula> and escapee <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mi>e</mml:mi></mml:math></inline-formula>, respectively,</p>
<p><inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> is the information of UAV <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>i</mml:mi></mml:math></inline-formula> itself,</p>
<p><inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> is the information of other pursuers in the communication range, in which <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the Euclidean distance between pursuer <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mi>i</mml:mi></mml:math></inline-formula> with other pursuer <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>j</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mrow><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the heading error, defined as the angle between the heading of pursuer <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>i</mml:mi></mml:math></inline-formula> and the vector from pursuer <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mi>i</mml:mi></mml:math></inline-formula> to pursuer <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>j</mml:mi></mml:math></inline-formula>,</p>
<p><inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> is the information of the escapee, in which <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the Euclidean distance between pursuer <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>i</mml:mi></mml:math></inline-formula> with escapee <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>e</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the heading error, defined as the angle between the heading of pursuer <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>i</mml:mi></mml:math></inline-formula> and the vector from pursuer <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>i</mml:mi></mml:math></inline-formula> to escapee <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi>e</mml:mi></mml:math></inline-formula>.</p>
<p>Consequently, the state representation of each pursuer comprises <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mn>2</mml:mn><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> variables, so the observation dimension grows linearly with the number of pursuers <italic>N</italic>. To standardize the value ranges, eliminate scale-related errors, and improve the effectiveness of neural network training, each state variable is normalized.</p>
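<p>As an illustration of how such an observation vector might be assembled and normalized, consider the following Python sketch; the normalization constants (area size, maximum speed) are placeholders, not values from this paper.</p>
<preformat>
import numpy as np

def build_observation(own, others, evader, L=100.0, W=100.0, v_max=10.0):
    """Assemble o_i = {o_ii, o_ij, o_ie} as a flat, normalized vector.

    own    = (x_i, y_i, v_i, phi_i)                        -> 4 values
    others = [(d_ij, delta_ij), ...] for the N-1 teammates -> 2(N-1) values
    evader = (d_ie, delta_ie)                              -> 2 values
    Total: 2N + 4 state variables, linear in the number of pursuers N."""
    d_max = np.hypot(L, W)  # largest possible distance in the L x W area
    x, y, v, phi = own
    o = [x / L, y / W, v / v_max, phi / np.pi]
    for d, delta in others:
        o += [d / d_max, delta / np.pi]
    d_e, delta_e = evader
    o += [d_e / d_max, delta_e / np.pi]
    return np.asarray(o, dtype=np.float32)
</preformat>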
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Action</title>
<p>The action space is the set of all actions that the pursuers can perform. During the multi-UAV confrontation, each pursuer selects its actions based on its own observations. Since the UAVs fly at a constant speed, control reduces to angular velocity control, with the effect of wind neglected. As the angular velocity varies, the states of the UAVs change accordingly. The angular velocity lies within the interval <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mrow><mml:mtext>rad/s</mml:mtext></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mrow><mml:mtext>rad/s</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. To facilitate learning, we discretize the continuous angular velocity range into eight distinct actions for each pursuer.</p>
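<p>One straightforward discretization is a uniform grid over the admissible interval, as in the sketch below; the paper fixes only the number of actions (eight), so the uniform spacing is an assumption.</p>
<preformat>
import numpy as np

# Eight evenly spaced angular-velocity actions over [-3, 3] rad/s
# (uniform spacing is an assumption; only the action count is fixed).
ACTIONS = np.linspace(-3.0, 3.0, num=8)   # rad/s

def action_to_omega(action_index):
    """Map a discrete action index to its angular-velocity command."""
    return ACTIONS[action_index]
</preformat>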
</sec>
<sec id="s3_2_4">
<label>3.2.4</label>
<title>Reward Function</title>
<p>It is important to design an effective reward function to guide the learning process of the multi-UAV system. In the pursuit task, the reward function plays a crucial role in controlling the pursuers to chase the escapee while avoiding collisions and staying within the boundaries. We propose two types of rewards: cooperative pursuit reward and punishment reward.</p>
<p>The formation reward received by the pursuers at each step <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mi>t</mml:mi></mml:math></inline-formula> is given by:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> denotes the angle between pursuer <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi>i</mml:mi></mml:math></inline-formula> and pursuer <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>j</mml:mi></mml:math></inline-formula> relative to the direction of the evader, <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mtext>1</mml:mtext><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mtext>2</mml:mtext><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>N</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>j</mml:mi></mml:math></inline-formula>. A smaller value of this angle indicates that the pursuers are distributed in more diverse directions (i.e., forming larger mutual angles), which implies a better encirclement effect. Conversely, a larger angle suggests that the pursuers are aligned in similar directions, leading to a less effective encirclement.</p>
<p>The distance-based reward function for the pursuers is defined as follows:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C2;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> denotes the distance threshold for a successful capture, <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>&#x03C2;</mml:mi></mml:math></inline-formula> is a constant controlling the steepness of the curve, and <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> represents the Euclidean distance between pursuer <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>i</mml:mi></mml:math></inline-formula> and evader <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>e</mml:mi></mml:math></inline-formula>. A reward function with dynamic decay encourages the pursuers to be more sensitive to distance. The penalty is large when the distance is far but decreases rapidly as it narrows, which helps avoid overly aggressive close-range pursuit.</p>
<p>Therefore, the cooperative reward function for the pursuers is defined as:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mspace width="2em" /><mml:mspace width="2em" /><mml:mspace width="2em" /><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:mspace width="2em" /><mml:mspace width="2em" /><mml:mspace width="1em" /><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x2203;</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mtext>&#xA0;&#xA0;</mml:mtext><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> 
and <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the weights of formation reward and distance, respectively. Each pursuer receives a negative reward at every stage of the unfinished mission, which is a weighted linear combination of <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. The pursuers that capture the escapee will receive the reward <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, and the others will receive <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>.</p>
<p>We design a cooperative reward function that encourages pursuers to approach the evader from different directions, forming an encirclement pattern that yields a formation reward. However, if the agents only aim to spread around the evader without moving closer, the pursuit task cannot be completed. To address this, we introduce a dynamically decaying distance reward. When a pursuer is far from the evader, the distance reward dominates the cooperative reward function; as the pursuer approaches the evader, the formation reward becomes more prominent. This new reward formulation encourages pursuers to spread out around the evader while penalizing similar approach angles, thereby promoting the learning of complex cooperative pursuit strategies. It also increases the probability of successful capture in multi-UAV systems by incentivizing agents to quickly acquire optimal policies.</p>
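<p>Putting Eqs. (10)-(12) together, the cooperative reward can be sketched in Python as follows; the magnitudes of r_win and r_helper and the weights w_f and w_d are illustrative assumptions:</p>
<preformat>
import numpy as np

def cooperative_reward(i, d, thetas, d_th=1.0, r_win=10.0, r_helper=5.0,
                       w_f=0.5, w_d=0.5, varsigma=1.0):
    """Eq. (12) under assumed values of r_win, r_helper and the weights.

    d[k] is pursuer k's distance to the evader; thetas are the angles
    theta_ij entering the formation reward of pursuer i."""
    if not d[i] > d_th:                          # pursuer i captures: case 1
        return r_win
    if any(not d[j] > d_th for j in range(len(d)) if j != i):
        return r_helper                          # a teammate captures: case 2
    r_form = np.mean(np.cos(thetas)) ** 2                          # Eq. (10)
    r_dist = 1.0 / (1.0 + np.exp(-varsigma * (d[i] - d_th)))       # Eq. (11)
    return -w_f * r_form - w_d * r_dist          # otherwise: weighted penalty
</preformat>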
<p><bold>Punishment reward.</bold> UAVs must pursue the evader while avoiding collisions with teammates and remaining inside the mission area to ensure flight safety. It is therefore essential to design a justifiable penalty function that guides the UAVs toward safe flight. The penalty reward comprises a collision penalty and a boundary penalty.</p>
<p>The collision penalty reward is given when two pursuers are closer together than the safe distance <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. It is
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mn>10</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x2203;</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the Euclidean distance between two chasers.</p>
<p>Flying outside the mission area poses safety risks for UAVs. Moreover, a UAV flying too close to a boundary has part of its perception range fall into the non-mission area, where it is wasted. We define <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> as the maximum and minimum bounds imposed by the boundary on <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, respectively. The boundary penalty reward is
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003C;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003C;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mtext>and</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>10.</mml:mn></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>For pursuer <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>i</mml:mi></mml:math></inline-formula>, the penalty reward at each step is
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn><mml:mrow><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>+</mml:mo><mml:mn>0.5</mml:mn><mml:mrow><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The final reward function can be formulated as
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mrow><mml:msup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s3_2_5">
<label>3.2.5</label>
<title>Problem Formulation</title>
<p>The goal is to obtain, through centralized training with access to the states and actions of all UAVs, an optimal policy that each UAV executes in a decentralized manner. Each UAV learns to select optimal flight actions from its local observations to complete the pursuit mission. UAV <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>i</mml:mi></mml:math></inline-formula> uses the policy <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> to select the action <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> based on the current observation <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. The problem can be stated as
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:munder><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mrow></mml:munder><mml:mspace width="thinmathspace" /><mml:mi>&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:math></inline-formula> is the joint policy; <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi mathvariant="normal">&#x03A9;</mml:mi></mml:math></inline-formula> is the collection of policies of pursuers; <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mi>&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the expected discount return, and <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mi>&#x03B7;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B6;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, and <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B6;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the distribution of the initial state <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>. 
The optimal joint policy <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> of the pursuers in a fully cooperative multi-agent task is expressed as
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x220F;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:math></disp-formula></p>
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Multi-Agent Attention Proximal Policy Optimization</title>
<p>The framework of algorithm MA2PPO is depicted in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. Each UAV has a centralized critic network and a decentralized actor network. In the centralized training, <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> is embedded as <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is embedded as <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> which is fed into the MHA to get others&#x2019; levels of attention <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and attention weights <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The critics take <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> as an input to output <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>Q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. 
The <inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mi>Q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> are the input of dynamic decoupling used to get the advantage function updating the actor. In the decentralized execution, each UAV of the group uses its participant model to select actions to complete the pursuit task based only on its local observations. Although all UAVs perform pursuit in a decentralized manner, the participant models are trained by collaborating in a centralized manner. The actions that UAVs select during execution are still cooperative.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The framework of algorithm MA2PPO</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-2.tif"/>
</fig>
<sec id="s4_1">
<label>4.1</label>
<title>Multi-Head Attention</title>
<p>The centralized critics receive joint observations <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mtext>=</mml:mtext><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> and joint actions <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mtext>=</mml:mtext><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> to calculate Q-value function <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for UAV <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:mi>i</mml:mi></mml:math></inline-formula>. The UAVs&#x2019; set other than UAV <inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:mi>i</mml:mi></mml:math></inline-formula> is represented as <inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mo>&#x2216;</mml:mo><mml:mi>i</mml:mi></mml:math></inline-formula>, which is indexed by <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:mi>j</mml:mi></mml:math></inline-formula>. The observation-action function of UAV <inline-formula id="ieqn-161"><mml:math id="mml-ieqn-161"><mml:mi>i</mml:mi></mml:math></inline-formula><inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mtext>&#x00A0;</mml:mtext><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03C8;</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> includes consideration for other UAVs, which can be expressed as
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a two-layer multi-layer perceptron (MLP); <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a one-layer MLP; <inline-formula id="ieqn-165"><mml:math id="mml-ieqn-165"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> represents the weighted sum of others&#x2019; attention degrees.</p>
<p>Calculating other UAVs&#x2019; levels of attention <inline-formula id="ieqn-166"><mml:math id="mml-ieqn-166"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and the attention weight <inline-formula id="ieqn-167"><mml:math id="mml-ieqn-167"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of pursuer <inline-formula id="ieqn-168"><mml:math id="mml-ieqn-168"><mml:mi>i</mml:mi></mml:math></inline-formula> is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. <inline-formula id="ieqn-169"><mml:math id="mml-ieqn-169"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-170"><mml:math id="mml-ieqn-170"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-171"><mml:math id="mml-ieqn-171"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are a set of learnable parameters for the attention mechanism. <inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> converts the observation embedding <inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> into queries, and the observation-action embedding <inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of UAV <inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:mi>j</mml:mi></mml:math></inline-formula> is transformed into keys by <inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The correlation between queries and keys is then calculated and normalized. To prevent the gradient from vanishing, the matching is scaled by the sizes of these two matrices, and the outcome is <inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. 
It represents the attention weight of UAV <inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:mi>i</mml:mi></mml:math></inline-formula> to UAV <inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:mi>j</mml:mi></mml:math></inline-formula> as <disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x221D;</mml:mo><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi></mml:math></inline-formula>(<inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:mrow><mml:mo>&#x22C5;</mml:mo></mml:mrow></mml:math></inline-formula>) is an exponential function with the natural constant <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:mi>e</mml:mi></mml:math></inline-formula> as the base.</p>
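<p>The score computation of Eq. (20), together with the value transformation and weighted aggregation formalized in Eq. (21) below, can be sketched in Python as follows; the matrix shapes and the leaky-ReLU slope are assumptions of the sketch:</p>
<preformat>
import numpy as np

def attention_summary(e_i, e_js, W_Q, W_K, W_V, slope=0.01):
    """Eqs. (20)-(21): bilinear attention of pursuer i over the other UAVs.

    The weight of UAV j is proportional to exp(e_j^T W_K^T W_Q e_i); the
    scores are scaled by the key dimension before the softmax, and x_i is
    the attention-weighted sum of the leaky-ReLU-transformed values."""
    q = W_Q @ e_i
    scores = np.array([(W_K @ e_j) @ q for e_j in e_js]) / np.sqrt(q.size)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                      # normalized weights, Eq. (20)
    values = []
    for e_j in e_js:                                 # v_j = leaky_relu(W_V e_j)
        z = W_V @ e_j
        values.append(np.where(z > 0, z, slope * z))
    x_i = sum(a * v for a, v in zip(alpha, values))  # weighted sum, Eq. (21)
    return x_i, alpha
</preformat>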
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Calculating other UAVs&#x2019; levels of attention <inline-formula id="ieqn-197"><mml:math id="mml-ieqn-197"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and the attention weight <inline-formula id="ieqn-198"><mml:math id="mml-ieqn-198"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of pursuer <inline-formula id="ieqn-199"><mml:math id="mml-ieqn-199"><mml:mi>i</mml:mi></mml:math></inline-formula></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-3.tif"/>
</fig>
<p>The observation-action embedding <inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of UAV <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:mi>j</mml:mi></mml:math></inline-formula> is then transformed into a value with <inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. To represent the value of UAV <inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:mi>j</mml:mi></mml:math></inline-formula>, we normalize the value of each UAV to get <inline-formula id="ieqn-187"><mml:math id="mml-ieqn-187"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> which is a weighted compressed information vector. The attention of UAV <inline-formula id="ieqn-188"><mml:math id="mml-ieqn-188"><mml:mi>i</mml:mi></mml:math></inline-formula> to UAV <inline-formula id="ieqn-189"><mml:math id="mml-ieqn-189"><mml:mi>j</mml:mi></mml:math></inline-formula> is
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mtext>=</mml:mtext></mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-190"><mml:math id="mml-ieqn-190"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is leaky rectified linear units (leaky ReLU); <inline-formula id="ieqn-191"><mml:math id="mml-ieqn-191"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. We employ MHA to calculate the attention of UAV <inline-formula id="ieqn-192"><mml:math id="mml-ieqn-192"><mml:mi>i</mml:mi></mml:math></inline-formula> to other UAVs by weighting the data of others into <italic>N</italic> separate heads of attention concurrently. 
Combining the information-weighted calculation results of all attention heads, we obtain a fixed-length vector <inline-formula id="ieqn-193"><mml:math id="mml-ieqn-193"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-194"><mml:math id="mml-ieqn-194"><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a learnable parameter matric for data projections; <inline-formula id="ieqn-195"><mml:math id="mml-ieqn-195"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi mathvariant="normal">&#x2216;</mml:mi><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the result of attention head <inline-formula id="ieqn-196"><mml:math id="mml-ieqn-196"><mml:mi>k</mml:mi></mml:math></inline-formula>.</p>
<p>Note that all UAVs share the weights used to extract the queries, keys, and values of each attention head, which promotes the use of a common embedding space. The centralized set of attentional critics is used by UAVs with the same goal. Owing to parameter sharing, our method performs well in scenarios where UAVs have different rewards but similar features.</p>
<p>The number of UAVs can change in cooperative pursuit scenarios due to collisions or crashes, and some algorithms are unable to handle such contingencies. MA2PPO feeds <inline-formula id="ieqn-200"><mml:math id="mml-ieqn-200"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> together with the observation of UAV <inline-formula id="ieqn-201"><mml:math id="mml-ieqn-201"><mml:mi>i</mml:mi></mml:math></inline-formula> into the critic network as input and directly concatenates the weighted results of the multiple heads attending to the other UAVs. A feed-forward layer with a nonlinear activation then combines the contributions of the heads. As a result, scenarios with a non-fixed number of UAVs can be handled, which broadens the applicability of the algorithm.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Dynamic Decoupling</title>
<p>UAVs use MHA to pay varying amounts of attention to other UAVs. When pursuer <inline-formula id="ieqn-202"><mml:math id="mml-ieqn-202"><mml:mi>i</mml:mi></mml:math></inline-formula> pays no attention at all to pursuer <inline-formula id="ieqn-203"><mml:math id="mml-ieqn-203"><mml:mi>j</mml:mi></mml:math></inline-formula>, we can temporarily separate them and assume that the expected reward of pursuer <inline-formula id="ieqn-204"><mml:math id="mml-ieqn-204"><mml:mi>i</mml:mi></mml:math></inline-formula> is not affected by pursuer <inline-formula id="ieqn-205"><mml:math id="mml-ieqn-205"><mml:mi>j</mml:mi></mml:math></inline-formula>. We call this dynamic decoupling, which breaks the large-scale cooperative multi-UAV problem into multiple sets of decoupled sub-problems.</p>
<p>The relevant set <inline-formula id="ieqn-206"><mml:math id="mml-ieqn-206"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of pursuer <inline-formula id="ieqn-207"><mml:math id="mml-ieqn-207"><mml:mi>i</mml:mi></mml:math></inline-formula> is defined as the subset of UAVs that have an impact on pursuer <inline-formula id="ieqn-208"><mml:math id="mml-ieqn-208"><mml:mi>i</mml:mi></mml:math></inline-formula> at time <inline-formula id="ieqn-209"><mml:math id="mml-ieqn-209"><mml:mi>t</mml:mi></mml:math></inline-formula>, and its implicit definition is
<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>&#x03C4;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The expected future reward of pursuer <inline-formula id="ieqn-210"><mml:math id="mml-ieqn-210"><mml:mi>i</mml:mi></mml:math></inline-formula> then depends only on the observations and actions of the UAVs in its relevant set. The coalition strategy of all UAVs determines the relevant set and therefore appears as its subscript. It should be noted that <inline-formula id="ieqn-211"><mml:math id="mml-ieqn-211"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> varies over time. When irrelevant teammates are ignored in the learning update of pursuer <inline-formula id="ieqn-212"><mml:math id="mml-ieqn-212"><mml:mi>i</mml:mi></mml:math></inline-formula>, this can be viewed as decomposing a large-scale cooperative multi-UAV problem into smaller ones, with only <inline-formula id="ieqn-213"><mml:math id="mml-ieqn-213"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> UAVs involved at each time step <inline-formula id="ieqn-214"><mml:math id="mml-ieqn-214"><mml:mi>t</mml:mi></mml:math></inline-formula>. Dynamic decoupling allocates the pursuers to subgroups that are overlapping (i.e., a UAV might belong to more than one subgroup at once) and dynamic (i.e., relevant set assignments change across time steps), in contrast to static decomposition, which would assign UAVs to fixed subgroups.</p>
<p>Dynamic decoupling addresses the credit assignment problem, since credit is distributed only to the UAVs that affect the rewards of UAV <inline-formula id="ieqn-215"><mml:math id="mml-ieqn-215"><mml:mi>i</mml:mi></mml:math></inline-formula>. We also expect dynamic decoupling to lower the variance of policy gradient estimation, as only smaller UAV subsets are considered and less noise is introduced. In a policy gradient-based algorithm, this decoupling allows the contribution of pursuer <inline-formula id="ieqn-216"><mml:math id="mml-ieqn-216"><mml:mi>j</mml:mi></mml:math></inline-formula>&#x2019;s returns to be removed from the gradient estimate of pursuer <inline-formula id="ieqn-217"><mml:math id="mml-ieqn-217"><mml:mi>i</mml:mi></mml:math></inline-formula> without introducing bias.</p>
<p>UAV <inline-formula id="ieqn-218"><mml:math id="mml-ieqn-218"><mml:mi>j</mml:mi></mml:math></inline-formula> is not included in the relevant set of UAV <inline-formula id="ieqn-219"><mml:math id="mml-ieqn-219"><mml:mi>i</mml:mi></mml:math></inline-formula> if its predicted future return does not depend on the actions of UAV <inline-formula id="ieqn-220"><mml:math id="mml-ieqn-220"><mml:mi>i</mml:mi></mml:math></inline-formula>. This property demonstrates that it is reasonable to estimate the relevant set using a value function. Our estimate of the value of UAV <inline-formula id="ieqn-221"><mml:math id="mml-ieqn-221"><mml:mi>j</mml:mi></mml:math></inline-formula> does not depend on <inline-formula id="ieqn-222"><mml:math id="mml-ieqn-222"><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, so we infer <inline-formula id="ieqn-223"><mml:math id="mml-ieqn-223"><mml:mi>j</mml:mi><mml:mo>&#x2209;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. To accurately infer the pertinent set of each UAV, we combine a value function with an attention mechanism. The dependence on the specific UAV action can be &#x201C;turned off&#x201D; by the present state of the environment by setting the attention weight of the related action to zero.</p>
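<p>As a small illustration of how a relevant set can be read off the critic&#x2019;s attention weights, the sketch below thresholds a weight matrix (assumed here to be averaged over heads) at the value <italic>&#x03BE;</italic> used later in the actor update; the numbers in the example are purely illustrative.</p>
<preformat>
import numpy as np

def relevant_set(attn, i, xi):
    """Indices j in the estimated relevant set of pursuer i: teammates whose
    attention weight omega_ij exceeds the threshold xi."""
    return [j for j in range(attn.shape[0]) if j != i and attn[i, j] > xi]

# Example: with 4 pursuers, UAV 0 attends mostly to UAVs 1 and 3.
attn = np.array([[0.0, 0.5, 0.005, 0.495],
                 [0.4, 0.0, 0.3, 0.3],
                 [0.2, 0.3, 0.0, 0.5],
                 [0.5, 0.3, 0.2, 0.0]])
print(relevant_set(attn, 0, xi=0.01))   # -> [1, 3]
</preformat>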
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Actor Update</title>
<p>The actor network, represented by <inline-formula id="ieqn-224"><mml:math id="mml-ieqn-224"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, maps the individual observation <inline-formula id="ieqn-225"><mml:math id="mml-ieqn-225"><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of UAV <inline-formula id="ieqn-226"><mml:math id="mml-ieqn-226"><mml:mi>i</mml:mi></mml:math></inline-formula> to the mean and standard deviation vectors of a multivariate Gaussian distribution in continuous action spaces, or to a categorical action distribution in discrete action spaces, from which the actions are sampled. By removing the contributions of UAVs that are not in the relevant set <inline-formula id="ieqn-227"><mml:math id="mml-ieqn-227"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of UAV <inline-formula id="ieqn-228"><mml:math id="mml-ieqn-228"><mml:mi>i</mml:mi></mml:math></inline-formula>, the gradient update can be made simpler. The loss function of the actor of each UAV with parameter sharing is phrased as
<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>B</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003E;</mml:mo><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:munder><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:msubsup><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003E;</mml:mo><mml:mi>&#x03BE;</mml:mi></mml:mrow></mml:munder><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /><mml:msubsup><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-229"><mml:math id="mml-ieqn-229"><mml:msubsup><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the advantage of actual reward calculation using the related set. Note that the decoupling changes dynamically over time because the estimated relevant set <inline-formula id="ieqn-230"><mml:math id="mml-ieqn-230"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> changes over time. <inline-formula id="ieqn-231"><mml:math id="mml-ieqn-231"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents how much attention UAV <inline-formula id="ieqn-232"><mml:math id="mml-ieqn-232"><mml:mi>i</mml:mi></mml:math></inline-formula> pays when estimating its expected future return based on the observed behavior of UAV <inline-formula id="ieqn-233"><mml:math id="mml-ieqn-233"><mml:mi>j</mml:mi></mml:math></inline-formula>. The threshold we consider relevant for <inline-formula id="ieqn-234"><mml:math id="mml-ieqn-234"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is <inline-formula id="ieqn-235"><mml:math id="mml-ieqn-235"><mml:mi>&#x03BE;</mml:mi></mml:math></inline-formula>. On the first 150,000 steps, <inline-formula id="ieqn-236"><mml:math id="mml-ieqn-236"><mml:mi>&#x03BE;</mml:mi></mml:math></inline-formula> is initialized at 0 and increases linearly with a maximum of 0.01. This allows the training process to be stabilized by providing sufficient time to learn to correctly allocate attention weights. 
The remaining variables in this formula are the batch size <italic>B</italic> and the probability ratio <inline-formula id="ieqn-237"><mml:math id="mml-ieqn-237"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
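<p>A condensed PyTorch sketch of the clipped objective in Eq. (24), including the linear warm-up of <italic>&#x03BE;</italic> described above, is given below. The clip range &#x03B5; = 0.2 is a typical PPO value and an assumption here, and the tensor layout is illustrative rather than our actual implementation.</p>
<preformat>
import torch

def xi_schedule(step, warmup=150_000, xi_max=0.01):
    """Threshold xi: 0 at the start, growing linearly to xi_max over warmup steps."""
    return xi_max * min(step / warmup, 1.0)

def actor_objective(ratio, adv_ij, attn, xi, eps=0.2):
    """Clipped objective of Eq. (24). ratio: (B, N) importance ratios r_theta;
    adv_ij: (B, N, N) per-teammate advantages A_ij; attn: (B, N, N) weights omega_ij."""
    mask = (attn > xi).float()                       # keep only the relevant set
    adv = (adv_ij * mask).sum(dim=-1)                # sum over j with omega_ij above xi
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.minimum(ratio * adv, clipped * adv).mean()   # 1/(BN) double sum
</preformat>
<p>In practice, the optimizer ascends this objective, e.g., by minimizing its negative.</p>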
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Critic Update</title>
<p>We construct MHA-based central critic networks that gather the observations and actions of all UAVs. MHA produces a fixed-length vector <inline-formula id="ieqn-238"><mml:math id="mml-ieqn-238"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, the attention-weighted sum of the other UAVs&#x2019; information obtained by UAV <inline-formula id="ieqn-239"><mml:math id="mml-ieqn-239"><mml:mi>i</mml:mi></mml:math></inline-formula>. Then, <inline-formula id="ieqn-240"><mml:math id="mml-ieqn-240"><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and the observation embedding <inline-formula id="ieqn-242"><mml:math id="mml-ieqn-242"><mml:mrow><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> of UAV <inline-formula id="ieqn-241"><mml:math id="mml-ieqn-241"><mml:mi>i</mml:mi></mml:math></inline-formula> are fed into the critic network of UAV <inline-formula id="ieqn-243"><mml:math id="mml-ieqn-243"><mml:mi>i</mml:mi></mml:math></inline-formula> as input. Based on the acquired information, the centralized critic network outputs the <italic>Q</italic> value and attention weights, which are used to calculate the advantage function of the actor.</p>
<p>A centralized group of multi-headed attentional critics is used by UAVs with the same objective. As a result of parameter sharing, the centralized critic networks of all UAVs are updated simultaneously to minimize the joint loss function, which is
<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mi>L</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>B</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-244"><mml:math id="mml-ieqn-244"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mi>R</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula> is the discounted reward-to-go. The overall MA2PPO algorithm is presented in Algorithm 1.</p>
<fig id="fig-10">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-10.tif"/>
</fig>
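<p>For completeness, a minimal sketch of Eq. (25) and the discounted reward-to-go it regresses onto might look as follows; PyTorch tensors are assumed, and &#x03B3; = 0.99 matches Table 2.</p>
<preformat>
import torch

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: R_t = r_t + gamma * R_{t+1}, computed backwards."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)))

def critic_loss(q_values, targets):
    """Joint loss of Eq. (25): squared error between the centralized critics'
    Q estimates and the reward-to-go, averaged over the batch and the N UAVs."""
    return ((q_values - targets) ** 2).mean()
</preformat>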
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiment</title>
<p>We conduct experiments using the multi-UAV pursuit-evasion model established in <xref ref-type="sec" rid="s3_1">Section 3.1</xref>. Simulation demonstrates the convergence and effectiveness of the proposed algorithm. First, the simulation settings, training settings, and evaluation metrics of the proposed algorithm are introduced. Then, we compare the performance of MA2PPO with various baseline methods, including MAPPO, MAAC, and COMA, which are popular MARL algorithms. We also validate ablation algorithms to confirm the importance and contribution of each component; the ablation variants are MA2PPO without dynamic decoupling and MA2PPO without formation reward. Finally, we demonstrate the effectiveness, training efficiency, and scalability of the MA2PPO algorithm by analyzing the evaluation criteria together with results such as multi-UAV pursuit trajectories and attention visualizations.</p>
<sec id="s5_1">
<label>5.1</label>
<title>Simulation Setups</title>
<sec id="s5_1_1">
<label>5.1.1</label>
<title>Environment Settings</title>
<p>To evaluate the proposed approach, we develop a simulation platform that simulates the multi-UAV pursuit-evasion game.</p>
<p>We assume that all UAVs fly at the same altitude and that their airspace is limited to an area of 400 m &#x00D7; 400 m. UAVs can begin at any location within the permitted space; the initial locations of all UAVs are set at well-separated positions to prevent collisions and facilitate the training process. The modeled aircraft is a four-rotor UAV. A pursuing UAV flies at a fixed linear velocity of <inline-formula id="ieqn-264"><mml:math id="mml-ieqn-264"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>9</mml:mn></mml:math></inline-formula> m/s, while an escaping UAV flies at <inline-formula id="ieqn-265"><mml:math id="mml-ieqn-265"><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>11</mml:mn></mml:math></inline-formula> m/s. Both types of UAVs can fly at a range of angular velocities <inline-formula id="ieqn-266"><mml:math id="mml-ieqn-266"><mml:mi>&#x03C9;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mrow><mml:mtext>rad/s</mml:mtext></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mrow><mml:mtext>rad/s</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. The first-order differential equations of the UAVs&#x2019; velocity and position coordinates are solved numerically with the fourth-order Runge-Kutta method [<xref ref-type="bibr" rid="ref-43">43</xref>]. During each simulation time unit <inline-formula id="ieqn-267"><mml:math id="mml-ieqn-267"><mml:mi>t</mml:mi></mml:math></inline-formula>, every UAV determines its angular velocity through decision-making, and its new state is then computed. This approach is more precise than direct numerical computation with the Euler method.</p>
<p>The simulation time unit is set to <inline-formula id="ieqn-268"><mml:math id="mml-ieqn-268"><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0.15</mml:mn><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:mi mathvariant="normal">s</mml:mi></mml:mrow></mml:math></inline-formula>, during which all UAVs concurrently execute their selected maneuvers. The resulting motion states collectively define the system&#x2019;s next state.</p>
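<p>As an illustration of the state propagation just described, the sketch below integrates planar constant-speed kinematics (position plus heading, driven by the commanded angular velocity) over one simulation time unit with the classical fourth-order Runge-Kutta scheme; the state layout is an assumption consistent with the model above, not code from our platform.</p>
<preformat>
import numpy as np

def uav_derivatives(state, v, omega):
    """Kinematics of a fixed-speed UAV: state = (x, y, heading psi)."""
    x, y, psi = state
    return np.array([v * np.cos(psi), v * np.sin(psi), omega])

def rk4_step(state, v, omega, dt=0.15):
    """One fourth-order Runge-Kutta step over the simulation time unit t = 0.15 s."""
    k1 = uav_derivatives(state, v, omega)
    k2 = uav_derivatives(state + 0.5 * dt * k1, v, omega)
    k3 = uav_derivatives(state + 0.5 * dt * k2, v, omega)
    k4 = uav_derivatives(state + dt * k3, v, omega)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: a pursuer at the origin heading east, turning at 1 rad/s.
print(rk4_step(np.array([0.0, 0.0, 0.0]), v=9.0, omega=1.0))
</preformat>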
<p>It is assumed that the maximum communication distance of each UAV is 50 m, the maximum detection distance is 20 m, and the capture range of a pursuer is 15 m. When the distance between the evader and at least one pursuer is less than 15 m, the pursuit task is deemed accomplished, because the air combat simulation platform does not model close-range combat missiles. The safe distance between UAVs to prevent collisions is 30 m. UAVs modify their flight directions to avoid collisions by sharing information with nearby companions.</p>
<p>We train the pursuers using the proposed algorithm, while an escapee can either learn with a DRL model or simply follow fixed rules. The proposed approach controls the angular velocity of the pursuers. We adopt a clear and effective rule for the rule-based escapee: within the permissible range of angular velocities, the escapee greedily selects the angular velocity that takes it away from the pursuers fastest. The weights of the rewards <inline-formula id="ieqn-269"><mml:math id="mml-ieqn-269"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-270"><mml:math id="mml-ieqn-270"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are set to 0.1 and 0.002, respectively. The environmental parameters are shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
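<p>One plausible reading of this greedy rule is sketched below: the escapee evaluates a grid of admissible angular velocities and keeps the one whose one-step lookahead maximizes the distance to the nearest pursuer. The candidate grid and the one-step horizon are our assumptions; the rule itself does not specify them.</p>
<preformat>
import numpy as np

def greedy_escape_omega(evader, pursuers, v_e=11.0, dt=0.15, n_candidates=21):
    """Rule-based escapee sketch. evader = (x, y, heading psi);
    pursuers = iterable of (x, y) positions; admissible omega in [-3, 3] rad/s."""
    x, y, psi = evader
    best_omega, best_dist = 0.0, -np.inf
    for omega in np.linspace(-3.0, 3.0, n_candidates):   # candidate grid (illustrative)
        psi_next = psi + omega * dt
        nx = x + v_e * dt * np.cos(psi_next)
        ny = y + v_e * dt * np.sin(psi_next)
        d = min(np.hypot(nx - px, ny - py) for px, py in pursuers)
        if d > best_dist:                                # keep the most evasive turn
            best_omega, best_dist = omega, d
    return best_omega
</preformat>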
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Environment parameter settings</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Entity</th>
<th>Variable</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Environment</td>
<td>Shape</td>
<td>Square</td>
</tr>
<tr>
<td></td>
<td>Size (m)</td>
<td>400 &#x00D7; 400</td>
</tr>
<tr>
<td>Pursuer</td>
<td>Total number</td>
<td>4</td>
</tr>
<tr>
<td></td>
<td>Speed (m/s)</td>
<td>9</td>
</tr>
<tr>
<td></td>
<td>Angular velocity (rad/s)</td>
<td>[&#x2212;3, 3]</td>
</tr>
<tr>
<td></td>
<td>Maximum detection distance (m)</td>
<td>20</td>
</tr>
<tr>
<td></td>
<td>Communication range (m)</td>
<td>50</td>
</tr>
<tr>
<td></td>
<td>Safe distance between pursuers (m)</td>
<td>30</td>
</tr>
<tr>
<td>Escapee</td>
<td>Total number</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>Speed (m/s)</td>
<td>11</td>
</tr>
<tr>
<td></td>
<td>Angular velocity (rad/s)</td>
<td>[&#x2212;3, 3]</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5_1_2">
<label>5.1.2</label>
<title>Training Settings</title>
<p>All experiments in the training process are performed on a workstation with an Intel i9-10850K CPU and an NVIDIA GeForce RTX 3060 GPU. We use PyTorch for the network implementation.</p>
<p>The algorithm is trained for 5,000,000 steps with a fixed limit of 200 steps per episode. The environment and UAVs are reset when the confrontation is finished or the maximum number of steps per episode is reached.</p>
<p>All actor networks in MA2PPO share the same structure, as do all critic networks. The actor networks are fully connected neural networks with two hidden layers; we parameterize the actors as MLPs with 128 units per layer, and rectified linear unit (ReLU) functions serve as the activation in both hidden layers. The network layers of the critics have 256 neurons, with Leaky ReLU employed as the activation function in the hidden layers; the output layer is merely a linear layer with no activation function. The Adam optimizer is adopted for both the actor network and the critic network, with a learning rate of 0.0005. The remaining hyperparameters are a discount factor <inline-formula id="ieqn-271"><mml:math id="mml-ieqn-271"><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow></mml:math></inline-formula> of 0.99 and a GAE lambda of 0.95. <xref ref-type="table" rid="table-2">Table 2</xref> depicts the hyperparameter configurations of our algorithm.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Hyperparameter configurations of the algorithm</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Hyperparameters</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>Max episode</td>
<td>25,000</td>
</tr>
<tr>
<td>Max step</td>
<td>200</td>
</tr>
<tr>
<td>Actor learning rate</td>
<td>0.0005</td>
</tr>
<tr>
<td>Critic learning rate</td>
<td>0.0005</td>
</tr>
<tr>
<td>Discount factor</td>
<td>0.99</td>
</tr>
<tr>
<td>GAE lambda</td>
<td>0.95</td>
</tr>
<tr>
<td>Optimizer</td>
<td>Adam</td>
</tr>
</tbody>
</table>
</table-wrap>
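<p>A minimal PyTorch sketch of the shared actor and the critic trunk with the sizes and activations described above might look as follows. The observation dimension and the Gaussian action head are illustrative assumptions, and the attention module of the critic is omitted for brevity.</p>
<preformat>
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Two 128-unit ReLU hidden layers; outputs the mean and standard deviation
    of a Gaussian over the angular-velocity action."""
    def __init__(self, obs_dim, act_dim=1, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_std.exp()

class CriticTrunk(nn.Module):
    """256-unit Leaky ReLU hidden layers and a plain linear output for Q."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z):
        return self.net(z)

actor = Actor(obs_dim=16)                                   # obs_dim is illustrative
optimizer = torch.optim.Adam(actor.parameters(), lr=5e-4)   # Adam, as in Table 2
</preformat>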
</sec>
<sec id="s5_1_3">
<label>5.1.3</label>
<title>Baselines</title>
<p>We choose MAAC, COMA, and MAPPO as the benchmark algorithms against which to compare the performance of the proposed algorithm on the multi-UAV pursuit game. The implementation details of the benchmark algorithms are given below.</p>
<p><bold>MAAC:</bold> The MAAC algorithm is a classical MARL algorithm with an attention mechanism, built on the soft actor-critic (SAC) and CTDE. The actor networks have 128 hidden layer units with ReLU as their activation function, and Leaky ReLU is the activation function of the network&#x2019;s output layer. The critic networks have 256 hidden layer units; their activation function is Leaky ReLU, and the output layer of the network is activated with the ReLU function. The actor and critic networks are optimized using the Adam optimizer with a learning rate of 0.001. The target actor and target critic networks both employ soft updates. Other hyperparameters include a buffer size of 1000 K, a batch size of 1024, and a discount factor of 0.9.</p>
<p><bold>COMA:</bold> COMA focuses on the multi-agent credit assignment problem. It utilizes a single central network to predict the Q value of each agent with distinct forward passes. The policy network consists of an input layer, two hidden recurrent neural network (RNN) layers, and an output layer. The RNN hidden layers use GRU units and have 64 hidden layer units. The critic network is a four-layer fully connected linear network with 128 hidden layer units. Both the policy and the critic use ReLU activation functions. COMA uses the RMSProp optimizer. The actor network and the critic network have learning rates of 0.0001 and 0.001, respectively. The discount factor is 0.99.</p>
<p><bold>MAPPO:</bold> It is a widely used MARL algorithm that has demonstrated strong performance in various applications. Its hyperparameters are similar to those of MA2PPO, and the parameters for various implementation details are set to the same as those from [<xref ref-type="bibr" rid="ref-44">44</xref>].</p>
</sec>
<sec id="s5_1_4">
<label>5.1.4</label>
<title>Evaluation Metrics</title>
<p>We employ three evaluation metrics to demonstrate the effectiveness of the proposed approach.</p>
<p><bold>Average episode reward (AER):</bold> The average episode reward is a significant evaluation criterion in MARL. It measures the effectiveness of the proposed algorithm&#x2019;s training by tracking its growth and convergence. To demonstrate the optimization process of the model, it is crucial to consider the cumulative reward value of the confrontation and maximize it.</p>
<p><bold>Average pursuit success rate (APSR):</bold> The average pursuit success rate is the ratio of the sum of successful steps to all steps in an episode. It is used to evaluate the quality of the pursuit. The algorithm performs better with a higher APSR. We define it as
<disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mi>S</mml:mi><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd /><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03BA;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mtext>if</mml:mtext></mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03BA;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-272"><mml:math id="mml-ieqn-272"><mml:mrow><mml:mo>|</mml:mo><mml:mi>H</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:math></inline-formula> represents the number of steps in the episodes; <inline-formula id="ieqn-273"><mml:math id="mml-ieqn-273"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03BA;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is the number of UAVs that are successfully pursued in the time step <inline-formula id="ieqn-274"><mml:math id="mml-ieqn-274"><mml:mi>t</mml:mi></mml:math></inline-formula>.</p>
</sec>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Performance Analysis</title>
<p>In this section, we present the simulations and analysis of the comparison with baseline algorithms, ablation experiments, scalability tests, and attention visualization of MA2PPO based on the environment settings, hyperparameter settings, and evaluation criteria given in <xref ref-type="sec" rid="s5_1">Section 5.1</xref>.</p>
<sec id="s5_2_1">
<label>5.2.1</label>
<title>Comparison with Baseline Algorithms</title>
<p>We train MA2PPO, MAPPO, MAAC, and COMA in the same environment to evaluate and analyze their performance in learning strategies for the multi-UAV pursuit environment.</p>
<p><xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows that MA2PPO outperforms other algorithms in terms of the AER, the APSR demonstrating its superior and steady performance in the multi-UAV pursuit game. The AER of the four methods is displayed in <xref ref-type="fig" rid="fig-4">Fig. 4a</xref>. As evidenced by the fact that the AER values of MA2PPO and MAPPO dramatically increase at the beginning of the training period, and both obtain respectably high scores from negative starting values. The AER curve illustrates that both algorithms have learned the pursuit strategy, but MA2PPO has learned it better. For MA2PPO, the AER value can be roughly converged to 50,000 steps. MAPPO takes twice as many convergence steps as MA2PPO, and its AER has a certain gap with MA2PPO. MAAC and COMA both performed poorly overall, and MA2PPO has significantly surpassed them in the AER.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Comparisons of MA2PPO with MAPPO, MAAC, and COMA in terms of the AER and APSR. (<bold>a</bold>) The AER; (<bold>b</bold>) The APSR</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-4.tif"/>
</fig>
<p>We can see that the trend of the APSR curve shown in <xref ref-type="fig" rid="fig-4">Fig. 4b</xref> resembles that of the AER curve. At the beginning, the pursuers could not complete the task well. The APSR curve increases notably as the pursuers, using the MA2PPO algorithm, begin chasing the escapee effectively as training progresses. At roughly 50,000 steps, the success rate of the pursuers is nearly 98%. MAPPO can also learn successful pursuit strategies, but its average success rate is lower than that of MA2PPO. The APSR values of COMA and MAAC show no significant increase.</p>

</sec>
<sec id="s5_2_2">
<label>5.2.2</label>
<title>Ablative Analysis</title>
<p>Two ablative versions of MA2PPO, namely MA2PPO without dynamic decoupling and MA2PPO without formation reward, are also compared and examined as baselines to show the efficacy of the individual components. MA2PPO without dynamic decoupling highlights the importance of the formation reward: built on PPO and our proposed framework, it replaces the simple distance reward with the novel reward function we presented. MA2PPO without formation reward is intended to confirm how dynamic decoupling affects the resolution of credit assignment issues: only the dynamic decoupling portion of the algorithm is kept, and the reward function is just the straightforward distance reward. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> displays the training results of MA2PPO and its ablative versions. The comparisons on the two evaluation criteria demonstrate that MA2PPO performs better than the other two algorithms.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Comparisons of the AER and APSR for MA2PPO, MA2PPO without dynamic decoupling, and MA2PPO without formation reward. (<bold>a</bold>) The AER; (<bold>b</bold>) The APSR</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-5.tif"/>
</fig>
<p><bold>Ablation Study of Dynamic Decoupling:</bold> We employ MHA to decouple the multi-UAV cooperative pursuit problem, as discussed in <xref ref-type="sec" rid="s4_2">Section 4.2</xref>. Dynamic decoupling is effective at resolving the credit assignment problem and promoting active collaboration between nearby UAVs. Through the attention mechanism, UAVs are incentivized to approach one another when others are helpful for their task. Conversely, collisions are penalized in the reward function when UAVs are too close to one another, and in situations where they need to avoid each other, UAVs dynamically decouple from one another. These incentives drive each UAV: it is critical that the UAVs coordinate their motivations, keeping an eye on each other while avoiding collisions, and effective tracking of an escapee depends on keeping the incentives of the individual UAVs aligned. The proposed MA2PPO algorithm facilitates this by enabling UAVs to exhibit a wide range of cooperative behaviors that adapt easily to different situations.</p>
<p>In <xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <xref ref-type="fig" rid="fig-7">7</xref>, we illustrate the process of dynamic decoupling by visualizing the trajectories of all UAVs and the corresponding attention weights. The trajectories of all UAVs at the 13th step after initialization is shown in <xref ref-type="fig" rid="fig-6">Fig. 6a</xref>. The escapee flies to the upper left as the pursuers begin their hunt in accordance with the MA2PPO method they have learned. The flight directions of UAVs 1, 2, and 4 make it easier to complete the chase. As seen in <xref ref-type="fig" rid="fig-7">Fig. 7a</xref>, UAV 1 has a higher attention weight value for UAVs 2 and 4 than UAV 3. UAVs 2 and 3 are more practical than UAV 1 to work with to accomplish the interception of escapees when the distance between them and their respective flight directions is considered. Therefore, UAV 2 pays more attention to UAV 3 than to UAV 1 while UAV 2 is also focused on UAV 1. Similarly, the attention weights of UAVs 3 and 4 in <xref ref-type="fig" rid="fig-7">Fig. 7a</xref> support our analysis of UAVs 1 and 2.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>The trajectories of all UAVs under different steps. (<bold>a</bold>) The trajectories of all UAVs at the 13th step. (<bold>b</bold>) The trajectories of all UAVs at the 51st step. (<bold>c</bold>) The trajectories of all UAVs at the 98th step (The trajectory color of each UAV transitions from light to dark, representing its movement from the start position to the end position, with the lightest color indicating the starting point and the darkest color indicating the end point. The UAV icons mark each UAV&#x2019;s final position at the corresponding time step)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-6.tif"/>
</fig><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>The attention weights of the pursuers under different steps. (<bold>a</bold>) The attention weights of the pursuers at the 13th step. (<bold>b</bold>) The attention weights of the pursuers at the 51st step. (<bold>c</bold>) The attention weights of the pursuers at the 98th step (The color scale indicates the magnitude of the attention weights: lighter colors represent lower weights, while darker colors represent higher attention weights)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-7.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6b</xref> shows the trajectories of the multi-UAV at the 51st step. UAVs 1, 2, and 4 may be seen flying in the direction of the escapee as the hunt goes on. UAV 3 is currently near the direction of the escapee. <xref ref-type="fig" rid="fig-7">Fig. 7b</xref> demonstrates that UAVs 1 and 2 both recognize that working with UAV 3 is more favorable to completing the hunt. Hence, the attention weight of UAV 3 is increased by UAV 1 while the attention weight of UAV 4 is decreased. UAV 2 also learned this, which has increased attention to UAVs 1 and 3, reducing attention to UAV 4. UAVs 1, 2, and 3 consequently dynamically disconnect UAV 4 and concentrate more on each other.</p>
<p>The trajectories of the pursuers in <xref ref-type="fig" rid="fig-6">Fig. 6c</xref> confirm the analysis of <xref ref-type="fig" rid="fig-7">Fig. 7b</xref>. UAVs 1, 2, and 3 fly straight toward the escapee, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6c</xref>, preparing to form a round-up state and complete the pursuit of the evader. In <xref ref-type="fig" rid="fig-7">Fig. 7c</xref>, UAVs 1, 2, and 3 assign higher attention weights to each other, whereas UAV 4 receives almost no attention. These results demonstrate that dynamic decoupling enables the pursuers to determine in real time which companions are more beneficial to the pursuit, pay them more attention, and work more closely with them. This approach decomposes the large-scale cooperative problem into smaller, decoupled subproblems and eliminates lazy agents, motivating the pursuers to collaborate more effectively.</p>
<p><bold>Ablation Study of Formation Reward:</bold> We analyze the effect of giving each pursuer the formation reward, which is designed to promote an effective pursuit formation at each time step. As seen in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, the AER of MA2PPO is higher than that of MA2PPO without the formation reward, and the average pursuit success rate also rises when the formation reward is used. By reshaping each UAV&#x2019;s original reward with the formation reward contributed by neighboring UAVs, MA2PPO enables effective multi-UAV pursuit cooperation. The formation reward also compensates for the drawbacks of the distance reward, strengthening system performance and boosting individual UAV rewards, which further underscores the value of UAV cluster cooperation.</p>
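<p>To make the angle-based shaping concrete, the sketch below rewards pursuers for spreading evenly around the evader in 2D, in the spirit of the relative angular dispersion of pursuers with respect to the evader described later in this paper. The functional form is our illustrative assumption, not the exact reward used in the experiments.</p>
<code language="python"><![CDATA[
import numpy as np

def formation_reward(pursuer_xy: np.ndarray, evader_xy: np.ndarray) -> float:
    """pursuer_xy: (N, 2) pursuer positions; evader_xy: (2,) evader position."""
    rel = pursuer_xy - evader_xy
    # Bearing of each pursuer as seen from the evader, sorted around the circle.
    angles = np.sort(np.arctan2(rel[:, 1], rel[:, 0]))
    # Angular gaps between consecutive pursuers, including the wrap-around gap.
    gaps = np.diff(np.append(angles, angles[0] + 2.0 * np.pi))
    # A perfect encirclement has equal gaps of 2*pi/N; penalize any deviation.
    ideal = 2.0 * np.pi / len(angles)
    return -float(np.sum(np.abs(gaps - ideal)))
]]></code>
<p>Under this form, the reward attains its maximum of zero exactly when the pursuers&#x2019; bearings are evenly spaced around the evader, i.e., when the encirclement is complete, and it decreases as the pursuers bunch together on one side.</p>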

<p>UAVs use the MA2PPO method to track down the escapee cooperatively, and visualizing the trajectories of several tracking processes shows intuitively whether they have learned to form a formation. As seen in <xref ref-type="fig" rid="fig-8">Fig. 8a</xref>, the pursuers apply their cooperative chasing policies in response to the escapee&#x2019;s unexpected turn and successfully catch up with it. UAV 1 keeps close track of the evader even through the sudden turn, while UAVs 2, 3, and 4 fly toward the evader, establish a hunting state, and effectively intercept it. <xref ref-type="fig" rid="fig-8">Fig. 8b</xref> and <xref ref-type="fig" rid="fig-8">c</xref> exhibit the trajectories of the pursuers in additional scenarios. When the escapee flees quickly in one direction in <xref ref-type="fig" rid="fig-8">Fig. 8b</xref> and <xref ref-type="fig" rid="fig-8">c</xref>, the pursuers coordinate their actions and pursue the escapee from various angles. While the besieged state of the escapee is not yet met, two pursuers closely follow it while the others disperse around them. The pursuers then construct an encirclement around the evader to prevent it from escaping; once the evader is inside the encirclement and the encirclement condition is fulfilled, they coordinate their actions to move from encirclement to capture, closing in on the escapee until it is apprehended. The results demonstrate that the pursuers successfully handle the chase process in its various states: the formation reward incentivizes them to encircle the escapee until the siege condition is met and then tighten the siege until the escapee is apprehended, enabling closer multi-UAV cooperation.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Visualizing the trajectories of various cooperative pursuit processes (The legend content, the gradient trajectory colors of the UAVs, and the meaning of the UAV icons are all the same as those in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-8.tif"/>
</fig>
<p><bold>Ablation Study of Reward Function Weights:</bold> To systematically evaluate the impact of the formation reward weight <inline-formula id="ieqn-275"><mml:math id="mml-ieqn-275"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and the distance reward weight <inline-formula id="ieqn-276"><mml:math id="mml-ieqn-276"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> on learning performance, we conduct an ablation study using the APSR as the performance metric while keeping all other settings unchanged. Five combinations of weights are tested, each keeping a fixed 50:1 ratio between the formation and distance weights: <inline-formula id="ieqn-277"><mml:math id="mml-ieqn-277"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.025</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-278"><mml:math id="mml-ieqn-278"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.0005</mml:mn></mml:math></inline-formula>; <inline-formula id="ieqn-279"><mml:math id="mml-ieqn-279"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.05</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-280"><mml:math id="mml-ieqn-280"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.001</mml:mn></mml:math></inline-formula>; <inline-formula id="ieqn-281"><mml:math id="mml-ieqn-281"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-282"><mml:math id="mml-ieqn-282"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.002</mml:mn></mml:math></inline-formula>; <inline-formula id="ieqn-283"><mml:math id="mml-ieqn-283"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.15</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-284"><mml:math id="mml-ieqn-284"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.003</mml:mn></mml:math></inline-formula>; <inline-formula id="ieqn-285"><mml:math id="mml-ieqn-285"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.2</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-286"><mml:math id="mml-ieqn-286"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.004</mml:mn></mml:math></inline-formula>.</p>
<p>As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, the algorithm achieves the best performance in terms of convergence speed and final success rate when <inline-formula id="ieqn-287"><mml:math id="mml-ieqn-287"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-288"><mml:math id="mml-ieqn-288"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.002</mml:mn></mml:math></inline-formula>. We observe that a moderate <inline-formula id="ieqn-289"><mml:math id="mml-ieqn-289"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> encourages agents to collaboratively encircle the evader without causing premature clustering, while the distance weight <inline-formula id="ieqn-290"><mml:math id="mml-ieqn-290"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> ensures that pursuing the evader remains the primary incentive. This balance promotes rapid learning in the early stages and sustains a high success rate later in training. These results indicate that the chosen weights effectively balance cooperative encirclement and target approach. Therefore, we set <inline-formula id="ieqn-291"><mml:math id="mml-ieqn-291"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-292"><mml:math id="mml-ieqn-292"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> to 0.1 and 0.002, respectively.</p>
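<p>For reference, a weighted sum is the natural way these selected weights enter the shaped per-UAV reward. The sketch below is a hedged illustration under that assumption; r_formation and r_distance are placeholders for the paper&#x2019;s individual reward terms, and r_base collects the remaining terms (e.g., capture and collision rewards).</p>
<code language="python"><![CDATA[
# Weights selected by the ablation study.
OMEGA_F = 0.1    # formation reward weight
OMEGA_D = 0.002  # distance reward weight

def shaped_reward(r_base: float, r_formation: float, r_distance: float) -> float:
    """Combine the shaping terms with the ablation-selected weights."""
    return r_base + OMEGA_F * r_formation + OMEGA_D * r_distance

# Example: with equal raw magnitudes, the formation term outweighs the
# distance term by the fixed factor OMEGA_F / OMEGA_D = 50.
print(shaped_reward(0.0, r_formation=1.0, r_distance=1.0))  # 0.102
]]></code>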
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>The APSR curves under different combinations of the formation reward weight <inline-formula id="ieqn-293"><mml:math id="mml-ieqn-293"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and the distance reward weight <inline-formula id="ieqn-294"><mml:math id="mml-ieqn-294"><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_67117-fig-9.tif"/>
</fig>
</sec>
<sec id="s5_2_3">
<label>5.2.3</label>
<title>Scalability Testing</title>
<p>We test the ability of the strategies learned by the MA2PPO algorithm to scale to various situations. We train all six of the previously compared algorithms in settings with different numbers of pursuing and escaping UAVs, and evaluate them using the two criteria listed in <xref ref-type="sec" rid="s5_1">Section 5.1</xref>. <xref ref-type="table" rid="table-3">Table 3</xref> displays the statistics of the test results.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Test results of evaluation criteria obtained by performing multiple algorithms in different environments</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Environ-ment Map</th>
<th align="center">Pursuers vs. Evaders</th>
<th align="center">Evaluation Metrics</th>
<th align="center">MA2PPO</th>
<th align="center">MA2PPO without dynamical decoupling</th>
<th align="center">MA2PPO without formation reward</th>
<th align="center">MAPPO</th>
<th align="center">MAAC</th>
<th align="center">COMA</th>
</tr>
</thead>
<tbody>
<tr>
<td>400 &#x00D7; 400</td>
<td>2 vs. 1</td>
<td>APSR</td>
<td><bold>0.98</bold></td>
<td>0.94</td>
<td>0.88</td>
<td>0.82</td>
<td>0.52</td>
<td>0.5</td>
</tr>
<tr>
<td></td>
<td></td>
<td>AER</td>
<td><bold>118.54</bold></td>
<td>112.76</td>
<td>55.32</td>
<td>43.17</td>
<td>&#x2212;12.38</td>
<td>&#x2212;194.04</td>
</tr>
<tr>
<td>400 &#x00D7; 400</td>
<td>4 vs. 1</td>
<td>APSR</td>
<td><bold>0.98</bold></td>
<td>0.95</td>
<td>0.9</td>
<td>0.86</td>
<td>0.55</td>
<td>0.54</td>
</tr>
<tr>
<td></td>
<td></td>
<td>AER</td>
<td><bold>194.63</bold></td>
<td>170.71</td>
<td>91.99</td>
<td>73.92</td>
<td>10.82</td>
<td>&#x2212;126.27</td>
</tr>
<tr>
<td>400 &#x00D7; 400</td>
<td>4 vs. 2</td>
<td>APSR</td>
<td><bold>0.87</bold></td>
<td>0.85</td>
<td>0.79</td>
<td>0.73</td>
<td>0.47</td>
<td>0.45</td>
</tr>
<tr>
<td></td>
<td></td>
<td>AER</td>
<td><bold>255.28</bold></td>
<td>238.06</td>
<td>162.17</td>
<td>150.42</td>
<td>37.19</td>
<td>&#x2212;87.29</td>
</tr>
<tr>
<td>400 &#x00D7; 400</td>
<td>6 vs. 2</td>
<td>APSR</td>
<td><bold>0.91</bold></td>
<td>0.89</td>
<td>0.86</td>
<td>0.79</td>
<td>0.49</td>
<td>0.47</td>
</tr>
<tr>
<td></td>
<td></td>
<td>AER</td>
<td><bold>311.22</bold></td>
<td>287.91</td>
<td>240.18</td>
<td>228.82</td>
<td>66.4</td>
<td>&#x2212;48.05</td>
</tr>
<tr>
<td>800 &#x00D7; 800</td>
<td>8 vs. 2</td>
<td>APSR</td>
<td><bold>0.96</bold></td>
<td>0.92</td>
<td>0.89</td>
<td>0.83</td>
<td>0.51</td>
<td>0.49</td>
</tr>
<tr>
<td></td>
<td></td>
<td>AER</td>
<td><bold>403.99</bold></td>
<td>375.91</td>
<td>308.97</td>
<td>283.49</td>
<td>109.06</td>
<td>0.08</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From an overall perspective, MA2PPO consistently outperforms the other five baseline methods across all tested scenarios, demonstrating strong stability and generalization capabilities. Notably, in the highly complex 8 vs. 2 task, MA2PPO maintains an APSR of 0.96 and an AER of 403.99, significantly exceeding the performance of the comparative algorithms. These results highlight the superior ability of MA2PPO in multi-target task allocation and strategic coordination under challenging conditions.</p>
<p>The ablation studies on MA2PPO variants without key modules reveal that removing the dynamic decoupling mechanism leads to reduced coordination accuracy, while excluding the formation reward significantly decreases group encirclement efficiency. For instance, in the 6 vs. 2 scenario, the complete MA2PPO achieves an AER of 311.22, which drops to 240.18 when the formation reward is removed, validating the critical role of our proposed modules in enhancing system-level cooperation. In contrast, MAPPO, MAAC, and COMA exhibit a noticeable decline in performance when the number of evaders increases or the task scale expands. Although MAAC also incorporates an attention mechanism, its overall success rate across different scenarios is substantially lower than that of MA2PPO. As the task scale increases, most baseline methods show significant performance fluctuations, whereas MA2PPO maintains relatively stable performance, indicating that MA2PPO-based learning demonstrates superior generalization and scalability across diverse scenarios.</p>
<p>We conduct an in-depth analysis to uncover the reasons behind the advantages of the MA2PPO algorithm. Dynamic decoupling identifies the highly correlated interactions that are essential to the task and eliminates the weakly correlated ones. Additionally, the formation reward makes each pursuer&#x2019;s reward depend on the nearby companions that can form a particular shape, so the pursuers learn better cooperative behaviors from these companions, which dramatically increases pursuit efficiency.</p>
<p>In summary, the proposed MA2PPO algorithm enhances coordination efficiency and adaptability across different task configurations and can be extended to a variety of cooperative scenarios involving more pursuers and escapees.</p>
</sec>
<sec id="s5_2_4">
<label>5.2.4</label>
<title>Computational Complexity Analysis</title>
<p>We analyze the algorithm from the perspective of computational complexity. There are <italic>N</italic> homogeneous UAVs, each with a state dimensionality of <italic>M</italic> and a continuous action dimensionality of 1. During centralized training, the critic network for each UAV must incorporate the states and actions of the other <inline-formula id="ieqn-296"><mml:math id="mml-ieqn-296"><mml:mi>N</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> UAVs, resulting in an aggregate input dimensionality of <inline-formula id="ieqn-297"><mml:math id="mml-ieqn-297"><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>M</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> across the <italic>N</italic> critics. With the MHA mechanism, however, each critic consumes a fixed-size attended context, reducing the aggregate input dimensionality to <inline-formula id="ieqn-298"><mml:math id="mml-ieqn-298"><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>M</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. For small-scale UAV teams, the computational overhead introduced by the attention mechanism is negligible compared to approaches that do not employ it. Nevertheless, in larger-scale UAV swarms (e.g., <inline-formula id="ieqn-299"><mml:math id="mml-ieqn-299"><mml:mi>N</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>10</mml:mn></mml:math></inline-formula>), the computational burden increases dramatically without the attention mechanism, while with attention, the growth remains linear with respect to <italic>N</italic>. Therefore, our algorithm is more suitable for large-scale swarm tasks.</p>
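<p>A back-of-the-envelope check makes the gap explicit. The sketch below simply evaluates the two aggregate input sizes for a few team sizes; the value M = 8 is an arbitrary illustrative choice.</p>
<code language="python"><![CDATA[
def critic_input_dims(N: int, M: int) -> tuple:
    # Without attention, each of the N critics concatenates all N UAV
    # states, so the aggregate input across the team grows as N^2 * M.
    without_attention = N * N * M
    # With MHA, each critic consumes a fixed-size attended context, so
    # the aggregate input grows only linearly in N.
    with_attention = N * M
    return without_attention, with_attention

for n in (4, 10, 50):
    print(n, critic_input_dims(n, M=8))
# 4 -> (128, 32); 10 -> (800, 80); 50 -> (20000, 400)
]]></code>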
<p>Moreover, in the extreme case where each UAV communicates with all other <inline-formula id="ieqn-300"><mml:math id="mml-ieqn-300"><mml:mi>N</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> UAVs, interactions are treated as uniformly contributing, which causes the variance of the advantage function estimate to grow quadratically with <italic>N</italic>. However, not all neighboring UAV messages are relevant to the task. High-variance advantage estimates lead to unstable gradient directions during policy updates, significantly slowing down convergence. By leveraging dynamic decoupling, our method enables each pursuer to identify and retain only strongly correlated interactions with teammates, effectively removing irrelevant advantage terms that do not contribute to policy gradients. This reduces the overall variance and facilitates more efficient policy learning.</p>
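<p>One simple way to realize such pruning is sketched below. The fixed cutoff used here is our own assumption, shown only to make the variance-reduction idea concrete; in the paper the decoupling is driven by learned attention weights rather than a hand-set threshold.</p>
<code language="python"><![CDATA[
import numpy as np

def decouple(weights: np.ndarray, cutoff: float = 0.05) -> np.ndarray:
    """weights: attention weights over the N-1 teammates, summing to 1."""
    kept = np.where(weights >= cutoff, weights, 0.0)  # drop weak teammates
    total = kept.sum()
    # Renormalize so the retained teammates still form a distribution;
    # if every weight falls below the cutoff, keep the original weights.
    return kept / total if total > 0 else weights

print(decouple(np.array([0.55, 0.38, 0.04, 0.03])))
# -> [0.5914, 0.4086, 0.0, 0.0]: the two weak teammates no longer
#    contribute noise to the advantage estimate.
]]></code>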
</sec>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion and Future Work</title>
<p>Our proposed approach, MA2PPO, builds upon the foundations of PPO and CTDE, and improves the efficiency and capture probability of multi-UAV systems during pursuit of an escapee. To address the credit assignment problem, we introduce a dynamic decoupling mechanism that leverages MHA to let each pursuer selectively incorporate information from teammates and thus determine in real time which interactions should be decoupled. It dynamically separates the multi-UAV collaboration problem into decoupled sub-problems and minimizes the noise introduced, lowering the variance of the policy gradient estimate. Additionally, we design a novel reward function that combines the formation reward with the distance reward to encourage the pursuers to learn complex cooperative pursuit strategies around the escapee. This enables UAVs to reach the optimal policy sooner and improves the capture probability of multi-UAV systems. Experiments demonstrate that MA2PPO effectively enhances cooperation among UAVs and encourages the formation of clusters that exhibit a wide range of cooperative behaviors. Our algorithm can also be applied in collaborative scenarios with larger numbers of UAVs.</p>
<p>The primary limitation of the current study lies in the use of a 2D UAV model for experimentation, where all UAVs are assumed to operate at a constant speed. In more realistic settings involving variable-speed scenarios and complex UAV dynamics and perception models in three-dimensional (3D) environments, the proposed MA2PPO algorithm may not be directly applicable to pursuit-evasion games. Compared to the 2D case, UAVs operating in 3D space have more degrees of freedom, such as attitude control and climb rate constraints, and are subject to physical control laws that impose response delays, making the learning task significantly more challenging. To address these issues, we consider adopting a hierarchical architecture that separates high-level decision-making from low-level trajectory tracking. Furthermore, curriculum learning can be employed, starting from pretraining in simplified 2D constant-speed environments and gradually incorporating 3D dynamics to improve policy learning and training stability.</p>
<p>In addition, the current formation reward is based on the relative angular dispersion of any two pursuers with respect to the evader in 2D space, which encourages the UAVs to spread around the evader and form an effective encirclement. However, this 2D angle-based formation reward cannot be directly extended to 3D environments. Designing a robust and differentiable 3D encirclement metric remains a key challenge. A potential solution is to construct a 3D formation reward by modeling the relative spatial distribution between pursuers and the evader using tensor encoding. This would allow for a measurement of spatial coverage in 3D space. Additionally, a vertical separation penalty term could be introduced to discourage UAV stacking, thereby promoting more effective and realistic 3D formations.</p>
</sec>
</body>
<back>
<ack>
<p>The authors sincerely thank the entire editorial team and all reviewers for voluntarily dedicating their time and providing valuable feedback.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported in part by the National Research and Development Program of China under Grant JCKY2018607C019, and in part by the Key Laboratory Fund of UAV of Northwestern Polytechnical University under Grant 2021JCJQLB07101.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Lei Lei: Conceptualization, Methodology, Writing&#x2014;original draft; Chengfu Wu: Writing&#x2014;review &#x0026; editing, Supervision, Resources, Funding acquisition; Huaimin Chen: Investigation, Funding acquisition, Project administration. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Moving target tracking by unmanned aerial vehicle: a survey and taxonomy</article-title>. <source>IEEE Trans Ind Inform</source>. <year>2024</year>;<volume>20</volume>(<issue>5</issue>):<fpage>7056</fpage>&#x2013;<lpage>68</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TII.2024.3363084</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kashino</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Nejat</surname> <given-names>G</given-names></string-name>, <string-name><surname>Benhabib</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Multi-UAV based autonomous wilderness search and rescue using target Iso-probability curves</article-title>. In: <conf-name>2019 International Conference on Unmanned Aircraft Systems (ICUAS)</conf-name>; <year>2019 Jun 11&#x2013;14</year>; <publisher-loc>Atlanta, GA, USA</publisher-loc>. p. <fpage>636</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1109/icuas.2019.8798354</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Cooperative path planning of UAVs &#x0026; UGVs for a persistent surveillance task in urban environments</article-title>. <source>IEEE Internet Things J</source>. <year>2020</year>;<volume>8</volume>(<issue>6</issue>):<fpage>4906</fpage>&#x2013;<lpage>19</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JIOT.2020.3030240</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Savkin</surname> <given-names>AV</given-names></string-name></person-group>. <article-title>An algorithm of reactive collision free 3-D deployment of networked unmanned aerial vehicles for surveillance and monitoring</article-title>. <source>IEEE Trans Ind Inform</source>. <year>2020</year>;<volume>16</volume>(<issue>1</issue>):<fpage>132</fpage>&#x2013;<lpage>40</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TII.2019.2913683</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Savkin</surname> <given-names>AV</given-names></string-name></person-group>. <article-title>A method for optimized deployment of unmanned aerial vehicles for maximum coverage and minimum interference in cellular networks</article-title>. <source>IEEE Trans Ind Inform</source>. <year>2018</year>;<volume>15</volume>(<issue>5</issue>):<fpage>2638</fpage>&#x2013;<lpage>47</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TII.2018.2875041</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fawaz</surname> <given-names>W</given-names></string-name>, <string-name><surname>Abou-Rjeily</surname> <given-names>C</given-names></string-name>, <string-name><surname>Assi</surname> <given-names>C</given-names></string-name></person-group>. <article-title>UAV-aided cooperation for FSO communication systems</article-title>. <source>IEEE Commun Mag</source>. <year>2018</year>;<volume>56</volume>(<issue>1</issue>):<fpage>70</fpage>&#x2013;<lpage>5</lpage>. doi:<pub-id pub-id-type="doi">10.1109/mcom.2017.1700320</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Schmidt</surname> <given-names>LM</given-names></string-name>, <string-name><surname>Brosig</surname> <given-names>J</given-names></string-name>, <string-name><surname>Plinge</surname> <given-names>A</given-names></string-name>, <string-name><surname>Eskofier</surname> <given-names>BM</given-names></string-name>, <string-name><surname>Mutschler</surname> <given-names>C</given-names></string-name></person-group>. <article-title>An introduction to multi-agent reinforcement learning and review of its application to autonomous mobility</article-title>. In: <conf-name>2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)</conf-name>; <year>2022 Oct 8&#x2013;12</year>; <publisher-loc>Macau, China</publisher-loc>. p. <fpage>1342</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ITSC55140.2022.9922205</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Selvakumar</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bakolas</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Min-max Q-learning for multi-player pursuit-evasion games</article-title>. <source>Neurocomputing</source>. <year>2022</year>;<volume>475</volume>:<fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2021.12.025</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shen</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Real-time acceleration-continuous path-constrained trajectory planning with built-In tradeoff between cruise and time-optimal motions</article-title>. <source>IEEE Trans Autom Sci Eng</source>. <year>2020</year>;<volume>17</volume>(<issue>4</issue>):<fpage>1911</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TASE.2020.2980423</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>L</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Cooperative pursuit with multi-pursuer and one faster free-moving evader</article-title>. <source>IEEE Trans Cybern</source>. <year>2022</year>;<volume>52</volume>(<issue>3</issue>):<fpage>1405</fpage>&#x2013;<lpage>14</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCYB.2019.2958548</pub-id>; <pub-id pub-id-type="pmid">32413935</pub-id></mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Tran</surname> <given-names>HV</given-names></string-name></person-group>. <source>Hamilton-Jacobi equations: theory and applications</source>. <publisher-loc>Providence, RI, USA</publisher-loc>: <publisher-name>American Mathematical Society</publisher-name>; <year>2021</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Synchronous fault-tolerant near-optimal control for discrete-time nonlinear PE game</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2021</year>;<volume>32</volume>(<issue>10</issue>):<fpage>4432</fpage>&#x2013;<lpage>44</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2020.3017762</pub-id>; <pub-id pub-id-type="pmid">32903189</pub-id></mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Polycarpou</surname> <given-names>MM</given-names></string-name></person-group>. <article-title>Multiplayer pursuit-evasion differential games with malicious pursuers</article-title>. <source>IEEE Trans Autom Control</source>. <year>2022</year>;<volume>67</volume>(<issue>9</issue>):<fpage>4939</fpage>&#x2013;<lpage>46</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tac.2022.3168430</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pan</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>A region-based relay pursuit scheme for a pursuit-evasion game with a single evader and multiple pursuers</article-title>. <source>IEEE Trans Syst Man Cybern Syst</source>. <year>2022</year>;<volume>53</volume>(<issue>3</issue>):<fpage>1958</fpage>&#x2013;<lpage>69</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TSMC.2022.3210022</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Visual range maneuver decision of unmanned combat aerial vehicle based on fuzzy reasoning</article-title>. <source>Int J Fuzzy Syst</source>. <year>2022</year>;<volume>24</volume>(<issue>1</issue>):<fpage>519</fpage>&#x2013;<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s40815-021-01158-y</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zha</surname> <given-names>WZ</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>ZH</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Multi-player pursuit-evasion games with one superior evader</article-title>. <source>Automatica</source>. <year>2016</year>;<volume>71</volume>:<fpage>24</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.automatica.2016.04.012</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jaderberg</surname> <given-names>M</given-names></string-name>, <string-name><surname>Czarnecki</surname> <given-names>WM</given-names></string-name>, <string-name><surname>Dunning</surname> <given-names>I</given-names></string-name>, <string-name><surname>Marris</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lever</surname> <given-names>G</given-names></string-name>, <string-name><surname>Casta&#x00F1;eda</surname> <given-names>AG</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Human-level performance in 3D multiplayer games with population-based reinforcement learning</article-title>. <source>Science</source>. <year>2019</year>;<volume>364</volume>(<issue>6443</issue>):<fpage>859</fpage>&#x2013;<lpage>65</lpage>. doi:<pub-id pub-id-type="doi">10.1126/science.aau6249</pub-id>; <pub-id pub-id-type="pmid">31147514</pub-id></mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ba&#x015F;ar</surname> <given-names>T</given-names></string-name></person-group>. <chapter-title>Multi-agent reinforcement learning: a selective overview of theories and algorithms</chapter-title>. In: <source>Handbook of reinforcement learning and control</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2021</year>. p. <fpage>321</fpage>&#x2013;<lpage>84</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-60990-0_12</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lowe</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tamar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Harb</surname> <given-names>J</given-names></string-name>, <string-name><surname>Abbeel</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mordatch</surname> <given-names>I</given-names></string-name></person-group>. <article-title>Multi-agent actor-critic for mixed cooperative-competitive environments</article-title>. In: <conf-name>NIPS&#x2019;17: Proceedings of the 31st International Conference on Neural Information Processing Systems</conf-name>; <year>2017 Dec 4&#x2013;9</year>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>6382</fpage>&#x2013;<lpage>93</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hernandez-Leal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Kartal</surname> <given-names>B</given-names></string-name>, <string-name><surname>Taylor</surname> <given-names>ME</given-names></string-name></person-group>. <article-title>A survey and critique of multiagent deep reinforcement learning</article-title>. <source>Auton Agents Multi Agent Syst</source>. <year>2019</year>;<volume>33</volume>(<issue>6</issue>):<fpage>750</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10458-019-09421-1</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sunehag</surname> <given-names>P</given-names></string-name>, <string-name><surname>Lever</surname> <given-names>G</given-names></string-name>, <string-name><surname>Gruslys</surname> <given-names>A</given-names></string-name>, <string-name><surname>Czarnecki</surname> <given-names>WM</given-names></string-name>, <string-name><surname>Zambaldi</surname> <given-names>V</given-names></string-name>, <string-name><surname>Jaderberg</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Value-decomposition networks for cooperative multi-agent learning based on team reward</article-title>. In: <conf-name>Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS)</conf-name>; <year>2018 Jul 10&#x2013;15</year>; <publisher-loc>Stockholm, Sweden</publisher-loc>. p. <fpage>2085</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rashid</surname> <given-names>T</given-names></string-name>, <string-name><surname>Samvelyan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Schroeder</surname> <given-names>C</given-names></string-name>, <string-name><surname>Farquhar</surname> <given-names>G</given-names></string-name>, <string-name><surname>Foerster</surname> <given-names>J</given-names></string-name>, <string-name><surname>Whiteson</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning</article-title>. In: <conf-name>Proceedings of the 35th International Conference on Machine Learning (ICML)</conf-name>; <year>2018 Jul 10&#x2013;15</year>; <publisher-loc>Stockholm, Sweden</publisher-loc>. p. <fpage>4295</fpage>&#x2013;<lpage>304</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Son</surname> <given-names>K</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kang</surname> <given-names>WJ</given-names></string-name>, <string-name><surname>Hostallero</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yi</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning</article-title>. In: <conf-name>Proceedings of the 36th International Conference on Machine Learning (ICML)</conf-name>; <year>2019 Jun 9&#x2013;15</year>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>5887</fpage>&#x2013;<lpage>96</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Foerster</surname> <given-names>J</given-names></string-name>, <string-name><surname>Farquhar</surname> <given-names>G</given-names></string-name>, <string-name><surname>Afouras</surname> <given-names>T</given-names></string-name>, <string-name><surname>Nardelli</surname> <given-names>N</given-names></string-name>, <string-name><surname>Whiteson</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Counterfactual multi-agent policy gradients</article-title>. In: <conf-name>Proceedings of the 32nd Association for the Advancement of Artificial Intelligence Conference on Artificial General Intelligence (AAAI)</conf-name>; <year>2018 Feb 2&#x2013;7</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>2974</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v32i1.11794</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Du</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>L</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>T</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Liir: learning individual intrinsic reward in multi-agent reinforcement learning</article-title>. In: <conf-name>Proceedings of the 33rd International Conference Neural Information Process System (NeurIPS)</conf-name>; 2019 Dec 8&#x2013;14; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>4403</fpage>&#x2013;<lpage>14</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name></person-group>. <article-title>A UAV pursuit-evasion strategy based on DDPG and imitation learning</article-title>. <source>Int J Aerosp Eng</source>. <year>2022</year>;<volume>2022</volume>:<fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi:<pub-id pub-id-type="doi">10.1155/2022/3139610</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liao</surname> <given-names>G</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Multi-UAV escape target search: a multi-agent reinforcement learning method</article-title>. <source>Sensors</source>. <year>2024</year>;<volume>24</volume>(<issue>21</issue>):<fpage>6859</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s24216859</pub-id>; <pub-id pub-id-type="pmid">39517756</pub-id></mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zong</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Dou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Game of drones: multi-UAV pursuit-evasion game with online motion planning by deep reinforcement learning</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2023</year>;<volume>34</volume>(<issue>10</issue>):<fpage>7900</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2022.3146976</pub-id>; <pub-id pub-id-type="pmid">35157597</pub-id></mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Iqbal</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sha</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Actor-attention-critic for multi-agent reinforcement learning</article-title>. In: <conf-name>Proceedings of the 36th International Conference on Machine Learning (ICML)</conf-name>; <year>2019 Jun 9&#x2013;15</year>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>2961</fpage>&#x2013;<lpage>70</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Multi-agent game abstraction via graph attention neural network</article-title>. In: <conf-name>Proceedings of the 34th Association for the Advancement of Artificial Intelligence Conference Artificial Intelligence (AAAI)</conf-name>; <year>2020 Feb 7&#x2013;12</year>; <publisher-loc>New York, NY, USA</publisher-loc>. p. <fpage>7211</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ryu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shin</surname> <given-names>H</given-names></string-name>, <string-name><surname>Park</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Multi-agent actor-critic with hierarchical graph attention network</article-title>. In: <conf-name>Proceedings of the 34th Association for the Advancement of Artificial Intelligence Conference Artificial Intelligence (AAAI)</conf-name>; <year>2020 Feb 7&#x2013;12</year>; <publisher-loc>New York, NY, USA</publisher-loc>. p. <fpage>7236</fpage>&#x2013;<lpage>43</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Multi-UAV cooperative pursuit strategy with limited visual field in urban airspace: a multi-agent reinforcement learning approach</article-title>. <source>IEEE/CAA J Autom Sin</source>. <year>2025</year>;<volume>12</volume>(<issue>7</issue>):<fpage>1350</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.1109/jas.2024.124965</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Piao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Distributional reward estimation for effective multi-agent deep reinforcement learning</article-title>. In: <conf-name>Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS)</conf-name>; <year>2022 Nov 28&#x2013;Dec 9</year>; <publisher-loc>Orleans, LA, USA</publisher-loc>. p. <fpage>12619</fpage>&#x2013;<lpage>32</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning</article-title>. <source>Chin J Aeronaut</source>. <year>2022</year>;<volume>35</volume>(<issue>7</issue>):<fpage>100</fpage>&#x2013;<lpage>12</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cja.2021.09.008</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Pu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Multi-UAV cooperative short-range combat via attention-based reinforcement learning using individual reward shaping</article-title>. In: <conf-name>2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>; <year>2022 Oct 23&#x2013;27</year>; <publisher-loc>Kyoto, Japan</publisher-loc>. p. <fpage>13737</fpage>&#x2013;<lpage>44</lpage>. doi:<pub-id pub-id-type="doi">10.1109/IROS47612.2022.9982096</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kutpanova</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Kadhim</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhakiyev</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Multi-UAV path planning for multiple emergency payloads delivery in natural disaster scenarios</article-title>. <source>J Electron Sci Technol</source>. <year>2025</year>;<volume>23</volume>(<issue>2</issue>):<fpage>100303</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jnlest.2025.100303</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Schulman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wolski</surname> <given-names>F</given-names></string-name>, <string-name><surname>Dhariwal</surname> <given-names>P</given-names></string-name>, <string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Klimov</surname> <given-names>O</given-names></string-name></person-group>. <article-title>Proximal policy optimization algorithms</article-title>. In: <conf-name>Proceedings of the 5th International Conference on Learning Representations (ICLR)</conf-name>; <year>2017 Apr 24&#x2013;26</year>; <publisher-loc>Toulon, France</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>12</lpage>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Schulman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Moritz</surname> <given-names>P</given-names></string-name>, <string-name><surname>Levine</surname> <given-names>S</given-names></string-name>, <string-name><surname>Jordan</surname> <given-names>MI</given-names></string-name>, <string-name><surname>Abbeel</surname> <given-names>P</given-names></string-name></person-group>. <article-title>High-dimensional continuous control using generalized advantage estimation</article-title>. In: <conf-name>Proceedings of the 4th International Conference on Learning Representations (ICLR)</conf-name>; <year>2016 May 2&#x2013;4</year>; <publisher-loc>San Juan, Puerto Rico</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>14</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention is all you need</article-title>. In: <conf-name>Proceedings of the 31st International Conference Neural Information Processing System (NeurIPS)</conf-name>; <year>2017 Dec 4&#x2013;9</year>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>6000</fpage>&#x2013;<lpage>10</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Multi actor hierarchical attention critic with RNN-based feature extraction</article-title>. <source>Neurocomputing</source>. <year>2022</year>;<volume>471</volume>:<fpage>79</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2021.10.093</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xia</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Du</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>G</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Multi-agent reinforcement learning aided intelligent UAV swarm for target tracking</article-title>. <source>IEEE Trans Veh Technol</source>. <year>2021</year>;<volume>71</volume>(<issue>1</issue>):<fpage>931</fpage>&#x2013;<lpage>45</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TVT.2021.3129504</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kingma</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Ba</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Adam: a method for stochastic optimization</article-title>. In: <conf-name>Proceedings of the 3rd International Conference on Learning Representations (ICLR)</conf-name>; <year>2015 May 7&#x2013;9</year>; <publisher-loc>San Diego, CA, USA</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>13</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>K</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A high fidelity simulator for a auadrotor UAV using ROS and Gazebo</article-title>. In: <conf-name>Proceedings of the Annual Conference of the IEEE Industrial Electronics Society</conf-name>; <year>2015 Nov 9&#x2013;12</year>; <publisher-loc>Yokohama, Japan</publisher-loc>. p. <fpage>846</fpage>&#x2013;<lpage>51</lpage>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Velu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vinitsky</surname> <given-names>E</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bayen</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>The surprising effectiveness of PPO in cooperative multi-agent games</article-title>. In: <conf-name>Proceedings of the 36th Conference Conference on Neural Information Processing Systems (NeurIPS)</conf-name>; <year>2022 Nov 28&#x2013;Dec 9</year>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>24611</fpage>&#x2013;<lpage>24</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>