An Intelligent Algorithm for Solving Weapon-Target Assignment Problem: DDPG-DNPE Algorithm

Traditional dynamic weapon-target assignment algorithms used in command decision-making suffer from heavy computation, slow solution speed, and low accuracy. Drawing on deep reinforcement learning theory, this paper proposes an improved Deep Deterministic Policy Gradient algorithm with dual noise and prioritized experience replay (DDPG-DNPE). A dual noise mechanism expands the search range of the action, and a prioritized experience replay mechanism makes more effective use of the collected data. The algorithm is simulated and validated on a ground-to-air countermeasures digital battlefield. The experimental results show that, under the deep neural network framework for intelligent weapon-target assignment proposed in this paper, agents trained with reinforcement learning algorithms such as Deep Deterministic Policy Gradient, Asynchronous Advantage Actor-Critic, and Deep Q Network perform better than the traditional RELU algorithm, which indicates that deep reinforcement learning is a sound approach to the weapon-target assignment problem in air defense operations. Compared with the other reinforcement learning algorithms, the agent trained by the improved Deep Deterministic Policy Gradient algorithm achieves a higher win rate and reward in confrontation and uses weapon resources more efficiently, which supports the rationality and advantages of the model and algorithm. These results provide new ideas for solving the weapon-target assignment problem in air defense combat command decisions.


Introduction
Weapon-target assignment (WTA) is the core link in the command and control of air defense operations and has a significant impact on combat effectiveness. It refers to the efficient use of one's own multi-type, multi-platform weapon resources, based on battlefield situational awareness, to rationally allocate interceptors against multiple incoming targets while avoiding the omission of key targets, repeated shooting, and similar problems, so as to achieve the best combat effect [1][2][3]. The problem has been proven to be non-deterministic polynomial complete (NP-complete) [4,5]. Efficient WTA is more than 3 times more effective than free fire [6], acting as a force multiplier.
Experts and scholars at home and abroad regard WTA as a class of mathematical problems and have carried out related model construction and solution work at multiple levels [7][8][9][10][11][12][13]. Davis et al. [14] focused on ballistic missile defense with multiple rounds of launches under dynamic conditions from the perspective of maximizing the protection of strategic points. Xu et al. [15] fully considered the communication and cooperation between sensor platforms and weapon platforms, and studied the multi-stage air defense WTA problem with the optimization goals of maximizing the target threat and minimizing the total cost of interception. Guo et al. [16] mainly studied the problem of many-to-many missile interception and proposed fixed and adaptive grouping strategies, which were solved by an artificial swarm algorithm. For the WTA problem of regional defense, Kim et al. [17] proposed two new algorithms, a rotation fixed strategy and a rotation strategy, to deal with multi-target attacks, and verified their effectiveness by experiments. Li et al. [18] summarized and analyzed the static WTA problem and introduced an improved Multi-objective Evolutionary Algorithm Based on Decomposition (MOEA/D) framework to solve it. Based on an analysis of the difficulties of regional air defense decision-making, Severson et al. [19] adopted the idea of multi-layer defense to establish a distribution model that maximizes the interference effectiveness index. Ouyang et al. [20] proposed a distributed allocation method to effectively manage radar resources, using probabilistic optimization algorithms to allocate radar targets for limited radar early warning resources. To maximize the kill probability, Feng et al. [21] divided the fire units into groups, considered compound strikes within the same group of fire units, and constructed a dynamic WTA model with multiple interception timings. Bayrak et al. [22] studied how to efficiently solve the firepower cooperative allocation problem using a genetic algorithm (GA) with good convergence and search speed. Li et al. [23] designed a particle swarm optimization (PSO) algorithm with perturbation by attractors and inverse elements for the anti-missile interceptor-target allocation problem. To realize the rapid solution of WTA, Fan et al. [24] introduced a variable neighborhood search factor into the solution equation, which mimics the natural behavior of bees, and proposed a memory mechanism that improves the global search efficiency of the artificial bee colony (ABC) algorithm. Fu et al. [25] proposed a multi-target PSO algorithm based on the coevolution of multiple populations to construct a model of co-evolution.
However, the above studies are based on traditional analytical models and algorithms. Because the battlefield environment and the opponent's strategies change rapidly and are difficult to quantify, traditional analytical models and algorithms hit bottlenecks in handling the uncertainty and nonlinearity of decision-making, and it is difficult for them to adapt to the changing battlefield environment. In air defense operations, decision-making advantage is the core, so it is urgent to study new methods for WTA to improve the level of intelligent decision-making.
WTA is a typical sequential decision-making process oriented to incomplete information games, which can be reduced to solving a Markov decision process (MDP) problem. Deep reinforcement learning (DRL) provides an efficient solution to this problem: DRL can realize the end-to-end learning process from perception to action, and its learning mechanism and method are in line with the experience learning and decision-making thinking mode of combat commanders, which gives it obvious advantages for solving sequential decision-making problems under game confrontation conditions. DRL has achieved good results in Go [26][27][28], real-time strategy games [29,30], automatic driving [31], intelligent recommendation [32], and other fields.
In summary, this paper aims to apply the theory and algorithms of DRL to the WTA problem of air defense operation command decision-making. By introducing an event-based reward mechanism (EBR), a multi-head attention mechanism (MHA), and a gated recurrent unit (GRU), a new deep neural network framework for intelligent WTA is constructed and solved by the improved Deep Deterministic Policy Gradient algorithm with dual noise and prioritized experience replay (DDPG-DNPE). The aim is to improve the auxiliary decision-making ability of intelligent WTA for air defense operations in highly dynamic, uncertain, and complex battlefield environments, transform information advantages into decision-making advantages, and provide more accurate WTA decision support for commanders. Finally, red and blue sides are designed on the simulation deduction platform to verify the network architecture and algorithm proposed in this paper. The experimental results demonstrate the practicability and effectiveness of the proposed method.

Related Theories

DRL
The goal of reinforcement learning (RL) is to enable the agent to obtain the maximum cumulative reward during interaction with the environment, and to learn the optimal control method for its actions. RL introduces the concepts of agent and environment, expanding the optimal control problem into the more general and broader class of sequential decision-making problems. The agent can autonomously interact with the environment and obtain training samples, rather than relying on a limited number of expert samples. The RL model consists of five key parts: agent, environment, state, action, and reward. Each interaction between the agent and the environment produces corresponding information, which is used to update the agent's knowledge. This perception-action-learning cycle is shown in Fig. 1.

Figure 1: Schematic diagram of RL
DRL is the reinforcement learning method that uses a deep neural network to express the agent strategy. In the field of air defense intelligent WTA, the perception ability of deep learning (DL) can be used for battlefield situation recognition, and the RL algorithm can be used to assist decision-making to improve the efficiency of WTA and gain competitive advantages.

An MDP is represented by the five-tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the state set; $A$ is the action set; $P$ is the state transition matrix; $R$ is the reward; and $\gamma$ is the discount factor.
In an MDP, the agent starts in an initial state $s$ in the environment and executes an action $a$; the environment then outputs the next state $s'$ and the reward $r$ obtained by the current action $a$. The agent interacts with the environment continuously. $R_t$ is the cumulative reward, which is the discounted sum of rewards from time step $t$ onward:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$$

The policy $\pi$ is the probability of selecting action $a$ in state $s$:

$$\pi(a \mid s) = P(a_t = a \mid s_t = s)$$

The state-value function $V^{\pi}(s)$ is the expected total reward of following strategy $\pi$ from the initial state $s$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s \right]$$

The action-value function $Q^{\pi}(s, a)$ is the expected total reward obtained when action $a$ is performed in state $s$ and subsequent actions follow strategy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s, a_t = a \right]$$

The goal of RL is to solve for the optimal policy function $\pi^{*}$ of the MDP to maximize the return, while the optimal state-value function $V^{*}(s)$ and the optimal action-value function $Q^{*}(s, a)$ are expressions of the optimal policy function $\pi^{*}$:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$$
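As a small illustration of the cumulative reward defined above, the following sketch computes the discounted return for every step of one finite episode. It assumes the indexing convention in which the reward at step t is counted with weight 1; the function name `discounted_returns` is hypothetical and is not taken from the paper.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return R_t = sum_k gamma**k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

The backward recursion avoids recomputing the sum for every step, so the cost is linear in the episode length.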

Introduction of the DDPG Algorithm
To address the curse of dimensionality in traditional RL algorithms, the representation advantages of DL can be combined with the decision-making advantages of RL to obtain DRL algorithms better suited to large-scale decision-making tasks. The DDPG algorithm has strong deep neural network fitting ability, generalized learning ability, and sequential decision-making ability, which is very consistent with the decision-making thinking of air defense operations, so this paper uses the DDPG algorithm for air defense intelligent decision-making.
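The objective function of the DDPG algorithm is defined as the expectation of the cumulative reward:

$$J_{\beta}(\mu) = \mathbb{E}_{s \sim \rho^{\beta}}\left[ Q^{\mu}(s, \mu(s)) \right]$$

Finding the optimal deterministic behavior policy $\mu^{*}$ is equivalent to finding the policy that maximizes the objective function $J_{\beta}(\mu)$:

$$\mu^{*} = \arg\max_{\mu} J_{\beta}(\mu)$$

The Actor network is updated as follows:

$$\nabla_{\theta^{\mu}} J_{\beta}(\mu) \approx \mathbb{E}_{s \sim \rho^{\beta}}\left[ \nabla_{a} Q(s, a; \theta^{Q})\big|_{a=\mu(s)} \, \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu}) \right]$$

where $Q^{\mu}(s, \mu(s))$ is the action-state value generated when the action is selected according to the deterministic policy $\mu$ in state $s$, and $\mathbb{E}_{s \sim \rho^{\beta}}$ denotes the expectation of $Q$ when the state $s$ follows the distribution $\rho^{\beta}$. Gradient ascent is used to optimize the above equation and continuously improve the expected discounted cumulative reward. Finally, the algorithm updates the parameters $\theta^{\mu}$ of the policy network in the direction that increases the action value $Q(s, a; \theta^{Q})$.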
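The critic network is updated in the same way as the value network in Deep Q Network (DQN). With the loss $L(\theta^{Q}) = \mathbb{E}\left[ \left( y - Q(s, a; \theta^{Q}) \right)^{2} \right]$, the gradient of the value network is as follows:

$$\nabla_{\theta^{Q}} L = -2\, \mathbb{E}\left[ \left( y - Q(s, a; \theta^{Q}) \right) \nabla_{\theta^{Q}} Q(s, a; \theta^{Q}) \right], \qquad y = r + \gamma Q'\left(s', \mu'(s'; \theta^{\mu'}); \theta^{Q'}\right)$$

where $y$ is the Target Q value. The neural network parameters $\theta^{\mu'}$ and $\theta^{Q'}$ in the Target Q value are the parameters of the target policy network and target value network, respectively, and gradient descent is used to update the parameters in the network model. The process of training the value network is to find the optimal solution of the parameters $\theta^{Q}$ in the value network.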
Therefore, the training objective of the DDPG algorithm is to maximize the objective function J β (μ) and minimize the loss of the value network Q.

DDPG-DNPE Algorithm
The traditional DDPG algorithm uses an experience replay mechanism and samples uniformly from the experience replay pool, that is, all experiences are considered equally important. In the actual simulation process, however, samples differ in importance, and the data on which the network performs poorly during interaction are more valuable for learning. Therefore, this paper introduces a prioritized experience replay mechanism that assigns different weights to different data, so that the training network spends as much of its learning as possible on high-value data.
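The larger $|\delta_i|$ (the absolute temporal-difference error of experience $i$) is, the higher the weight given to the experience. The sampling probability of experience $i$ can be defined as:

$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}$$

where $p_i = 1/\mathrm{rank}(i)$, and $\mathrm{rank}(i)$ is the rank of experience $i$ in the experience pool when experiences are sorted by $|\delta_i|$. The larger $|\delta_i|$ is, the higher experience $i$ ranks, that is, the greater its probability of being drawn. $\alpha$ determines how strongly the priorities are used.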
High-frequency sampling of experiences with high weights changes the distribution of samples, making it difficult for the model to converge. Importance sampling is therefore used to correct for this, and the importance sampling weight is:

$$w_i = \left( \frac{1}{S} \cdot \frac{1}{P(i)} \right)^{\beta}$$

where $S$ is the size of the experience replay pool, and $\beta$ is a hyperparameter that controls how strongly the bias introduced by priority-based experience replay is corrected.
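The prioritized sampling and importance weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes rank-based priorities p_i = 1 / rank(i), with rank 1 given to the largest absolute error, and it additionally normalises the weights by their maximum, which is a common stabilising choice that the text does not state. The function names are hypothetical.

```python
from typing import List


def rank_based_probabilities(td_errors: List[float], alpha: float) -> List[float]:
    """P(i) = p_i**alpha / sum_k p_k**alpha, assuming p_i = 1 / rank(i)."""
    # Rank 1 goes to the experience with the largest absolute error.
    order = sorted(range(len(td_errors)), key=lambda i: abs(td_errors[i]), reverse=True)
    ranks = [0] * len(td_errors)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    scaled = [(1.0 / rank) ** alpha for rank in ranks]
    total = sum(scaled)
    return [value / total for value in scaled]


def importance_weights(probabilities: List[float], beta: float) -> List[float]:
    """w_i = (1 / (S * P(i)))**beta, divided by the largest weight (assumption)."""
    size = len(probabilities)
    raw = [(1.0 / (size * p)) ** beta for p in probabilities]
    largest = max(raw)
    return [w / largest for w in raw]
```

With alpha = 0 the sampling becomes uniform, and with beta = 0 every weight equals 1, so the two exponents interpolate between ordinary and fully prioritized replay.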
As shown in Fig. 2, to better balance exploration and update during training, OU (Ornstein-Uhlenbeck) random noise and Gaussian noise are introduced to change the decision-making process of the action from a deterministic process to a random process. The differential form of the OU random noise $N_t$ is:

$$dN_t = \theta (\mu - N_t)\,dt + \delta\, dB_t$$

where $\mu$ is the mean value; $\theta$ is the speed at which the noise reverts to the mean; $\delta$ is the fluctuation degree of the noise; and $B_t$ is standard Brownian motion.
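The dual noise idea can be sketched with a simple Euler-Maruyama discretisation of the OU process. This is an illustration under assumptions: the paper does not state how the two noise sources are combined, so the sketch simply adds both to a scalar action and clips the result; the default parameter values, the class name `OUNoise`, and the helper `noisy_action` are hypothetical.

```python
import math
import random
from typing import Optional


class OUNoise:
    """Euler-Maruyama discretisation of dN = theta * (mu - N) dt + delta dB."""

    def __init__(self, mu: float = 0.0, theta: float = 0.15, delta: float = 0.2,
                 dt: float = 0.01, x0: float = 0.0, seed: Optional[int] = None) -> None:
        self.mu, self.theta, self.delta, self.dt = mu, theta, delta, dt
        self.state = x0
        self.rng = random.Random(seed)

    def sample(self) -> float:
        # A Brownian increment over dt is Gaussian with standard deviation sqrt(dt).
        d_b = self.rng.gauss(0.0, math.sqrt(self.dt))
        self.state += self.theta * (self.mu - self.state) * self.dt + self.delta * d_b
        return self.state


def noisy_action(action: float, ou: OUNoise, gaussian_std: float,
                 low: float = -1.0, high: float = 1.0) -> float:
    """Add OU noise and independent Gaussian noise (assumed additive), then clip."""
    perturbed = action + ou.sample() + ou.rng.gauss(0.0, gaussian_std)
    return max(low, min(high, perturbed))
```

The OU term is temporally correlated, which favours sustained exploration in one direction, while the Gaussian term is independent at every step.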
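Update the critic network by minimizing the loss:

$$L = \frac{1}{N} \sum_{i} w_i \left( y_i - Q(s_i, a_i; \theta^{Q}) \right)^{2}$$

Update the actor policy using the sampled policy gradient:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a; \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu})\big|_{s=s_i}$$

where $N$ is the number of sampled experiences, $y_i$ is the Target Q value of experience $i$, and $w_i$ is its importance sampling weight.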
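The critic update can be illustrated numerically with the sketch below. It assumes that the Target Q value is y = r + gamma * Q'(s', mu'(s')), that terminal transitions drop the bootstrap term, and that each squared error is multiplied by its importance sampling weight; the `dones` flag and the function names are hypothetical, and the target network outputs are passed in as plain numbers rather than computed by a network.

```python
from typing import List


def td_targets(rewards: List[float], next_q_values: List[float],
               dones: List[bool], gamma: float) -> List[float]:
    """y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)), without bootstrapping at episode end."""
    return [r + gamma * q * (0.0 if done else 1.0)
            for r, q, done in zip(rewards, next_q_values, dones)]


def weighted_critic_loss(q_values: List[float], targets: List[float],
                         weights: List[float]) -> float:
    """Mean of w_i * (y_i - Q_i)**2 over the sampled batch."""
    count = len(q_values)
    return sum(w * (y - q) ** 2 for q, y, w in zip(q_values, targets, weights)) / count


def absolute_td_errors(q_values: List[float], targets: List[float]) -> List[float]:
    """|delta_i| values that can be written back as new replay priorities."""
    return [abs(y - q) for q, y in zip(q_values, targets)]
```

In a prioritized replay loop, the absolute errors returned by the last helper would typically be used to refresh the priorities of the sampled experiences.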

DRL Training Framework
Before using DRL to solve the WTA problem, it is necessary to collect training samples through the interaction between the agent and the environment, and then optimize the neural network parameters through the RL algorithm, so that the agent can learn the optimal strategy. The agent training framework is shown in Fig. 4.

Figure 4: Schematic diagram of the agent training framework
The focus of DRL is on both sampling and training. When sampling, the input of the neural network used by the agent is state information and reward, and the output is action information, while the simulation environment takes operating instructions as input and outputs battlefield situation information. Therefore, when the agent interacts with the environment for sampling, the data output from the simulation environment is transformed into state information and reward, and the action output from the neural network is transformed into action commands according to the parameters in the MDP model. During training, the collected samples are input to the RL algorithm, and the parameters are continuously trained and optimized to finally obtain the optimal strategy.

Deep Neural Network Framework for WTA
Based on key indicators such as air defense combat boundaries, engagement criteria, and physical constraints, the deep neural network structure used in this paper is shown in Fig. 5. In the red-blue confrontation, the red agent comprehensively evaluates the threat degree of the incoming blue targets according to the real-time state of both sides, considers the deployment of the red side's air defense fire units, and decides which blue targets to intercept at which points in time. The input of the network model is mainly the real-time state of the red and blue sides, and the output specifies which interceptor weapons are used to intercept which blue targets in the current state. The network structure can be divided into three parts: battlefield situation input, decision-making action calculation, and decision-making action output.

Battlefield Situation Input
The battlefield situation input feeds the network state space. The network state space is integrated and reduced by combining the air defense WTA combat elements. The battlefield situation is divided into four categories and input to the neural network in the form of semantic information. The specific classification and state information are shown in Table 1.

Table 1: Classification of the battlefield situation and state information

| Category | State information |
|---|---|
| State of red key target | |
| State of red fire unit | The number and position of the firepower units and radar vehicles, the state of the radar switch, the number of interceptors, the firepower unit that can be intercepted |
| State of blue incoming target that can be observed | The type, number, motion status, and threat level of the incoming target |
| State of blue target that can be intercepted | The type, number, and status of the target that can be intercepted |

In a complex battlefield environment, there are many air defense combat entities and operational constraints, and the battlefield situation changes in space over time, so the number of entities of each type changes dynamically.

Decision-Making Action Calculation
After the state of the red key target, the state of the red fire unit, the state of the blue incoming target that can be observed, and the state of the blue target that can be intercepted are input into the neural network, situation features are extracted from each type of state data through two layers of fully connected-rectified linear unit (FC-ReLU), and all the resulting features are then concatenated. After a further layer of FC-ReLU and a GRU, the global situation features are output, and then decision reasoning and action calculation are carried out.
Due to the complex battlefield environment and random disturbance, the battlefield situation presents dynamic uncertainty, and the temporal attributes of the situation and the spatial attributes of the operational nodes should be fully considered. Moreover, the red and blue adversarial data often carry historical value, that is, the decision-making in the current state is related to historical information. The GRU network can selectively forget unimportant historical information, which mitigates the problems of vanishing and exploding gradients in long-sequence training. The structure of the GRU network is shown in Fig. 6.
Figure 6: GRU

$z_t$ and $r_t$ represent the update gate and the reset gate, respectively. The update gate controls the extent to which the state information of the previous moment is brought into the current state; the larger the value of the update gate, the more state information from the previous moment is brought in. The reset gate controls how much information from the previous state is written to the current candidate hidden state $\tilde{h}_t$; the smaller the reset gate, the less information is written from the previous state. The update mechanism for each gate is:

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$

$$\tilde{h}_t = \tanh\left(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]\right)$$

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

where $x_t$ is the input information at the current moment and $h_{t-1}$ denotes the hidden state of the previous moment. The hidden state acts as the memory of the neural network and contains information about the data seen by the previous node. $h_t$ represents the hidden state passed to the next moment, and $\tilde{h}_t$ is the candidate hidden state. $\sigma$ stands for the sigmoid function, which maps the data to a value in the range 0-1; tanh is the hyperbolic tangent function, which maps the data to a value in the range [−1, 1]; and $W_r$, $W_z$, and $W_{\tilde{h}}$ are the weight matrices.
With the GRU introduced, the network can effectively retain high-value historical information, store and retrieve information flexibly, make rational use of the effective information in the strategy to associate events across time, carry out comprehensive analysis and judgment, and improve the prediction accuracy of the neural network strategy in the time-varying environment.
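A single GRU step can be sketched in plain Python as follows. This is an illustration under assumptions rather than the paper's network: bias terms are omitted, each weight matrix acts on the concatenation of the previous hidden state and the input, and the update gate weights the previous hidden state, in line with the statement that a larger update gate brings in more information from the previous moment. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gru_step(x: Vector, h_prev: Vector, w_r: Matrix, w_z: Matrix, w_h: Matrix) -> Vector:
    """One GRU update; each matrix has shape len(h_prev) x (len(h_prev) + len(x))."""
    concat = h_prev + x
    reset = [sigmoid(v) for v in matvec(w_r, concat)]
    update = [sigmoid(v) for v in matvec(w_z, concat)]
    # The reset gate scales how much of the previous state enters the candidate.
    gated = [r * h for r, h in zip(reset, h_prev)] + x
    candidate = [math.tanh(v) for v in matvec(w_h, gated)]
    # A larger update gate keeps more of the previous hidden state.
    return [z * h + (1.0 - z) * c for z, h, c in zip(update, h_prev, candidate)]
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so the new hidden state is simply half of the previous one, which is a convenient sanity check.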

Decision-Making Action Output
By integrating the network action space, the action output in the WTA process can be divided into three categories: 1. Action subject: selectable red fire unit; 2. Action predicate: the timing of interception of the red fire unit and the type of weapon launched; 3. Action object: blue target that can be intercepted.
In the red-blue confrontation, the combat units of two sides change dynamically with the development of the situation, and the intention of the blue target is closely related to its state, and different characteristics of the target state have different degrees of influence on the analysis of target intention.To improve the efficiency of training, the multi-head attention mechanism is considered.It simulates the human brain's different attention to different objects in the same field of view by assigning certain correlation degree weights to the input sequence features and further analyzes the importance of different target features so that the agent can focus on the blue target with higher threat degree at some moments, give priority to important information, and make accurate decisions quickly.
According to the attention distribution relationship of input characteristics, the attention mechanism can be divided into hard attention and soft attention. The soft attention mechanism assigns attention to each input feature and continuously learns and trains to obtain the weight of each feature. At the same time, the whole model based on the soft attention mechanism is differentiable, that is, it can be trained by backpropagation. Therefore, the soft attention mechanism is chosen in this paper.
The attention variable $z$ is used to represent the location of the selected information, and the probability of selecting the $i$-th input information is defined as $a_i$; then

$$a_i = p(z = i \mid X, q) = \mathrm{softmax}\left(f(x_i, q)\right) = \frac{\exp\left(f(x_i, q)\right)}{\sum_{j=1}^{N} \exp\left(f(x_j, q)\right)}$$

$$f(x_i, q) = v^{\top} \tanh\left(W x_i + U q\right)$$

where $X = [x_1, \ldots, x_N]$ is the input information, which is the feature vector of the blue targets that can be intercepted; $q$ is the feature vector of the selectable red fire unit, namely, the hidden state obtained by the GRU; $f(x_i, q)$ is the attention scoring function, representing the attention score of the red fire unit with respect to the blue target; $W$ and $U$ are the neural network parameters; and $v$ is the global situation feature vector. The current situation is processed by the softmax function, the relative importance of each piece of information is obtained, and attention is focused on local situation information.
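The soft attention weights can be sketched as follows, assuming the additive scoring form f(x_i, q) = v^T tanh(W x_i + U q) that the parameters W, U, and v suggest. The function name `additive_attention` and the toy dimensions are hypothetical, and the sketch covers a single attention head only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(inputs: List[Vector], query: Vector,
                       w: Matrix, u: Matrix, v: Vector) -> List[float]:
    """Return softmax weights a_i over the inputs for one query vector."""
    projected_query = matvec(u, query)
    scores = []
    for x in inputs:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w, x), projected_query)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(scores)
    exponentials = [math.exp(score - peak) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]
```

In the setting of this paper, `inputs` would hold the features of the interceptable blue targets and `query` the feature vector of one red fire unit, so each weight expresses how much that unit attends to each target.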
After the global situation features are generated, the feature vectors of the situation of the red fire unit and the interceptable blue target are respectively scored for attention, and the score of each red fire unit about each interceptable blue target is generated. Finally, the sigmoid sampling of the score vector is carried out to generate the attack target of the red fire unit. When making decisions, the algorithm will output the command and control for each unit, collect the status and overall situation of each unit, and then call the next decision command.

Red Agent Training Method
In the simulation, the data themselves are unstable: each round of iteration may produce fluctuations that immediately affect the next round, so it is difficult to obtain a stable model. This paper therefore decouples the intelligent WTA neural network during training, as shown in Fig. 7, dividing it into an inference module and a training module. The two modules use the same network structure and parameters, and the inference module is responsible for interacting with the simulation environment to obtain interaction data. Based on the interaction data, the training module continuously updates the network parameters through the improved DDPG algorithm and synchronizes the network parameters to the inference module after it completes $N_i$ iterations.

Figure 7: Schematic representation of the decoupling training
Since the parameters of the inference module are fixed during these $N_i$ iterations, the data difference is reduced and network fluctuation can be effectively avoided. The value of $N_i$ depends on the fluctuation range of the training module, for which a threshold $T$ is set: when the fluctuation range is less than $T$ and $N_i$ meets its lower limit, the parameters of the inference module are updated synchronously.
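The synchronization rule can be sketched as a small predicate. The paper does not specify how the fluctuation range is measured, so this sketch assumes it is the spread (maximum minus minimum) of a recent training metric such as the loss; the function name and all parameter names are hypothetical.

```python
from typing import List


def should_sync(recent_metric: List[float], iterations_since_sync: int,
                min_iterations: int, threshold: float) -> bool:
    """Decide whether to copy training-module parameters to the inference module."""
    # The lower limit on N_i must be met before any synchronization.
    if iterations_since_sync < min_iterations or not recent_metric:
        return False
    # Assumed fluctuation measure: spread of the recent training metric.
    fluctuation = max(recent_metric) - min(recent_metric)
    return fluctuation < threshold
```

For example, `should_sync([0.52, 0.50, 0.51], iterations_since_sync=200, min_iterations=100, threshold=0.05)` evaluates to `True`, whereas the same call with only 50 iterations since the last synchronization evaluates to `False`.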

Simulation Deduction and Verification
This paper uses the intelligent simulation deduction platform to compile the confrontation operation plan of the red and blue sides, and to carry out data collection and verification of the air defense command and control agent. The deduction platform provides a variety of models such as UAVs, cruise missiles, bombers, early warning aircraft, long-range and short-range fire units, and radars. It supports a variety of operations such as aircraft takeoff and landing, flight along designated routes, bombing, missile launching, fire unit firing, and switching radars on and off, and it can carry out countermeasure deduction in real time and evaluate the decision-making level of the agent.

WTA Platform Architecture
As shown in Fig. 8, the training environment and the extrapolation environment are physically divided, the corresponding training environment is constructed in the digital battlefield according to the combat idea, and the training environment and the agent are deployed on the training cloud of the large-scale data center.By training in the learning environment of the training cloud for some time, the agent will have some real-time decision-making ability.Then a corresponding extrapolation system is constructed in the digital battlefield which runs on the extrapolation cloud composed of small-scale server clusters.The agents trained on the training cloud will also be deployed on the same extrapolation cloud, and the countermeasures learned during training will be applied to the extrapolation system.Through intuitive adversarial deduction, the decision-making level of the agent is evaluated, and the defects and deficiencies of the agent are analyzed.The hyperparameters of the neural network in the training environment are adjusted in a targeted manner, and then iteratively trained.

Simulation Environment
The whole simulation deduction process is based on a virtual digital battlefield close to the real battlefield, using a real elevation digital map, which can be configured with the physical constraints and performance indicators of equipment, including the radar detectable area, missile killing area, and kill probability. At the same time, the combat damage of both sides and other confrontation results can be recorded in real time.
The simulation environment is shown in Fig. 9. The confrontation process is divided into two camps, "red and blue". In the combat area, a certain number of blue forces attack the command post and airport of the red side, and the task of the red side is to protect strategic places such as the command post and airfield. The task of the blue side is to destroy strategic points of the red side and attack the exposed fire units of the red side. The red agent receives the battlefield situation in real time and issues decision-making instructions according to that situation to strike incoming blue targets and protect important places. It uses the reward and punishment mechanism to continuously modify the behavior of the decision-making brain and finally enables it to generate correct decision-making instructions for the situation in the environment.

Troop Setting
The main goal of the red side is to plan the use of ammunition rationally and defend the command post and airfield with the minimum interceptor missile resources. The main goal of the blue side is to destroy the command post and airfield of the red side while attacking the exposed fire units. The force settings and performance indicators of the red and blue sides are shown in Table 2. In the initial deployment stage, taking into account the fire connection and overlap of each air defense position while ensuring a certain depth of destruction, some troops are deployed in advance. A total of nine interception units are deployed to protect the command post and airfield: three long-range interception units and two short-range interception units are deployed to defend the command post of the red side, and three long-range interception units and one short-range interception unit are deployed to defend the airfield, as shown in Fig. 9.

Reward Function Setting
The reward signal is the only supervised information in RL, and whether an objective and appropriate reward function can be given is crucial for training an excellent model.The design of the reward function is closely related to the combat mission and directly affects the update of strategy parameters.Due to the large number of units on both sides of red and blue, the state space and action space are correspondingly large.If the neural network only gets feedback according to the reward function after each round of confrontation, it will reduce the exploration efficiency, resulting in each action facing the problem of sparse feedback.That is, the neural network does a lot of "correct" actions, and a small number of "incorrect" actions lead to combat failure, while these "correct" actions are not rewarded, resulting in difficulty in strategy exploration and optimization.Due to the complexity of the air defense combat command decision-making task, the probability of the agent exploring the winning state by itself is very low, so it is necessary to reasonably design the reward function, clarify the key events that trigger the reward, formulate the final indicators of each component of the reward function, and closely associate the trigger mechanism of the reward with the air defense WTA combat process.
Considering that bombers and fighters pose a greater threat to the red side, the interception of high-value blue targets is taken as a key reward and punishment trigger. A one-time periodic reward is given after the first wave of blue side attacks is successfully intercepted; a certain reward value is given when blue targets of different values are successfully intercepted; and a winning bonus is given when the red side wins. The reward function is defined in terms of $i$, $j$, $k$, and $m$, which respectively represent the numbers of intercepted bombers, fighters, cruise missiles, and UAVs: the red side is awarded 8 points, 5 points, 1 point, and 0.5 points for intercepting a bomber, fighter, cruise missile, and UAV, respectively. Since the trigger events that reward the red agent are all goals that the red agent must achieve to win, the reward function can gradually guide the agent to find the direction of learning.
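The interception part of the event-based reward can be sketched directly from the point values given above. The magnitudes of the one-time periodic reward and of the winning bonus are not available here, so the sketch exposes them as hypothetical parameters that default to zero; the function name is also hypothetical.

```python
def event_based_reward(bombers: int, fighters: int, cruise_missiles: int, uavs: int,
                       stage_bonus: float = 0.0, win_bonus: float = 0.0) -> float:
    """Reward from interception events plus optional, unspecified bonuses.

    Point values per intercepted target follow the text: bomber 8, fighter 5,
    cruise missile 1, UAV 0.5. The two bonus arguments are placeholders.
    """
    interception = 8.0 * bombers + 5.0 * fighters + 1.0 * cruise_missiles + 0.5 * uavs
    return interception + stage_bonus + win_bonus
```

For instance, intercepting one bomber, two fighters, three cruise missiles, and four UAVs gives 8 + 10 + 3 + 2 = 23 points before any bonus.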

Antagonism Criterion and Winning Condition Setting
The radar needs to be switched on throughout the guidance.The red side fire control radar will radiate electromagnetic waves, which will be captured by the blue side and then expose the position.If the red fire control radar is destroyed, the red fire unit cannot fight.The interception rate of the anti-aircraft missiles launched by the red fire unit is about 45%-75% in the kill zone, which fluctuates with different types of blue combat units.If the red radar is interfered with, the kill probability will be reduced accordingly.
When the red command post and airfield are both destroyed, or the radar loss exceeds 60%, the red side fails; when the blue side loses more than 50% of its fighters, the blue side fails.
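The winning conditions can be expressed as a small decision function. This is a sketch: the order in which the two failure conditions are checked when both hold at once is an assumption, and the function name, argument names, and returned labels are hypothetical.

```python
def confrontation_outcome(command_post_destroyed: bool, airfield_destroyed: bool,
                          red_radar_loss_ratio: float,
                          blue_fighter_loss_ratio: float) -> str:
    """Return 'red_fails', 'blue_fails', or 'ongoing' for the current situation."""
    # Red fails if both protected sites are destroyed or radar loss exceeds 60%.
    if (command_post_destroyed and airfield_destroyed) or red_radar_loss_ratio > 0.6:
        return "red_fails"
    # Blue fails if it has lost more than 50% of its fighters.
    if blue_fighter_loss_ratio > 0.5:
        return "blue_fails"
    return "ongoing"
```

Loss ratios are expressed as fractions between 0 and 1, so a radar loss of exactly 60% does not yet count as a red failure.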

Ablation Experiment
To compare and analyze the effects of the event-based reward mechanism, GRU, and multi-head attention mechanism on training effectiveness, this section designs an ablation experiment, as shown in Table 3. Fig. 11 shows the comparison of win rate and average reward using different mechanisms. Consistent with the above analysis, when all three mechanisms are used, the agent gains an effective understanding of the characteristics of the air situation, keeps track of the real-time situation, and is rewarded in time; the win rate of the red agent is then the highest, reaching 78.14%. If only one mechanism is used, the win rate of the agent also increases, indicating that the introduction of each mechanism is necessary and critical for improving the win rate.

Comparison of Different Algorithms
Under the neural network architecture proposed in this paper, the improved DDPG algorithm, the Asynchronous Advantage Actor-Critic (A3C) algorithm, the DQN algorithm, and the RELU algorithm are compared in Fig. 12. Here, the RELU algorithm refers to solving the model with an expert rule base; it serves as the traditional baseline against which the agent models are compared.
In horizontal comparison, the reward curve and win rate curve of the rule algorithm are relatively stable and fluctuate only within a very limited range, indicating that the play of the expert agent is stable. However, agents trained with RL algorithms (DDPG, A3C, and DQN) achieve higher win rates and reward values than the rule algorithm, which shows that it is reasonable to use deep reinforcement learning algorithms to solve WTA problems in air defense operations. Because the battlefield environment and the opponent's strategy change rapidly, traditional rules alone struggle to handle complex situations and cannot solve such problems well. A network model trained with a neural network and a DRL algorithm can provide good solutions to such problems and adapts well to complex battlefields.
In longitudinal comparison, under the same network architecture, the improved DDPG algorithm obtains a higher win rate and reward than the A3C and DQN algorithms, indicating that the algorithm proposed in this paper is effective for such problems. It is worth noting that the win rate and reward curves jitter strongly because the scenario contains a large number of uncertainties, which cause the curves to fluctuate overall. As can be seen from Fig. 13, and consistent with the above analysis, the improved DDPG algorithm has a higher win rate and average reward than RELU, A3C, and DQN. This shows that, to a certain extent, the improved DDPG algorithm is better suited to the red-blue confrontation problem in air defense operations.

Analysis of Confrontation Details
In the simulation and deduction on the virtual digital battlefield, the trained red agent uses ammunition more reasonably and tactics emerge, so that it can better complete the task of defending key places. This section reviews and analyzes the data obtained during simulation and summarizes the strategies that the red agent develops during confrontation.
(1) Reasonable planning of ammunition, first-line interception
As shown in Fig. 14, in the initial stage the red agent has almost no strategy, and each firepower unit fires freely whenever the firing conditions are met. The firepower units spend too much ammunition intercepting the same incoming enemy aircraft at the same time, so ammunition consumption in the early stage is excessive, the efficiency-cost ratio is extremely low, and resources are wasted. When the blue side's important and threatening targets approach, the remaining ammunition is extremely limited, the red side has to adopt a very conservative interception strategy, and it finally fails because of insufficient ammunition.
After a period of training, as shown in Fig. 15, the red agent adapts better to the blue side's offensive rhythm, masters certain rules, and plans the use of ammunition correctly. After a blue target enters the kill zone, the firepower units cooperate to complete the interception with the least ammunition, reflecting the effectiveness of the strategy. When the blue side's important and threatening targets approach, the firepower units still hold a large ammunition stock and can flexibly adapt the shooting strategy to complete the defense task with low ammunition consumption. Without training, the red fire units can complete the interception only when the blue target is about to enter the strategic hinterland; after training, the red firepower units detect and intercept targets as early as possible, which further verifies the rationality and effectiveness of the network structure trained by the improved DDPG algorithm.
As shown in Fig. 16, before training the red firepower units fight independently and perform only their own defensive tasks. When friendly neighboring units are in danger, they fail to respond in time, and no tactics or battle methods are evident.
After a period of training, as shown in Fig. 17, the red agent can command the firepower units to carry out coordinated defense. While completing its own defense task, each unit provides timely and appropriate fire support to neighboring units, which greatly relieves their defensive pressure and improves overall defense efficiency. When the long-range firepower units are seriously damaged, they turn off their radar in time and defend themselves in a silent state. The short-range firepower units then react actively: when a blue target enters the ambush circle, they cooperate with the long-range firepower units to destroy it quickly and efficiently. Since the blue side's strategy is not fixed, that is, it is random, the strategy of the trained red agent generalizes to some degree and can be adapted to other battle scenarios.
To address the difficulty that traditional models and algorithms have with the uncertainty and nonlinearity of WTA, this paper constructs a new deep neural network framework for intelligent WTA, analyzes the composition of the network structure in detail, and introduces an event-based reward mechanism, a multi-head attention mechanism, and a GRU. Then, real-time confrontation simulation experiments are carried out on a virtual digital battlefield that is close to the real battlefield, and the improved DDPG algorithm with dual noise and prioritized experience replay is used to solve the problem. The results show that, under the new deep neural network framework, the agent trained by the improved DDPG algorithm has a higher win rate and reward than agents using the A3C, DQN, and RELU algorithms, and that it plans and uses ammunition more reasonably, demonstrating a high decision-making level. The framework proposed in this paper is therefore reasonable to a certain extent.

Figure 2: Exploration strategy based on OU noise and Gaussian noise

Gaussian noise is directly superimposed on the action exploration in the form $\varepsilon \sim N(0, \delta^2)$. In summary, the structural block diagram of the improved DDPG algorithm is shown in Fig. 3. The calculation flow of the DDPG-DNPE algorithm is as follows:

Algorithm 1: DDPG-DNPE algorithm
Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer S
(Continued)
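The dual-noise exploration step, in which an OU noise term and a Gaussian term are both added to the actor output, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the class and function names, the OU parameters (`theta`, `sigma`, `dt`), the default `delta`, and the clipping of actions to `[low, high]` are all assumptions.

```python
import math
import random


class OUNoise:
    """Ornstein-Uhlenbeck process, discretized with an Euler step."""

    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.dim, self.theta, self.sigma = dim, theta, sigma
        self.mu, self.dt = mu, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Start each episode at the long-run mean.
        self.state = [self.mu] * self.dim

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def explore(action, ou_noise, delta=0.1, low=-1.0, high=1.0, rng=random):
    """Return action + OU noise + Gaussian noise, clipped to [low, high].

    `action` stands for the actor output mu(s_t); delta is the standard
    deviation of the Gaussian term. Clipping is an assumption of this sketch.
    """
    n_t = ou_noise.sample()
    noisy = [a + n + rng.gauss(0.0, delta) for a, n in zip(action, n_t)]
    return [min(high, max(low, v)) for v in noisy]
```

The OU term is temporally correlated, so it pushes exploration in a consistent direction over several steps, while the independent Gaussian term adds step-wise perturbation around it.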

Figure 5: Deep neural network framework for WTA

Table 3 :
Design indicates that the mechanism is included; indicates that this mechanism is not included. The results of the ablation experiment are shown in Figs. 10 and 11.

Figure 10: Average reward comparison

The horizontal axis represents the training round, and the vertical axis represents the average reward obtained. Fig. 10 shows that, as the number of training rounds increases, the average reward obtained by the algorithm using the three mechanisms rises, indicating that each of the three proposed mechanisms plays a role, although to different degrees. The plain DDPG algorithm has a low reward that rises extremely slowly, probably because it uses none of the mechanisms, so the training bottleneck is more pronounced. The average reward of DDPG + G is also low, which may be due to the lack of real-time analysis of the battlefield situation, the difficulty of grasping battlefield dynamics in real time, and the delay in rewards, all of which make good training results hard to obtain. The higher rewards obtained by DDPG + E and DDPG + M indicate that the event-based reward mechanism and the multi-head attention mechanism have the greater influence, with the event-based reward mechanism having the more obvious effect. When the three mechanisms are used at the same time, the average reward obtained by the agent rises from 15 to about 80; the gain of 65 amounts to 81.25% of the final value, indicating that the three mechanisms introduced can significantly

Figure 11: Comparison of ablation results

Figure 12: Performance comparison of different algorithms. (a) The mean reward of DDPG + EGM and DQN + EGM; (b) The mean win ratio of DDPG + EGM and DQN + EGM; (c) The mean reward of DDPG + EGM and RELU + EGM; (d) The mean win ratio of DDPG + EGM and RELU + EGM; (e) The mean reward of DDPG + EGM and A3C + EGM; (f) The mean win ratio of DDPG + EGM and A3C + EGM

Figure 13: Comparison of the final reward and win rate for different algorithms

Figure 14: Agent performance before training

Figure 15: Agent performance after training

Figure 16: Agent performance before training

Figure 17: Agent performance after training

Algorithm 1 (continued)
for episode = 1, M do
    Initialize a random process $N_t$ and $\varepsilon$ for action exploration
    Receive initial observation $s_1$
    for t = 1, T do
        Select action $a_t = \mu(s_t; \theta^\mu) + N_t + \varepsilon$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and the new state $s_{t+1}$
        Calculate $\delta_i$ and $D_i$, store transition $(s_t, a_t, r_t, s_{t+1})$ in S
        Calculate $W_i$, sample a minibatch of N transitions $(s_t, a_t, r_t, s_{t+1})$ from S by importance sampling
        Set y
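The storage and sampling steps of Algorithm 1 can be illustrated with a small prioritized replay buffer. The sketch below uses the standard proportional prioritization scheme as a stand-in; it is an assumption that $\delta_i$ is the TD error, $D_i$ the resulting priority, and $W_i$ the importance-sampling weight, and the class name, `alpha`, `beta`, and `eps` are hypothetical.

```python
import random


class PrioritizedReplay:
    """Proportional prioritized replay with importance-sampling weights."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6, seed=None):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.rng = random.Random(seed)
        self.data = []        # stored transitions (s_t, a_t, r_t, s_{t+1})
        self.priorities = []  # assumed D_i = (|delta_i| + eps) ** alpha
        self.pos = 0          # next slot to overwrite once the buffer is full

    def add(self, transition, td_error):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idx = self.rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        size = len(self.data)
        # Assumed W_i = (size * P(i)) ** (-beta), scaled so the largest is 1.
        weights = [(size * probs[i]) ** (-self.beta) for i in idx]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idx, [self.data[i] for i in idx], weights
```

Transitions with a large TD error are drawn more often, and the weights scale down their contribution to the critic loss so that the non-uniform sampling does not bias the update.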

Table 1: Classification and information of the state