Double Deep Q-Network Method for Energy Efficiency and Throughput in a UAV-Assisted Terrestrial Network

Increasing the coverage and capacity of cellular networks by deploying additional base stations is one of the fundamental objectives of fifth-generation (5G) networks. However, the massive densification of connected devices and their simultaneous access demands lead to performance degradation and heavy spectral consumption. To meet these access conditions and improve Quality of Service, resource allocation (RA) should be carefully optimized. Traditionally, RA problems are nonconvex optimizations solved with heuristic methods, such as the genetic algorithm, particle swarm optimization, and simulated annealing. However, these approaches remain computationally expensive and unattractive for dense cellular networks. Artificial intelligence algorithms are therefore used to improve traditional RA mechanisms, and deep learning in particular is a promising tool for addressing resource management problems in wireless communication. In this study, we investigate a double deep Q-network-based RA framework that maximizes energy efficiency (EE) and total network throughput in unmanned aerial vehicle (UAV)-assisted terrestrial networks. Specifically, the system is studied under interference constraints, and the optimization problem is formulated as a mixed-integer nonlinear program. Within this framework, we evaluate the effect of UAV height and the number of UAVs on EE and throughput. Then, in accordance with the experimental results, we compare the proposed algorithm with several artificial intelligence methods. Simulation results indicate that the proposed approach can increase EE while maintaining considerable throughput.


Introduction
In recent years, unmanned aerial vehicle (UAV)-assisted fifth-generation (5G) communication has provided an attractive way to connect users with different devices and improve network capacity. However, data traffic on cellular networks is increasing exponentially; thus, resource allocation (RA) is becoming increasingly critical [1]. Industrial spectrum bands experience increasing demand for channels, leading to spectrum scarcity. In the context of 5G, mmWave is considered a potential solution to meet this demand [2,3]. Moreover, other techniques, such as beamforming, multi-input multi-output (MIMO), and advanced power control, have been introduced as promising solutions in the design of future networks [4]. Despite all these attempts to satisfy this demand, RA remains a priority for accommodating users in terms of Quality of Service (QoS). RA problems are often formulated as nonconvex problems requiring proper management [5,6]. Optimal solutions are obtained by implementing heuristic methods, such as the genetic algorithm, particle swarm optimization, and simulated annealing [7,8]. However, such methods end up with quasioptimal solutions and converge relatively slowly. Therefore, alternative, flexible algorithms that exploit recent developments in artificial intelligence are desirable. Recently, deep learning (DL) [9] has emerged as an effective tool to increase flexibility and optimize RA in complex wireless communication networks. First, DL-based RA is flexible because the same deep neural network (DNN) can be implemented to achieve different design objectives by modifying the loss function [10]. Second, the computation time required by DL to obtain RA results is lower than that of conventional algorithms [11]. Finally, DL can receive complex high-dimensional information as input and allocate the optimal action for each input statistic in a particular condition [10].
On the basis of the above analysis, DL can be chosen as an accurate method for RA.

Related Works
As an emerging technology, DL has been used in several research studies to improve RA for terrestrial networks. For instance, the authors in [12] investigated a deep reinforcement learning (DRL)-based time division duplex configuration to allocate radio resources dynamically, online, and under high mobility. In [1], Lee et al. proposed deep power control based on a convolutional neural network to maximize spectral efficiency (SE) and energy efficiency (EE); that study compared the DL model with a conventional weighted minimum mean square error scheme. In the same context, [13] performed max-min and max-prod power allocation in downlink massive MIMO. To maximize EE, a deep artificial neural network scheme was applied in [14], where interference and system propagation channels were considered. Deep Q-learning (DQL)-based RA has also attracted much attention in the recent literature. In [15], the authors studied the RA problem to enhance EE; the proposed method formulated a combined optimization problem considering EE and QoS. More recently, a supervised DL approach in 5G multitier networks was adopted in [16] to solve the joint RA and remote-radio-head association problem. For this model, efficient subchannel and power allocation were used to generate training data. Concerning decentralized RA mechanisms, the authors in [17] developed a novel decentralized DL scheme for vehicle-to-vehicle communications. The main objective was to determine the optimal sub-band and power level for transmission without requiring or waiting for global information. The authors used DRL-based power control to investigate the problem of spectrum sharing in a cognitive radio system, in which the secondary user shares the common spectrum with the primary user. Instead of unsupervised learning, the authors in [18] introduced supervised learning to maximize the throughput of device-to-device communication under a maximum power constraint.
The authors in [19] presented a comprehensive approach that considered DRL to maximize the total network throughput; however, this work did not include EE in the optimization. The majority of the learning algorithms introduced above do not incorporate constraints directly into the training cost functions. Recent literature has focused on RA in UAV-assisted cellular networks based on artificial intelligence.

Contribution
Existing research on RA in UAV-assisted 5G networks focuses on single-objective optimization and considers the DQN algorithm to generate data. Following the preceding analysis, we investigate the RA problem in UAV-assisted cellular networks to maximize EE and total network throughput. Specifically, a double deep Q-network (DDQN) is proposed to address intelligent RA. The main contributions of this study are listed below.
(1) We formulate EE and total throughput maximization in an mmWave scenario while ensuring the minimum QoS requirements for all users according to the environment. The resulting optimization problem is a mixed-integer nonlinear program. Multiple constraints, such as the path-loss model, number of users, channel gains, beamforming, and signal-to-interference-plus-noise ratio (SINR) requirements, are used to describe the environment.
(2) We investigate a multiagent DDQN algorithm to optimize EE and total throughput. We assume that each user equipment (UE) behaves as an agent and makes optimization decisions based on environmental information.
(3) We compare the performance of the proposed algorithm with the previously proposed QL and DQN approaches in terms of RA.
The remainder of this paper is organized as follows: An overview for DRL is presented in Section 2, and the system model is introduced in Section 3. Then, the DDQN algorithm is discussed in Section 4, followed by simulation and results in Section 5. Lastly, conclusions and perspectives are drawn in Section 6.

Overview of DRL
DRL is a prominent branch of machine learning and thus a class of artificial intelligence. It allows an agent to learn optimal behavior from its own experience rather than depending on a supervisor [27]. In this approach, a neural network serves as an agent that learns by interacting with the environment and solves the task by determining an optimal action. Compared with standard ML, namely supervised and unsupervised learning [28], DRL does not depend on prior data acquisition. Instead, decision making is sequential, and the next input depends on the decision of the learner or system. Moreover, in DRL, the Markov decision process (MDP) provides the mathematical formalism for modeling such decision-making situations. The reinforcement learning process operates as follows [29]: the agent begins in an initial state s_0 ∈ S of its environment, obtains an initial observation ω_0 ∈ Ω, and takes an action a_t ∈ A at each time step t. As illustrated in Fig. 1, DRL algorithms can be categorized into three families: value-based, policy-gradient, and model-based methods. In value-based DRL, the agent uses the learned value function to evaluate (s, a) pairs and derive a policy [30]; DQL is the most popular and efficient algorithm in this category. By contrast, a policy-based algorithm is more intuitive: it learns a policy π directly, which is sensible because the policy function π takes a state s as input and generates an action a ∼ π(s).
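The agent-environment interaction described above can be illustrated with a minimal sketch. The toy environment, its dynamics, and the fixed random policy are purely illustrative stand-ins, not the paper's network model:

```python
import random

def environment_step(state, action):
    """Dummy MDP transition (assumed dynamics): returns next state and reward."""
    next_state = (state + action) % 4          # a toy chain of 4 states
    reward = 1.0 if next_state == 0 else 0.0   # reward for reaching state 0
    return next_state, reward

def policy(state):
    """A fixed toy policy pi(s): pick an action uniformly at random."""
    return random.choice([0, 1])

s = 0                                          # initial state s_0
trajectory = []
for t in range(5):
    a = policy(s)                              # action a_t ~ pi(s_t)
    s_next, r = environment_step(s, a)
    trajectory.append((s, a, r, s_next))       # the tuple (s_t, a_t, r_t, s_{t+1})
    s = s_next
```

Each stored tuple has exactly the (s_t, a_t, r_t, s_{t+1}) form used throughout the rest of this paper.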

Q-Learning
As a popular branch of machine learning, Q-learning is built on the action-value function q_π(s, a) for a policy π. It uses the Bellman equation to learn the optimal Q-values iteratively [30]:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],

where α_t is the step-size parameter that defines the extent to which the new data contribute to the existing Q-value, γ is the MDP discount factor, r_{t+1} is the numerical reward received by the agent after executing the action, and s_{t+1} is the new state reached by the environment, with transition probability p(s′, r | s, a), as illustrated in Fig. 2. However, the Q-learning algorithm can be applied only to RA problems with low-dimensional state and action spaces, which limits its scalability [31]. Moreover, it is applicable only when the state and action spaces are discrete (e.g., channel access) [32].
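The tabular update above can be written compactly in code. This is a minimal sketch of the standard Q-learning step on a toy table; the state/action sizes and the sample transition are illustrative:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped Bellman target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q

Q = np.zeros((4, 2))                            # toy table: 4 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)      # Q[0,1] becomes 0.1
```

Starting from a zero-initialized table, the target is r + γ·0 = 1.0, so a single update moves Q(0, 1) to α · 1.0 = 0.1.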

Deep Q-Network
As stated above, the Q-learning algorithm struggles to obtain the optimal policy when the action and state spaces become exceptionally large [33]. This situation is common in the RA approaches of cellular networks. To solve this problem, the DQN algorithm, which combines traditional Q-learning with a convolutional neural network, was proposed [34]. The main difference from Q-learning is the replacement of the Q-table with a function approximator, namely a DNN, which approximates the Q-values. Approximators can be linear or nonlinear functions [35]. With a nonlinear DNN, the new Q-function is defined as Q(s_t, a_t; ω) ≈ Q*(s, a), where ω represents the weights of the neural network. At each time t, action a_t is taken in accordance with the ε-greedy policy, and the transition tuple (s_t, a_t, r_t, s_{t+1}) is stored in a replay memory denoted by D. During training, a minibatch is sampled randomly from the experience memory D to minimize the mean squared error. In addition, a target Q-network, whose weights ω⁻ are periodically adjusted to follow those of the principal Q-network, is used to improve the stability of DQN. On the basis of the Bellman equation, the optimal state-action value function is given by [35]

Q*(s, a) = E[r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a].

To train the DQN, the weights ω are updated iteratively to minimize the mean squared error of the Bellman equation. Mathematically, the loss function at each iteration is

L(ω) = E[(r_{t+1} + γ max_a Q(s_{t+1}, a; ω⁻) − Q(s_t, a_t; ω))²].

Figure 1: DRL algorithms
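The replay memory, target network, and loss above can be sketched as follows. A linear map stands in for the DNN approximator, and the dummy transitions, dimensions, and batch size are illustrative assumptions, not the paper's configuration:

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
D = deque(maxlen=10_000)                      # replay memory D

def q_values(omega, state):
    """Stand-in for the DNN: a linear approximator Q(s, .; omega)."""
    return omega @ state                      # one Q-value per action

def td_targets(batch, omega_minus, gamma=0.99):
    """y = r + gamma * max_a Q(s', a; omega_minus), computed with target weights."""
    return np.array([r + gamma * np.max(q_values(omega_minus, s_next))
                     for (_, _, r, s_next) in batch])

# Fill the memory with dummy transitions (state dim 3, 2 actions).
for _ in range(64):
    s, s2 = rng.normal(size=3), rng.normal(size=3)
    D.append((s, int(rng.integers(2)), float(rng.normal()), s2))

omega = rng.normal(size=(2, 3))               # online network weights
omega_minus = omega.copy()                    # target network: periodic copy
batch = random.sample(D, 32)                  # uniform minibatch from D
y = td_targets(batch, omega_minus)
predictions = np.array([q_values(omega, s)[a] for (s, a, _, _) in batch])
loss = np.mean((y - predictions) ** 2)        # mean squared Bellman error L(omega)
```

In a full implementation, `loss` would be minimized by gradient descent on `omega`, with `omega_minus` refreshed from `omega` every few hundred steps.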

System Model and Problem Formulation
In the proposed model, we consider the downlink of a UAV-assisted cellular network comprising a set of small base stations (SBSs) denoted as S = {SBS_1, SBS_2, …, SBS_M} and a set of UAVs defined as U = {UAV_1, UAV_2, …, UAV_u}. The UAVs are placed at a particular altitude H, with H_min ≤ H ≤ H_max assumed constant for all UAVs. Each cell operates in an mm-wave band and serves N users distributed randomly in a dense area. We assume that each user is assigned to the single base station that provides the strongest signal. In this work, the SBSs and UEs are assumed to be equipped with omnidirectional antennas, i.e., antennas with unit gain, whereas every UAV is equipped with a directional antenna [36]. Moreover, each UE associated with a UAV is assigned orthogonal resource blocks (RBs; an RB consists of 12 subcarriers with a total bandwidth of 180 kHz in the frequency domain and one 0.5 ms time slot in the time domain), whereas UEs associated with SBSs share the remaining RBs [37]. The transmission powers allocated by the UAVs and SBSs are denoted by P_UAV and P_SBS, respectively. Furthermore, the link between a BS ∈ UAVs ∪ SBSs and a user can be in one of two conditions: line-of-sight (LoS) or non-line-of-sight (NLoS). As illustrated in Fig. 3, interference from adjacent base stations is considered. Table 1 summarizes the notations used in this article.

Fading and Achievable Data Rate
The channel between a base station and a UE can be fixed or time varying. Fading is the fluctuation in received signal strength over time; it occurs due to several factors, including transmitter and receiver movement, the propagation environment, and atmospheric conditions. Similar to [38], we model the channel so that it captures both small-scale and large-scale fading. At each time slot t, the small-scale fading between UAVs, SBSs, and UEs is considered frequency-selective because the delay spread exceeds the symbol period; by contrast, the channel within every subcarrier is assumed to be flat fading, so the per-subcarrier channel gains can remain unchanged. All UEs periodically transmit their channel quality information to the associated BS. In addition, let h_{BS,k,n} designate the channel gain from the BS to user k on subcarrier n. A binary variable u is introduced to define the association mode: if the UE is associated with the UAV/SBS through an LoS link, then u = 0; otherwise, u = 1. We apply the following assumption in the formulation: the mm-wave signal is affected by various factors, such as buildings in urban areas, making the link susceptible to blockage effects. Thus, the downlink achievable throughput (data rate) of user k on the n-th subcarrier is given by

R_{BS,k,n} = P_{link,UAV} R_{UAV,k,n} + P_{link,SBS} R_{SBS,k,n},   (4)

where P_{LoS,UAV} and P_{LoS,SBS} are the blockage probabilities when the link between the UAV/SBS and the UE is LoS; they are expressed as in [39,40], where b and c are constants that depend on the network environment, and z is the Euclidean distance between the typical UE and the UAV (see Fig. 4). For the SBS link, β is the blockage parameter that defines the average size of obstacles, and d corresponds to the distance between the SBS and the UE.
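The rate of Eq. (4) is a probability-weighted sum, which can be sketched numerically. Note the assumptions: the exponential form exp(−β·d) for the SBS LoS probability is the common blockage model from the literature, not necessarily the paper's exact expression, and all numbers are illustrative:

```python
import math

def p_los_sbs(d, beta=0.01):
    """Common exponential blockage model (assumed form): LoS probability exp(-beta*d)."""
    return math.exp(-beta * d)        # beta: average obstacle-size parameter

def weighted_rate(p_link_uav, r_uav, p_link_sbs, r_sbs):
    """Eq. (4): R = P_link,UAV * R_UAV + P_link,SBS * R_SBS."""
    return p_link_uav * r_uav + p_link_sbs * r_sbs

# Illustrative numbers: per-link rates in bit/s, SBS-UE distance 50 m.
r = weighted_rate(0.8, 100e6, p_los_sbs(50.0), 60e6)
```

As expected from the model, the LoS probability is 1 at d = 0 and decays with distance, so denser obstacles (larger β) shrink the SBS contribution to the total rate.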

SINR and Path Loss Model
Adding gain to the system remains necessary because of the propagation losses that occur at mm-wave frequencies. One of the main solutions proposed by several studies for future wireless networks is beamforming [41], whose fundamental principle is to steer the wavefront toward the UE. Following [42], the UAV and SBS serve the UE through beamforming technology. In this manner, the SINR of a UE served by a UAV at time slot t can be written as

SINR^t_{UAV,k,n} = P^t_UAV h^t_{UAV,k,n} G^t_UAV / (I_SBS + I_{UAV′} + σ²),

where G^t_UAV represents the directional beamforming gain of the desired link, and σ² refers to the additive white Gaussian noise power. I_SBS and I_{UAV′} represent the interference from the adjacent SBSs and UAVs, respectively. Without loss of generality, the tiers exhibit different propagation properties. For air-to-ground communication, the path loss of the LoS and NLoS links at time slot t depends on the additional path losses u^t_LoS and u^t_NLoS and the path-loss exponents a^t_LoS and a^t_NLoS. Similarly, we define the SINR when the UE is associated with an SBS. In this case, we adopt the standard power-law path-loss model with means d^t_LoS and d^t_NLoS for the LoS and NLoS links, respectively. The SINR at the typical UE connected to an SBS is given by (13), where I_UAV = P^t_UAV h^t_{UAV,k,n} G^t_SBS and I_{SBS′} = P^t_{SBS′} h^t_{SBS′,k,n} G^t_{SBS′} are the interferences from the UAV and the adjacent SBSs, respectively.
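The beamformed SINR above has the form P·h·G over interference plus noise, which can be sketched numerically. The channel gain, beamforming gain, and interference values below are illustrative assumptions:

```python
def dbm_to_watt(dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** ((dbm - 30) / 10)

def sinr(p_tx, h, g, interference, sigma2):
    """SINR = P * h * G / (I + sigma^2), all quantities in linear scale."""
    return p_tx * h * g / (interference + sigma2)

p_uav = dbm_to_watt(30)          # 30 dBm transmit power -> 1 W
sigma2 = dbm_to_watt(-114)       # noise power, matching the simulation setting
i_total = 1e-12                  # aggregate interference I_SBS + I_UAV' (assumed)
snr_uav = sinr(p_uav, h=1e-9, g=10.0, interference=i_total, sigma2=sigma2)
```

Working in linear units and converting from dBm only at the boundaries avoids the classic mistake of adding dB quantities inside the SINR denominator.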

Spectral Efficiency and Energy Efficiency
SE and EE are key metrics for evaluating any wireless communication system. SE captures how efficiently a given channel bandwidth is used: it is the transmission rate per unit of bandwidth and is measured in bits per second per hertz. The EE metric evaluates the total energy consumption of a network and is defined as the ratio of the total number of transferred bits to the total power consumption. EE and SE are thus fundamentally related. Let P_C be the power consumed in the transmitter circuit; then, the EE of transmitter i ∈ {UAV, SBS} can be written as

EE_i = B · SE_i / (P_i + P_C),

where B is the bandwidth, P_i is the transmit power, with 0 < P_i ≤ P_max, and the achievable SE_i of the transmitter is computed as SE_i = log2(1 + SINR_i), using the UAV SINR for the UAV tier and the SBS SINR for the SBS tier.
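The SE/EE relationship above translates directly into code. This is a minimal sketch assuming the Shannon form SE = log2(1 + SINR); bandwidth, SINR, and power values are illustrative:

```python
import math

def spectral_efficiency(sinr):
    """SE = log2(1 + SINR), in bit/s/Hz."""
    return math.log2(1.0 + sinr)

def energy_efficiency(bandwidth_hz, sinr, p_tx, p_circuit):
    """EE = B * SE / (P_i + P_C), in bit/Joule, with 0 < P_i."""
    assert p_tx > 0.0, "transmit power must be positive"
    return bandwidth_hz * spectral_efficiency(sinr) / (p_tx + p_circuit)

# Illustrative: one 180 kHz RB, SINR of 15 (linear), 0.2 W transmit, 10 W circuit.
ee = energy_efficiency(bandwidth_hz=180e3, sinr=15.0, p_tx=0.2, p_circuit=10.0)
```

The formula makes the EE/SE trade-off visible: raising P_i raises the SINR (and hence SE) only logarithmically while the denominator grows linearly, which is why EE eventually falls as transmit power increases.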

Objective Formulation
The proper design of EE approaches is of paramount importance in UAV-assisted terrestrial networks because performance is directly tied to the choice of objectives and constraints in the underlying optimization problems. In this work, we optimize two specific RA objectives, namely, the maximization of EE and of throughput. From the SE perspective, the EE maximization problem is formulated subject to the following constraints. Constraint C1 means that the transmit powers P_UAV and P_SBS must lie in the interval [0, P_max]; it specifies the upper limit of the power transmission. Constraint C2 indicates that the UAV must be positioned between a minimum and a maximum height: at greater heights, the distance between the UAV and the UE increases, resulting in considerable path loss, whereas when the UAV is located at the minimum height, NLoS conditions arise and may affect EE; hence, this constraint must be studied. Constraint C3 guarantees that the EE of the UAV exceeds that of the SBS. In C4, R_{BS,k,n} defines the maximum downlink achievable data rate, whereas R^QoS_{BS,k,n} accounts for the data rate requirement. Constraints C5 and C6 specify that the SINR of the UE must exceed a certain threshold, which differs for each tier (UAV, SBS). Lastly, the final constraint ensures that each UE is connected to a single BS. Our second objective is to maximize the total network throughput, where the overall throughput R_{BS,k,n} is defined as the sum of the data rates delivered to all UEs; the corresponding maximization problem shares the structure above. Here, constraint C4 indicates the minimum data rate required for QoS, and constraint C5 means that the LoS probability of the SBS must be less than that of the UAV.

Double Deep Q-Network Algorithm
In this section, we present a DRL-based EE and throughput RA framework to address problems (17) and (18). The task of the DRL agent is to learn an optimal policy mapping states to actions, thus maximizing the utility function. We formulate the optimization problem as a fully observable Markov decision process and, as in the literature, consider a tuple (s_t, a_t, r_t, s_{t+1}). Based on the transition probability p(s_{t+1} | s_t, a_t), the network moves from the current state s_t to a new state according to the action a_t selected by the agent at time slot t. A DDQN is applied to obtain an optimal solution, where each UAV and SBS acts as an agent that continuously interacts with the environment to optimize the policy. First, agent j observes the state s^j_t and takes an action a^j_t in accordance with the current policy. Then, at each time step, the agent receives a reward r^j_t conditioned on the action and moves to the next state s^j_{t+1}. This procedure describes the DQN algorithm with a single agent. Its major drawback lies in using the same values both to select and to evaluate actions, leading to overestimation of action values and unstable training. To solve this overestimation, van Hasselt et al. proposed the DDQN architecture, in which the max operator is decomposed into action selection and action evaluation, as illustrated in Fig. 5. The fundamental concept of the algorithm is to change the target Y^DQN_t = r_{t+1} + γ max_a Q(s_{t+1}, a; ω′_t) into

Y^DDQN_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; ω_t); ω′_t).   (19)

At each time t, the weights ω_t of the online network are used to evaluate the greedy policy, whereas the weights ω′_t estimate its value. For improved performance, the target network of DDQN can use the parameters of the previous iteration (t − 1); therefore, the target-network parameters are periodically updated with copies of the online network.
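The difference between the DQN target and the DDQN target of Eq. (19) can be shown side by side. Linear maps stand in for the DNNs, and the weights and states are random illustrative values:

```python
import numpy as np

def q(w, s):
    """Stand-in Q-network: one value per action."""
    return w @ s

def dqn_target(r, s_next, w_prime, gamma=0.99):
    """DQN: one set of weights both selects and evaluates the greedy action."""
    return r + gamma * np.max(q(w_prime, s_next))

def ddqn_target(r, s_next, w, w_prime, gamma=0.99):
    """DDQN (Eq. 19): online weights select, target weights evaluate."""
    a_star = int(np.argmax(q(w, s_next)))          # selection: online network
    return r + gamma * q(w_prime, s_next)[a_star]  # evaluation: target network

rng = np.random.default_rng(1)
s_next = rng.normal(size=3)
w, w_prime = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
y_dqn = dqn_target(1.0, s_next, w_prime)
y_ddqn = ddqn_target(1.0, s_next, w, w_prime)
```

By construction, the DDQN target never exceeds the DQN target computed from the same target weights, which is exactly the mechanism that curbs overestimation.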

State and Observation
The state describes a specific configuration of the environment. At time slot t, the UAVs and SBSs act as agents, each with an observation space O^j_t. The observation of each BS ∈ UAVs ∪ SBSs includes the SINR measurements from the UAV and SBS to the UE, the UAV height H, and the spectral efficiency. The global state is defined as the collection of all agent observations O^BS_t.

Action
In our problem, each agent must choose an appropriate base station (i.e., UAV or SBS), transmission power, UAV height, and LoS/NLoS link probability; at time step t, the action a^j_t of the UAV/SBS collects these decisions. The training procedure is summarized as follows.

for each episode do
  for each time step t do
    for agent j = 1 to M + u do
      The agent (UAV or SBS) selects an action a^j_t according to the ε-greedy policy.
      Obtain the immediate reward r^j_t, and observe the next state s^j_{t+1}; record the resulting EE and throughput.
      Store the transition (s^j_t, a^j_t, r_t, s^j_{t+1}) in D.
      Randomly sample a minibatch of transitions from D.
      Calculate the target Q-value Y^DDQN_t with Eq. (19).
      Train the main network by applying gradient descent.
    end for
  end for
end for
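The loop above can be turned into a runnable skeleton. The environment here is a dummy stand-in for the UAV/SBS network, linear maps replace the DNNs, and the agent count, dimensions, and hyperparameters are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np

N_AGENTS, STATE_DIM, N_ACTIONS = 3, 4, 5        # illustrative sizes (M + u = 3)
rng = np.random.default_rng(0)
D = deque(maxlen=5000)                          # shared replay memory
w = [rng.normal(size=(N_ACTIONS, STATE_DIM)) for _ in range(N_AGENTS)]  # online nets
w_t = [m.copy() for m in w]                     # target nets (periodic copies)

def step_env(state, actions):
    """Dummy environment: random next state, toy reward (assumed dynamics)."""
    return rng.normal(size=STATE_DIM), float(-np.sum(actions))

eps, gamma, lr = 0.1, 0.99, 1e-3
s = rng.normal(size=STATE_DIM)
for t in range(200):
    # Each agent j selects its action via the epsilon-greedy policy.
    acts = [int(rng.integers(N_ACTIONS)) if rng.random() < eps
            else int(np.argmax(w[j] @ s)) for j in range(N_AGENTS)]
    s2, r = step_env(s, acts)
    D.append((s, acts, r, s2))                  # store the transition in D
    if len(D) >= 32:
        for j in range(N_AGENTS):
            for (bs, ba, br, bs2) in random.sample(D, 32):   # minibatch from D
                a_star = int(np.argmax(w[j] @ bs2))          # select: online net
                y = br + gamma * (w_t[j] @ bs2)[a_star]      # evaluate: target (Eq. 19)
                err = y - (w[j] @ bs)[ba[j]]
                w[j][ba[j]] += lr * err * bs                 # gradient step
    if t % 50 == 0:
        w_t = [m.copy() for m in w]             # periodic target-network update
    s = s2
```

Swapping the linear maps for neural networks and `step_env` for the actual SINR/EE environment recovers the full multiagent DDQN trainer.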

Simulation Results
This section discusses the simulation results for EE and throughput in a downlink UAV-assisted terrestrial network comprising eight SBSs with a radius of 500 m and five UAVs deployed randomly in the area. Each cell contains 20 randomly distributed users and uses mm-wave bands. We assume a maximum transmission power of P_max,SBS = 23 dBm for the SBSs, and different values of P_max,UAV are examined in the simulations. The path-loss exponents of the LoS and NLoS links for the UAV and SBS are a^UAV_LoS = 3, a^UAV_NLoS = 3.5, a^SBS_LoS = 2, and a^SBS_NLoS = 4. In addition, the power consumed in the transmitter circuit is P_C = 40 dBm, and the additive white Gaussian noise power is σ² = −114 dBm. In the DDQN algorithm, the DNN of each agent is a four-layer fully connected neural network with two hidden layers of 64 and 32 neurons. Other simulation and DDQN parameters are listed in Table 2. The simulation is realized in MATLAB (R2017a) running on a Dell PC (Intel Core i7-7600U @ 2.8 GHz, 16 GB RAM). In our simulations, we set w_1 = 0.6 and w_2 = 0.4.

Energy Efficiency Analysis
In this subsection, we present the EE results obtained with DDQN. For validation, we compare the proposed algorithm with the DQN and QL architectures, and we discuss the effects of UE demand, the number of UAVs, beamforming, and the maximum power P_max,UAV. In the evaluation, the parameter values in Table 2 are used unless otherwise specified. First, we evaluate the effect of UE demand on EE for the different algorithms in Fig. 6. A common observation in Fig. 6 is that increasing UE demand initially increases EE; however, beyond 60 Mbps, EE converges more slowly. This behavior arises because when UE demand increases considerably (above 60 Mbps), all algorithms (DDQN, DQN, and QL) aim to maximize network throughput, which requires high transmission power and therefore reduces EE. Another observation from Fig. 6 is that the DDQN algorithm outperforms DQN and QL. This outcome is achieved because the agent selects a more appropriate Q-value to estimate each action, an improvement due mainly to the two separate estimators used in DDQN; in other words, using the second estimator is a cost-effective way to obtain unbiased Q-values. One solution to the EE degradation under increasing UE demand is to add base stations. The number of UAVs has a remarkable effect on EE, as illustrated in Fig. 7: as the number of UAVs increases, EE improves because more users are covered by UAV LoS links. Fig. 7 also demonstrates that the DDQN algorithm outperforms DQN and QL by 13.3% in EE, because traditional RL algorithms use a single actor network to train multiple agents, causing conflicts between agents. Next, EE is plotted as a function of the number of UEs for different UAV heights (the H_max constraint) in Fig. 8. An increase in the number of UEs degrades EE because of the increase in energy consumption. Fig. 8 also shows that UAV height affects EE: EE increases with H_max because a greater UAV height places additional UEs in LoS conditions, increasing the total number of transmitted bits.
As the number of UEs increases, the power assigned to each UE declines; the increase in height compensates for this shortcoming. Fig. 9 shows EE vs. the maximum UAV power P_max,UAV with and without beamforming. A common observation in Fig. 9 is that EE decreases as the maximum transmission power of the UAV is extended, owing to the increased energy consumption. In addition, when the power of the UAVs increases, more UAV-UE links fall into NLoS conditions, further reducing EE. This analysis is conducted with and without beamforming: as illustrated in Fig. 9, applying beamforming improves EE for each algorithm (DDQN and DQN) because beamforming provides additional gain and can overcome mm-wave blockage constraints.

Throughput Analysis
To validate the accuracy of our approach, we analyze the total throughput (the second objective) as a function of the number of deployed UAVs, the UAV height H_max, and beamforming. In the first scenario, Fig. 10 depicts the total throughput versus the number of UAVs. As the number of UAVs increases, the total throughput improves, and DDQN outperforms DQN and QL; this effect is due mainly to the increase in LoS links. The same figure also shows that the total throughput saturates beyond a particular number of UAVs because of the rise in interference between UAVs. Fig. 11 illustrates the variation of throughput vs. UAV height for the different AI algorithms. According to Fig. 11, throughput increases with altitude H_max because at low altitude the propagation conditions are NLoS and inter-tier interference is observed, whereas at greater UAV heights, LoS conditions occur, reducing losses. Moreover, saturation appears above an altitude of 130 m because further increases in UAV height enlarge the UAV-UE distance, leading to signal attenuation. Fig. 12 shows the variation of throughput vs. maximum UAV power. As expected, the total throughput increases with P_max,UAV. Fig. 12 also reveals that DDQN achieves a maximum throughput of 582.7 Mbps at P_max,UAV = 35 dBm, whereas DQN achieves a maximum throughput of 269.234 Mbps at the same power; again, the proposed DDQN algorithm outperforms DQN. Finally, Fig. 13 plots the throughput as a function of the blockage parameter β for the SBSs, with the UAVs located at H_max = 120 m. When β increases, the total throughput of the network decreases because, with higher obstacle density, more UEs are served under NLoS conditions. In addition, Fig. 13 shows that the proposed DDQN scheme converges to highly satisfactory solutions compared with the other approaches because it handles interference effectively.

Conclusion
In this study, we proposed a DDQN scheme for RA optimization in UAV-assisted terrestrial networks, with the problem formulated as EE and throughput maximization. We first provided a general overview of deep reinforcement learning architectures and then presented the network architecture, in which the base stations use beamforming during transmission. The proposed EE and throughput objectives were assessed with respect to the number of UAVs, beamforming, the maximum UAV transmission power, and the blockage parameter. The accuracy of the obtained EE and throughput was demonstrated by comparison with the deep Q-network and Q-learning. Our results indicate that EE is affected by the number of UAVs deployed in the coverage area as well as by the maximum-altitude constraint. Moreover, beamforming can be a cost-effective means of improving EE. Our investigation also revealed other useful conclusions: for the throughput analysis, the blockage parameter has a dominant influence, and an optimal value can be selected; in terms of convergence, our DDQN consistently outperforms DQN and QL. In future work, other issues can be explored and investigated. For instance, UAV mobility can be considered, and an optimal mobility model can be selected to maximize throughput. Interference coordination between tiers may also be introduced.