<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">51217</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.051217</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>QoS Routing Optimization Based on Deep Reinforcement Learning in SDN</article-title>
<alt-title alt-title-type="left-running-head">QoS Routing Optimization based on Deep Reinforcement Learning in SDN</alt-title>
<alt-title alt-title-type="right-running-head">QoS Routing Optimization based on Deep Reinforcement Learning in SDN</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Song</surname><given-names>Yu</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Qian</surname><given-names>Xusheng</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Nan</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Wei</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Xiong</surname><given-names>Ao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>xiongao@bupt.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications</institution>, <addr-line>Beijing, 100876</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Marketing Service Center, State Grid Jiangsu Electric Power Co., Ltd.</institution>, <addr-line>Nanjing, 220000</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Customer Service Center, State Grid Co., Ltd.</institution>, <addr-line>Nanjing, 211161</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Ao Xiong. Email: <email>xiongao@bupt.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>15</day>
<month>5</month>
<year>2024</year></pub-date>
<volume>79</volume>
<issue>2</issue>
<fpage>3007</fpage>
<lpage>3021</lpage>
<history>
<date date-type="received">
<day>29</day>
<month>2</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>4</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Song et al.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Song et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_51217.pdf"></self-uri>
<abstract>
<p>To enhance the efficiency and expediency of issuing e-licenses within the power sector, the surging demand for data traffic must be managed. In this realm, the network imposes stringent Quality of Service (QoS) requirements, revealing the inadequacy of traditional routing allocation mechanisms in accommodating such extensive data flows. To handle a substantial influx of data requests promptly and to alleviate the constraints of existing technologies and network congestion, we present an architecture for QoS routing optimization within Software Defined Networking (SDN), leveraging deep reinforcement learning. This approach separates SDN control and transmission functionalities, centralizing control over data forwarding while integrating deep reinforcement learning for informed routing decisions. Factoring in delay, bandwidth, jitter rate, and packet loss rate, we design a reward function to guide the Deep Deterministic Policy Gradient (DDPG) algorithm in learning the optimal routing strategy and furnishing superior QoS provision. In our empirical investigations, we compare the performance of Deep Reinforcement Learning (DRL) against Shortest Path (SP) algorithms in terms of data packet transmission delay. The simulation results show that the proposed algorithm significantly reduces network delay and improves overall transmission efficiency, outperforming traditional methods.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Deep reinforcement learning</kwd>
<kwd>SDN</kwd>
<kwd>route optimization</kwd>
<kwd>QoS</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>State Grid Corporation of China</funding-source>
<award-id>5700-202353318A-1-1-ZN</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>For power grid enterprises to effectively conduct power engineering infrastructure, operations and maintenance, marketing, and other business activities, seamless cross-departmental interaction is essential to ensure reliable data sharing. This is achieved by issuing electronic licenses for data storage and sharing, thereby guaranteeing trusted data sharing between departments. However, in order to achieve efficient and rapid electronic license issuance in power business scenarios, higher network Quality of Service (QoS) [<xref ref-type="bibr" rid="ref-1">1</xref>] is imperative. This entails satisfying users&#x2019; requirements concerning the delay, throughput, jitter rate, and packet loss rate of the network. When dealing with large-scale data transmission and traffic, ensuring the stability of network services is crucial to prevent network paralysis caused by congestion. Traditional network routing schemes typically rely on the shortest path algorithm for calculation, which has proven insufficient to meet the demands of current network traffic with extensive resource requirements. These traditional approaches often suffer from slow convergence speeds and are prone to network congestion [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<p>Software Defined Networking (SDN) employs the separation of transfer and control by utilizing the control layer&#x2019;s development interface to offer services to the upper network layer, while ensuring unified control of the data forwarding layer for efficient traffic transmission [<xref ref-type="bibr" rid="ref-4">4</xref>]. Implementing the SDN architecture can significantly enhance network performance and utilization, facilitating comprehensive management of the network environment [<xref ref-type="bibr" rid="ref-5">5</xref>]. The versatile features of SDN find applications in diverse scenarios such as edge computing [<xref ref-type="bibr" rid="ref-6">6</xref>], routing optimization, and others. Today&#x2019;s SDN routing mainly uses Dijkstra&#x2019;s fixed exploration strategy algorithm [<xref ref-type="bibr" rid="ref-7">7</xref>], which considers only the shortest-path problem and ignores other network metrics; this limits the exploration of the network environment and imposes many limitations in complex networks. Reinforcement Learning (RL) algorithms search for the optimal solution through continuous exploration and exploitation, mainly solving resource-scheduling decision problems [<xref ref-type="bibr" rid="ref-8">8</xref>]. Through the design of reward functions and state-action spaces, the dynamics of state transitions are determined. RL can be combined with routing optimization to continuously explore, with QoS as the index, and find the optimal forwarding path. 
Deep Reinforcement Learning (DRL) adds deep learning to the agent in reinforcement learning, using the perceptual ability of deep learning to handle complex network environments and routing decisions more effectively and thereby achieve better routing strategies. DRL is mainly used in route design and resource management, with applications in games [<xref ref-type="bibr" rid="ref-9">9</xref>], video [<xref ref-type="bibr" rid="ref-10">10</xref>], and dynamic resource allocation [<xref ref-type="bibr" rid="ref-11">11</xref>]. Deep Deterministic Policy Gradient (DDPG) based routing algorithms are commonly employed to solve decision problems in continuous action spaces. When facing unknown and complex network conditions, DDPG can adjust its policy through learning to adapt to the new environment; when the network topology or load changes, it can adjust the routing strategy to optimize network performance. Compared with traditional routing algorithms, DDPG offers higher flexibility and adaptability to dynamic network conditions.</p>
<p>In the power industry, electronic licenses need to be issued quickly to ensure timely data interaction and sharing. Traditional routing algorithms fail to meet the QoS requirements of this scenario. We combine the SDN architecture with a deep reinforcement learning algorithm and propose a QoS routing optimization algorithm based on deep reinforcement learning in SDN. Considering the QoS index requirements of this scenario, we design the reward function to provide the optimal QoS service. The main contributions of this paper are as follows:
<list list-type="order">
<list-item>
<p>We propose an overall routing architecture based on deep reinforcement learning in SDN. SDN is combined with deep reinforcement learning and applied to routing scenarios. The agent of reinforcement learning is placed in the SDN controller layer, and routing is placed in the data forwarding layer. Using SDN&#x2019;s separation of transfer and control can effectively improve the transmission efficiency of the network.</p></list-item>
<list-item>
<p>We propose a DDPG-based routing optimization algorithm in SDN. Taking QoS as the optimization objective, delay, bandwidth, jitter rate, and packet loss rate are comprehensively considered as metrics, the reward function is designed accordingly, and the DDPG algorithm is used to learn the optimal routing strategy that provides the optimal QoS service defined in this paper.</p></list-item>
<list-item>
<p>We carry out extensive experimental simulations. The results show that, compared with traditional routing algorithms, the DDPG algorithm in the routing optimization scenario can efficiently process data requests and reduce data transmission delays.</p></list-item>
</list></p>
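<p>As a concrete illustration of the reward design in the second contribution, the four QoS metrics can be folded into one scalar reward. The following is a minimal sketch only: the function name, the weights w1&#x2013;w4, and the normalization assumption are illustrative and not taken from this paper.</p>
<preformat>
```python
# Hedged sketch: one plausible way to combine the four QoS metrics into a
# single scalar reward. The weights w1..w4 are illustrative assumptions,
# not values from the paper; metrics are assumed normalized to [0, 1].
def qos_reward(delay, bandwidth, jitter, loss_rate,
               w1=0.4, w2=0.3, w3=0.15, w4=0.15):
    """Higher bandwidth raises the reward; delay, jitter and
    packet loss lower it."""
    return w2 * bandwidth - (w1 * delay + w3 * jitter + w4 * loss_rate)
```
</preformat>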
<p>The rest of the paper is structured as follows: <xref ref-type="sec" rid="s2">Section 2</xref> summarizes the related work. <xref ref-type="sec" rid="s3">Section 3</xref> presents the system model architecture. <xref ref-type="sec" rid="s4">Section 4</xref> introduces the routing algorithm based on deep reinforcement learning in SDN. In <xref ref-type="sec" rid="s5">Section 5</xref>, the experimental simulation is carried out. Conclusions are given in <xref ref-type="sec" rid="s6">Section 6</xref>.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>To solve the problem of network congestion and meet the demands of today&#x2019;s mass traffic, much related research has been conducted. At present, most research on QoS in SDN networks is based on a single data index, which has low algorithmic complexity and is simple to realize. It can optimize routing traffic to a certain extent, but it easily falls into local optima. Because it optimizes a single parameter without considering the constraints of multiple QoS parameters, it can only solve a specific part of the problem [<xref ref-type="bibr" rid="ref-12">12</xref>].</p>
<p>Most traditional routing methods use Shortest Path First to select forwarding routes, such as the Open Shortest Path First (OSPF) [<xref ref-type="bibr" rid="ref-13">13</xref>] protocol. Forwarding traffic considers only the shortest path, namely delay, without considering other QoS factors; this easily causes channel congestion and cannot meet the demands of today&#x2019;s high data traffic.</p>
<p>There have been extensive studies on applications of SDN [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>], and many studies in recent years have combined SDN with artificial intelligence technology in routing scenarios. Gopi et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] and Shirmarz et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] used SDN technology to enhance traditional routing protocols, which has limitations and does not combine network operation knowledge to realize intelligent routing. Xu et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed a route randomization defense method based on deep reinforcement learning to resist eavesdropping attacks; compared with other route randomization methods, it has obvious advantages in security and resistance to eavesdropping. Yu et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] proposed the DROM routing optimization framework, which uses the DDPG algorithm to optimize SDN routing and can handle multi-dimensional action spaces; by maximizing the reward, the link weights are continuously updated from the network state so as to select QoS routes that meet the constraints. Stampa et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] proposed a deep reinforcement learning agent for routing optimization under the SDN framework, which selects routes according to the current routing state to minimize network delay. Some related studies use the Q-learning algorithm to optimize QoS routing [<xref ref-type="bibr" rid="ref-21">21</xref>]; the whole training process is modeled as a Markov Decision Process (MDP) and can achieve low delay and high throughput to a certain extent. However, Q-learning stores the reward values of the training process in a Q-table; as the number of routes increases, the number of candidate paths grows greatly, and storing the Q-table incurs a large memory overhead. 
Other studies use the Deep Q-Network (DQN) algorithm for routing optimization [<xref ref-type="bibr" rid="ref-22">22</xref>], replacing the traditional Q-table with a neural network. However, DQN applies only to discrete action and state spaces, whereas today&#x2019;s network states change rapidly, so it is unsuitable for current networks with fast state transitions and heavy traffic.</p>
<p>The above references improve routing performance by using SDN and DRL technology, but they consider only the delay parameter among the QoS indices and optimize network delay, without considering other indicators of network status. In this paper, the SDN architecture and the DDPG algorithm are combined to optimize routing, and QoS is optimized by considering packet loss rate, jitter rate, delay, and bandwidth, to provide users with a better network experience. Therefore, using the global network topology to realize intelligent QoS routing optimization in the SDN architecture, improving QoS while ensuring network service quality, has become an urgent problem in current research.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Routing Architecture Based on Deep Reinforcement Learning in SDN</title>
<p>The system model architecture in this paper adopts SDN and DRL. SDN&#x2019;s separation of forwarding and control can effectively alleviate network congestion and inefficiency; combined with the decision-making ability of reinforcement learning and the perceptual ability of deep learning, it is well suited to the routing optimization scenario. In the SDN control layer, the DRL agent sends actions through the SDN controller to the data forwarding layer for path selection. In this section, the routing architecture based on deep reinforcement learning under SDN is introduced in detail.</p>
<p>SDN can effectively alleviate the congestion and inefficiency of current networks [<xref ref-type="bibr" rid="ref-23">23</xref>]. SDN is built by separating the control and data layers of today&#x2019;s network devices. The SDN framework is divided, from top to bottom, into the application layer, the control layer, and the data forwarding layer [<xref ref-type="bibr" rid="ref-24">24</xref>]. Information between the control layer and the application layer is transmitted through the northbound interface, and information between the control layer and the data layer through the southbound interface. The advantages of the SDN architecture are: 1) the network structure is clearly layered, with clearly distributed functions; 2) network transmission and configuration are operated in a unified, programmable way by the controller; 3) the control layer and the data forwarding layer are structurally decoupled, which can improve the efficiency of data transmission. SDN&#x2019;s separation of the transfer and control layers and its centralized control offer enhanced flexibility for data handling and accelerate overall network transmission efficiency, and it has been widely used in recent years.</p>
<p>The RL model is a kind of MDP architecture. RL proceeds as a continuous interaction between the agent and the environment: an action is selected, the state changes, and a reward value is returned; the interaction converges when the reward value reaches its maximum. Its representative algorithm is Q-learning [<xref ref-type="bibr" rid="ref-25">25</xref>], but as the number of environment states increases, the Q-table occupies larger storage resources and becomes time-consuming to search; when the number of environment states is immeasurable, the Q-table cannot store all the states. DRL can solve these problems of Q-learning by combining the decision-making ability of RL with the perceptual ability of deep learning. A representative algorithm is DQN, which combines a neural network with Q-learning, but it is not applicable to environments with continuous actions. The Deterministic Policy Gradient (DPG) algorithm [<xref ref-type="bibr" rid="ref-26">26</xref>] can be used in the case of continuous action changes but is prone to overfitting. The DDPG algorithm [<xref ref-type="bibr" rid="ref-27">27</xref>] solves these problems by combining the DPG and DQN algorithms within the Actor-Critic method. The deep reinforcement learning module in this paper uses DDPG, and the DDPG algorithm is described in detail next.</p>
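<p>The Q-table limitation discussed above can be seen in a minimal tabular Q-learning sketch. The state and action names below are hypothetical; the point is that every (state, action) pair needs its own table entry, which is what makes tabular storage impractical as the number of routes grows.</p>
<preformat>
```python
# Hedged sketch of the tabular Q-learning update contrasted with DDPG in
# the text: every (state, action) pair gets its own table entry, so the
# table grows with the number of states and candidate routes.
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # standard Q-learning update: move Q(s, a) toward r + gamma * max Q(s', .)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = defaultdict(float)                     # hypothetical empty Q-table
q_update(Q, "s0", "link1", 1.0, "s1", ["link1", "link2"])
```
</preformat>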
<p>The overall process of the DDPG algorithm is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>; it employs the experience replay technique. Experience replay stores a data tuple <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in a replay buffer for each state transition, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the current state of the environment, <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the action taken in the current environment, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the actual reward obtained by taking that action, and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the state of the environment after the action. Once the replay buffer reaches the set size N, M (M &#x003C; N) samples are collected uniformly at random to train the neural network; random sampling eliminates the temporal correlation between samples. 
The Actor-Critic method is used in the DDPG algorithm: each module has two neural networks, an online network for training and learning and a target network, both identical in structure. First, the online networks are initialized as <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and the target networks <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> are initialized by copying the online parameters: <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. The training samples are drawn by the experience replay mechanism. The detailed training process is described below.</p>
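<p>The experience replay mechanism described above can be sketched as a fixed-capacity buffer with uniform random sampling. This is a minimal illustration of the technique, not the implementation used in the experiments.</p>
<preformat>
```python
# Hedged sketch of experience replay: transitions (s, a, r, s_next) are
# stored in a bounded buffer, and batches are drawn uniformly at random
# to break the temporal correlation between consecutive samples.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # oldest transitions are discarded automatically once full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, m):
        # uniform random batch of m stored transitions
        return random.sample(list(self.buffer), m)

    def __len__(self):
        return len(self.buffer)
```
</preformat>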
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>DDPG training process</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-1.tif"/>
</fig>
<p>First, the agent acquires the initial state <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and loops over t = 1 to T, performing the same learning and training at each step. According to the current environment state <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the corresponding action <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is produced by the actor function of the online network. 
After the action is taken, the reward <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the new environment state <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> are obtained, and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is placed into the replay buffer R. When the amount of data in the replay buffer reaches the set number N, the neural network parameters are updated by randomly selecting M samples from the replay buffer R. The target value <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> for that action is the actual reward currently obtained plus the discounted reward of future actions predicted by the target critic, computed using the Temporal Difference (TD) method as in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, where &#x03B3; is the discount factor, representing the progressive decay of future rewards.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
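<p>A minimal numerical sketch of the TD target in Eq. (1): the immediate reward plus the discounted value of the next state, where the next action comes from the target actor and its value from the target critic. The toy stand-in functions below are illustrative assumptions, not the networks used in the paper.</p>
<preformat>
```python
# Hedged sketch of Eq. (1): y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).
# mu_prime and q_prime are toy stand-ins for the target actor and critic.
def td_target(r_i, s_next, mu_prime, q_prime, gamma=0.99):
    a_next = mu_prime(s_next)          # target actor picks the next action
    return r_i + gamma * q_prime(s_next, a_next)

# toy stand-ins for the target networks (illustrative only)
y = td_target(1.0, 0.5,
              mu_prime=lambda s: 2 * s,
              q_prime=lambda s, a: s + a,
              gamma=0.9)
```
</preformat>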
<p>The current state <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the corresponding action <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are taken as the input of <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, from which the predicted value is derived; the TD-error between this prediction and the <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> computed by the TD algorithm yields the loss function in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>, which is used to update the online critic network parameters.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula></p>
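<p>The loss in Eq. (2) is a mean squared TD-error between the targets and the online critic&#x2019;s predictions. A minimal sketch, using plain Python numbers in place of network outputs:</p>
<preformat>
```python
# Hedged sketch of Eq. (2): L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2.
# Plain floats stand in for the critic's tensor outputs.
def critic_loss(y_targets, q_values):
    n = len(y_targets)
    return sum((y - q) ** 2 for y, q in zip(y_targets, q_values)) / n

loss = critic_loss([2.0, 1.0], [1.5, 1.0])   # (0.25 + 0.0) / 2
```
</preformat>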
<p>The parameters in the online actor network are updated using the product of the gradient of the online critic network and the gradient of the online actor network, as shown in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>. The equations for updating the critic and actor parameters of the target network are shown in <xref ref-type="disp-formula" rid="eqn-4">Eqs. (4)</xref> and <xref ref-type="disp-formula" rid="eqn-5">(5)</xref>.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>Q</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mi>g</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:msup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></disp-formula></p>
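The soft updates of Eqs. (4) and (5) blend the online parameters into the target parameters with coefficient tau. A minimal sketch, with plain Python lists standing in for the network weight tensors:

```python
def soft_update(online_params, target_params, tau):
    """Element-wise soft update: theta_target = tau * theta_online + (1 - tau) * theta_target."""
    return [tau * o + (1.0 - tau) * t
            for o, t in zip(online_params, target_params)]

online = [1.0, 2.0]
target = [0.0, 0.0]
target = soft_update(online, target, tau=0.5)  # [0.5, 1.0]
```

A small tau makes the target networks track the online networks slowly, which stabilizes the TD targets used in Eq. (2).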
<p>The SDN architecture itself, with its inherent separation of the forwarding and control layers, offers a platform for network optimization. By integrating DRL, specifically the DDPG algorithm, into the routing optimization process, we aim to achieve QoS routing optimization. The overall architecture is depicted in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. In this architecture, the network topology resides within the data forwarding layer of the SDN architecture, serving as the environment element in the MDP model. The DRL agent, utilizing DDPG, operates within the control layer of SDN, functioning as the agent element of the MDP. The agent receives real-time network state information from the environment as input to the DDPG neural network, and outputs the action with the maximum Q-value to the SDN controller, which centrally issues the next-hop routing action. The network then transitions to the next state, and actual network data are provided to the agent as parameters for the reward value calculation based on QoS metrics. This reward value is used to update the neural network parameters of the DDPG algorithm employed by the agent. These parameters are trained and updated continuously until convergence, yielding the optimal routing and forwarding policy. Consequently, this approach identifies the optimal path for route forwarding that meets the specified QoS targets.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Overall system architecture for deep reinforcement learning-based routing in SDN</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-2.tif"/>
</fig>
</sec>
<sec id="s4">
<label>4</label>
<title>Deep Reinforcement Learning Based Routing Algorithm in SDN</title>
<p>This paper adopts the DDPG algorithm for deep reinforcement learning; its training process was described in detail in the previous section. DDPG is applicable to continuous action spaces, and its neural networks are used to record the reward values during training. The design and implementation of each element of DDPG are described below. A directed graph G (V, E) is used to represent the routing network, where V denotes the set of all router nodes and E denotes the set of links between them. The number of nodes is N &#x003D; |V|, and the number of links is L &#x003D; |E|.</p>
<p>State: In the RL process, the state reflects the characteristics of the environment in which the agent is currently located. In DRL-based routing scenarios, the state represents the transmission status of packets in the network as they travel from the source node to the destination node through the N nodes of the network. For each QoS metric, a |N| &#x002A; |N| two-dimensional matrix R is defined, where <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> stands for the per-unit-time value of the QoS metric (bandwidth, delay, packet loss rate, or jitter rate) for packets sent from node <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to node <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The state matrix is shown in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>.
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mtable columnalign="center center center" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd><mml:mtd><mml:mo>&#x22F1;</mml:mo></mml:mtd><mml:mtd><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
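The state matrix of Eq. (6) can be assembled from per-link measurements. A minimal sketch, assuming a hypothetical dictionary of measured metric values keyed by (i, j) node pairs:

```python
def build_state_matrix(n_nodes, link_metric):
    """N x N matrix R with R[i][j] = d_ij, the measured QoS metric per unit
    time from node i to node j (0.0 where no link or measurement exists)."""
    return [[link_metric.get((i, j), 0.0) for j in range(n_nodes)]
            for i in range(n_nodes)]

# 3-node example: measured delay on links (0, 1) and (1, 2)
delay = {(0, 1): 5.0, (1, 2): 3.0}
R = build_state_matrix(3, delay)
# R == [[0.0, 5.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
```

One such matrix per QoS metric (bandwidth, delay, packet loss rate, jitter rate) forms the agent's state input.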
<p>Action: The action is the next-hop route that the agent chooses based on the current state and reward; in routing, it corresponds to the specific routing rules issued by the agent to the network. Supposing the network is equipped with E edges, the set of actions is defined as <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>A</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>E</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, with each communication link <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>E</mml:mi></mml:math></inline-formula> in the network corresponding to one action.</p>
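Indexing the action set by the links of G (V, E) can be sketched as follows (the edge list here is illustrative):

```python
def action_set(edges):
    """A = [a_1, ..., a_|E|]: one action per communication link (i, j) in E."""
    return {idx: link for idx, link in enumerate(edges)}

edges = [(0, 1), (1, 2), (0, 2)]  # illustrative 3-link topology
A = action_set(edges)
# Selecting action index 1 routes the packet over link (1, 2)
```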
<p>Reward Function: Based on the current network state and the action taken by the agent, the transition to the next network state feeds back a reward; the reward function can be designed with different indicators for different networks. In this paper, the reward is designed from the QoS parameters delay D, bandwidth B, packet loss rate L, and jitter rate J, where smaller delay, jitter rate, and packet loss rate represent better network quality. The reward function is shown in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>, where <italic>w<sub>1</sub></italic>, <italic>w<sub>2</sub></italic>, <italic>w<sub>3</sub></italic>, and <italic>w<sub>4</sub></italic> take values in the range of (0,1).
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>w</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>j</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
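Eq. (7) can be evaluated directly from the four measured link metrics. A minimal sketch, with illustrative weights and metric values rather than the paper's settings:

```python
def reward(d, b, l, j, w1, w2, w3, w4):
    """Eq. (7): Reward = -d*w1 + b*w2 - l*w3 - j*w4. Bandwidth is rewarded;
    delay, packet loss rate, and jitter rate are penalized."""
    return -d * w1 + b * w2 - l * w3 - j * w4

# Illustrative values: delay 2.0, bandwidth 10.0, loss 0.1, jitter 0.05
r = reward(d=2.0, b=10.0, l=0.1, j=0.05, w1=0.9, w2=0.9, w3=0.9, w4=0.9)
# = -1.8 + 9.0 - 0.09 - 0.045 (about 7.065)
```

Raising one weight relative to the others biases the learned policy toward optimizing that metric, which is the mechanism explored in the weight-sensitivity experiment later in the paper.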
<p>Routing Optimization: The DDPG-based routing optimization algorithm under SDN is shown in Algorithm 1. The algorithm initializes the parameters of the critic and actor networks, the target network parameters, and the replay buffer. During training, the current routing network state <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is obtained from the SDN controller, the action <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>t</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is taken according to the current policy and exploration noise, the state transitions to <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and the resulting reward value <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is calculated. The size of the reward value r is influenced by the routing QoS metrics: delay, bandwidth, packet loss rate, and jitter rate.
After that, the transition tuple <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is put into the replay buffer; once the set target number is reached, the specified number of transitions is randomly sampled from the replay buffer to update the parameters of the critic network, the actor network, and the target networks until all parameters converge. To find the QoS-optimal path from a source node i to a destination node, the agent obtains the routing network information from the SDN controller and outputs the path that maximizes the value of <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> from node i to node j. The routing is thus optimized to improve QoS so that the network can provide services more stably.</p>
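The replay-buffer interaction described above can be sketched as follows: the buffer stores transition tuples and, once enough have accumulated, uniform random minibatches feed the network updates of Eqs. (2)-(5). The environment loop here is a stand-in, not the paper's simulator:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random minibatch used for the critic/actor/target updates
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):                 # stand-in environment interaction
    buf.push(s=t, a=t % 4, r=-1.0, s_next=t + 1)
if len(buf) >= 16:                   # train once enough transitions exist
    minibatch = buf.sample(16)       # feeds the updates of Eqs. (2)-(5)
```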
<fig id="fig-11">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-11.tif"/>
</fig>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiment</title>
<sec id="s5_1">
<label>5.1</label>
<title>Environment and Parameter Setting</title>
<p>The network environment simulated in this paper is a Barabasi-Albert network with 500 nodes and an average node degree of 3. The DRL module is implemented based on the PyTorch framework. The deep reinforcement learning parameters are set as follows: the random sampling training capacity is M &#x003D; 16, the capacity of the replay buffer is N &#x003D; 1000, the discount factor <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> &#x003D; 0.99, the neural network parameter is 0.01, and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> &#x003D; 0.5. The number of training rounds is 30, and each round runs 2000 iteration steps. The reward function parameters <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> are each set to 0.9. Specific parameter settings are shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Parameter setting</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th>Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of network nodes</td>
<td>500</td>
</tr>
<tr>
<td>Node degree</td>
<td>3</td>
</tr>
<tr>
<td>N</td>
<td>1000</td>
</tr>
<tr>
<td>M</td>
<td>16</td>
</tr>
<tr>
<td><inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula></td>
<td>0.99</td>
</tr>
<tr>
<td><inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula></td>
<td>0.5</td>
</tr>
<tr>
<td>Learning rate</td>
<td>0.005</td>
</tr>
<tr>
<td>Steps</td>
<td>50</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Experimental Result</title>
<p>For the DDPG algorithm, 5000 network packets are used over 50 rounds of training and learning, and the Shortest Path baseline uses the Dijkstra algorithm to find the shortest route. The experimental performance indices are defined as follows:</p>
<p>Average delay time per episode: The average of the delay times (s) in each training round.</p>
<p>Percent of empty nodes per episode: Percentage of idle nodes in each training round.</p>
<p>Average delay time: The average of the delay time (s) experienced by packet transmission.</p>
<p>Average packet idle time: The average time (s) that a packet is idle between sending and receiving.</p>
<p>Average non-empty queue length: The average number of non-empty elements in a queue.</p>
<p>Maximum number of packet nodes held: The maximum number of packet nodes held.</p>
<p>Percent of working nodes at capacity: The percentage of worker nodes that have reached their maximum capacity.</p>
<p><xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows that as the number of training rounds increases, the average delay time of the DDPG algorithm trends downward. Before round 10 the decline is steep; after round 10 it gradually flattens; and by round 20 the average delay time has essentially converged. The results indicate that the DDPG algorithm is effective in the routing environment.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Average delay time per episode</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-3.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows that the percentage of idle nodes increases with the number of training rounds. The percentage rises sharply up to round 10 and gradually converges by round 20. Once the DDPG algorithm has largely converged, it selects the optimal routing path, i.e., the path with relatively fewer routing hops and faster transmission, so the proportion of occupied nodes is reduced.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Percent of empty nodes per episode</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-4.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows the comparison of average delay time as the number of packets increases. As the number of packets grows, the delay of the Shortest Path algorithm increases sharply, while the delay of the DDPG algorithm increases slowly. When the number of packets reaches 5000, the average delay of the DDPG algorithm is 182.55 units less than that of the Shortest Path algorithm, so the DDPG algorithm performs better in terms of delay.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Average delay time</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-5.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> shows the average waiting time of packets as the number of packets increases. The packet waiting time of the Shortest Path algorithm grows sharply with the number of packets, whereas that of the DDPG algorithm grows slowly and remains lower. The average packet waiting time of the DDPG algorithm is 82.2 units less than that of the Shortest Path algorithm.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Average packet idle time</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-6.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-7">Fig. 7</xref> illustrates the change in the average non-empty queue length as the number of packets increases. As the number of packets grows, the average non-empty queue length of the Shortest Path algorithm increases sharply, while that of the DDPG algorithm remains stable without large fluctuations. When the number of packets reaches 5000, the average non-empty queue length of the DDPG algorithm is 61.481 less than that of the Shortest Path algorithm.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Average non-empty queue length</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-7.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-8">Fig. 8</xref> illustrates the variation of the maximum number of packet nodes held as the number of packets increases. The maximum for the Shortest Path algorithm changes little but stays above 140, while that of the DDPG algorithm increases gradually yet remains below the Shortest Path value.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Maximum number of packet nodes held</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-8.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-9">Fig. 9</xref> illustrates the capacity percentage of working nodes as the number of packets increases. The percentage for the Shortest Path algorithm grows faster than that for the DDPG algorithm; at 5000 packets, the workload percentage of DDPG is nearly 25% less than that of the Shortest Path algorithm.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Percent of working nodes at capacity</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-9.tif"/>
</fig>
<p>To show the impact of different w values in the reward function on performance, we conduct comparative experiments. <xref ref-type="fig" rid="fig-10">Fig. 10</xref> illustrates the effect of different values of w on the delay. When <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is set to 1, only the delay factor is considered and the other w values are set to 0. When <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is set to 1/4, the other w values are also set to 1/4, giving all factors the same weight. When <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is set to 0, the delay factor is not considered. The w values can thus be adjusted to match different performance requirements.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Effect of different w values on latency</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_51217-fig-10.tif"/>
</fig>
<p>In summary, when the DDPG algorithm is employed in the routing optimization scenario and faces a large number of traffic data requests, it customizes the optimal routing path relative to traditional routing algorithms: the QoS-optimal path is selected comprehensively, the packet queue length is reduced, data requests are processed in time, packet waiting time is shortened, and delay is reduced.</p>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>Traditional network routing schemes face significant challenges in achieving efficient and fast electronic certificate issuance in power business scenarios. To solve these problems, we propose a QoS routing optimization method based on deep reinforcement learning in SDN. By combining the SDN architecture with the DDPG algorithm, a reward function based on QoS metrics was designed to obtain the optimal network transmission path. Exploiting SDN's separation of forwarding and control, the optimal network transmission path was communicated uniformly to the data transmission layer, thereby improving the network transmission rate and providing optimal QoS. The experimental results show that the algorithm can effectively improve the efficiency of network transmission, reduce the delay of processing data packets, and effectively reduce network congestion. In the future, we will incorporate multiple objectives such as security, performance, and resource utilization into the routing optimization algorithm, seek the optimal solution to the multi-objective problem, and compare multiple routing optimization algorithms.</p>
</sec>
</body>
<back>
<ack><p>All authors sincerely thank all institutions for providing resources and research conditions. We would like to thank all the members of the research group for their suggestions and support, which have provided important help and far-reaching influence on our research work.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work has been supported by State Grid Corporation of China Science and Technology Project &#x201C;Research and Application of Key Technologies for Trusted Issuance and Security Control of Electronic Licenses for Power Business&#x201D; (5700-202353318A-1-1-ZN).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Study conception and design: Yu Song, Ao Xiong; analysis and interpretation of results: Xusheng Qian, Nan Zhang, Wei Wang; draft manuscript preparation: Yu Song. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The authors confirm that the data supporting the findings of this study are available within the article.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ali</surname></string-name> and <string-name><given-names>B. H.</given-names> <surname>Roh</surname></string-name></person-group>, &#x201C;<article-title>Quality of service improvement with optimal software-defined networking controller and control plane clustering</article-title>,&#x201D; <source>Comput. Mater. Contin.</source>, vol. <volume>67</volume>, no. <issue>1</issue>, pp. <fpage>849</fpage>&#x2013;<lpage>875</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2021.014576</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Tomovic</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Radusinovic</surname></string-name>, and <string-name><given-names>N.</given-names> <surname>Prasad</surname></string-name></person-group>, &#x201C;<article-title>Performance comparison of QoS routing algorithms applicable to large-scale SDN networks</article-title>,&#x201D; in <conf-name>IEEE EUROCON 2015-Int. Conf. Comput. Tool (EUROCON)</conf-name>, <publisher-loc>Salamanca, Spain</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Akin</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Korkmaz</surname></string-name></person-group>, &#x201C;<article-title>Comparison of routing algorithms with static and dynamic link cost in software defined networking (SDN)</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>148629</fpage>&#x2013;<lpage>148644</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2946707</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Masoudi</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Ghaffari</surname></string-name></person-group>, &#x201C;<article-title>Software defined networks: A survey</article-title>,&#x201D; <source>J. Netw. Comput. Appl.</source>, vol. <volume>67</volume>, no. <issue>4</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>25</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.1016/j.jnca.2016.03.016</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Xiang</surname></string-name></person-group>, &#x201C;<article-title>Challenge based collaborative intrusion detection in software-defined networking: An evaluation</article-title>,&#x201D; <source>Digital Commun. Netw.</source>, vol. <volume>7</volume>, no. <issue>2</issue>, pp. <fpage>257</fpage>&#x2013;<lpage>263</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1016/j.dcan.2020.09.003</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Xu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Secure service offloading for internet of vehicles in SDN-enabled mobile edge computing</article-title>,&#x201D; <source>IEEE Trans. Intell. Transp. Syst.</source>, vol. <volume>22</volume>, no. <issue>6</issue>, pp. <fpage>3720</fpage>&#x2013;<lpage>3729</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2020.3034197</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Cicio&#x011F;lu</surname></string-name> and <string-name><given-names>A.</given-names> <surname>&#x00C7;alhan</surname></string-name></person-group>, &#x201C;<article-title>Energy-efficient and SDN-enabled routing algorithm for wireless body area networks</article-title>,&#x201D; <source>Comput. Commun.</source>, vol. <volume>160</volume>, pp. <fpage>228</fpage>&#x2013;<lpage>239</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Yang</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>DetFed: Dynamic resource scheduling for deterministic federated learning over time-sensitive networks</article-title>,&#x201D; <source>IEEE Trans. Mob. Comput.</source>, vol. <volume>23</volume>, no. <issue>5</issue>, pp. <fpage>5162</fpage>&#x2013;<lpage>5178</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1109/TMC.2023.3303017</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Chen</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>A three-party hierarchical game for physical layer security aware wireless communications with dynamic trilateral coalitions</article-title>,&#x201D; <source>IEEE Trans. Wirel. Commun.</source>, Early Access, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zhu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Edge-assisted video transmission with adaptive key frame selection: A hierarchical DRL approach</article-title>,&#x201D; in <conf-name>2023 Biennial Symp. Commun. (BSC)</conf-name>, <publisher-loc>Montreal, QC, Canada</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2023</year>, pp. <fpage>89</fpage>&#x2013;<lpage>94</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Chen</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>A DRL-based hierarchical game for physical layer security with dynamic trilateral coalitions</article-title>,&#x201D; in <conf-name>ICC 2023&#x2013;IEEE Int. Conf. Commun.</conf-name>, <publisher-loc>Rome, Italy</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Tomovic</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Radusinovic</surname></string-name></person-group>, &#x201C;<article-title>Fast and efficient bandwidth-delay constrained routing algorithm for SDN networks</article-title>,&#x201D; in <conf-name>2016 IEEE NetSoft Conf. Workshops (NetSoft)</conf-name>, <publisher-loc>Seoul, Korea (South)</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2016</year>, pp. <fpage>303</fpage>&#x2013;<lpage>311</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Verma</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Bhardwaj</surname></string-name></person-group>, &#x201C;<article-title>A review on routing information protocol (RIP) and open shortest path first (OSPF) routing protocol</article-title>,&#x201D; <source>Int. J. Future Gener. Commun. Netw.</source>, vol. <volume>9</volume>, no. <issue>4</issue>, pp. <fpage>161</fpage>&#x2013;<lpage>170</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.14257/ijfgcn.2016.9.4.13</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. S.</given-names> <surname>Vinod Chandra</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Anand Hareendran</surname></string-name></person-group>, &#x201C;<article-title>Modified smell detection algorithm for optimal paths engineering in hybrid SDN</article-title>,&#x201D; <source>J. Parallel Distr. Comput.</source>, vol. <volume>187</volume>, no. <issue>1</issue>, pp. <fpage>104834</fpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.1016/j.jpdc.2023.104834</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>A. R.</given-names> <surname>Ananthalakshmi</surname></string-name>, <string-name><given-names>P. C.</given-names> <surname>Sajimon</surname></string-name>, and <string-name><given-names>S. S.</given-names> <surname>Vinod Chandra</surname></string-name></person-group>, &#x201C;<chapter-title>Application of smell detection agent based algorithm for optimal path identification by SDN controllers</chapter-title>,&#x201D; in <source>Advances in Swarm Intelligence,</source> <publisher-loc>Fukuoka, Japan</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>, <year>Jul. 27&#x2013;Aug. 1, 2017</year>, vol. <volume>10386</volume>, pp. <fpage>502</fpage>&#x2013;<lpage>510</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Gopi</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Cheng</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Huck</surname></string-name></person-group>, &#x201C;<article-title>Comparative analysis of SDN and conventional networks using routing protocols</article-title>,&#x201D; in <conf-name>2017 Int. Conf. Comput., Inform. Telecommun. Syst. (CITS)</conf-name>, <publisher-loc>Dalian, China</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2017</year>, pp. <fpage>108</fpage>&#x2013;<lpage>112</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Shirmarz</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Ghaffari</surname></string-name></person-group>, &#x201C;<article-title>An adaptive greedy flow routing algorithm for performance improvement in software-defined network</article-title>,&#x201D; <source>Int. J. Numerical Model.: Elect. Netw., Devices Fields</source>, vol. <volume>33</volume>, no. <issue>1</issue>, pp. <fpage>2676</fpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1002/jnm.2676</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Tan</surname></string-name>, and <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Moving target defense of routing randomization with deep reinforcement learning against eavesdropping attack</article-title>,&#x201D; <source>Digital Commun. Netw.</source>, vol. <volume>8</volume>, no. <issue>3</issue>, pp. <fpage>373</fpage>&#x2013;<lpage>387</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Lan</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Guo</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Hu</surname></string-name></person-group>, &#x201C;<article-title>DROM: Optimizing the routing in software-defined networks with deep reinforcement learning</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>6</volume>, pp. <fpage>64533</fpage>&#x2013;<lpage>64539</lpage>, <year>2018</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2018.2877686</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Stampa</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>A deep-reinforcement learning approach for software-defined networking routing optimization</article-title>,&#x201D; <comment>arXiv preprint arXiv:1709.07080</comment>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O. J.</given-names> <surname>Pandey</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Yuvaraj</surname></string-name>, <string-name><given-names>J. K.</given-names> <surname>Paul</surname></string-name>, <string-name><given-names>H. H.</given-names> <surname>Nguyen</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Gundepudi</surname></string-name> and <string-name><given-names>M. K.</given-names> <surname>Shukla</surname></string-name></person-group>, &#x201C;<article-title>Improving energy efficiency and QoS of LPWANs for IoT using Q-learning based data routing</article-title>,&#x201D; <source>IEEE Trans. Cogn. Commun. Netw.</source>, vol. <volume>8</volume>, no. <issue>1</issue>, pp. <fpage>365</fpage>&#x2013;<lpage>379</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1109/TCCN.2021.3114147</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Zhao</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>DRL-M4MR: An intelligent multicast routing approach based on DQN deep reinforcement learning in SDN</article-title>,&#x201D; <comment>arXiv preprint arXiv:2208.00383</comment>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Yan</surname></string-name></person-group>, &#x201C;<article-title>Performance of SDN routing in comparison with legacy routing protocols</article-title>,&#x201D; in <conf-name>2015 Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Disc.</conf-name>, <publisher-loc>Xi&#x2019;an, China</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Singh</surname></string-name> and <string-name><given-names>R. K.</given-names> <surname>Jha</surname></string-name></person-group>, &#x201C;<article-title>A survey on software defined networking: Architecture for next generation network</article-title>,&#x201D; <source>J. Netw. Syst. Manage.</source>, vol. <volume>25</volume>, no. <issue>2</issue>, pp. <fpage>321</fpage>&#x2013;<lpage>374</lpage>, <year>2017</year>. doi: <pub-id pub-id-type="doi">10.1007/s10922-016-9393-9</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Clifton</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Laber</surname></string-name></person-group>, &#x201C;<article-title>Q-learning: Theory and applications</article-title>,&#x201D; <source>Annu. Rev. Stat. Appl.</source>, vol. <volume>7</volume>, no. <issue>1</issue>, pp. <fpage>279</fpage>&#x2013;<lpage>301</lpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1146/annurev-statistics-031219-041220</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>L&#x00FC;</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Yu</surname></string-name></person-group>, &#x201C;<article-title>Regularly updated deterministic policy gradient algorithm</article-title>,&#x201D; <source>Knowl. Based Syst.</source>, vol. <volume>214</volume>, pp. <fpage>106736</fpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1016/j.knosys.2020.106736</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Tan</surname></string-name></person-group>, &#x201C;<article-title>Reinforcement learning with deep deterministic policy gradient</article-title>,&#x201D; in <conf-name>2021 Int. Conf. Artif. Intell., Big Data Algor. (CAIBDA)</conf-name>, <publisher-loc>Xi&#x2019;an, China</publisher-loc>, <publisher-name>IEEE</publisher-name>, <year>2021</year>, pp. <fpage>82</fpage>&#x2013;<lpage>85</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>