<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">21052</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2022.021052</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Research on Volt/Var Control of Distribution Networks Based on PPO Algorithm</article-title>
<alt-title alt-title-type="left-running-head">Research on Volt/Var Control of Distribution Networks Based on PPO Algorithm</alt-title>
<alt-title alt-title-type="right-running-head">Research on Volt/Var Control of Distribution Networks Based on PPO Algorithm</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhu</surname><given-names>Chao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Lei</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Pan</surname><given-names>Dai</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Zifei</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Tao</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-6" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wang</surname><given-names>Licheng</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>wanglicheng@zjut.edu.cn</email>
</contrib>
<contrib id="author-7" contrib-type="author">
<name name-style="western"><surname>Ye</surname><given-names>Chengjin</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>State Grid Zhejiang Economic and Technological Research Institute</institution>, <addr-line>Hangzhou, 310008</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>College of Information Engineering, Zhejiang University of Technology</institution>, <addr-line>Hangzhou, 310023</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>College of Electrical Engineering, Zhejiang University</institution>, <addr-line>Hangzhou, 310058</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Licheng Wang. Email: <email>wanglicheng@zjut.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-11"><day>11</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>134</volume>
<issue>1</issue>
<fpage>599</fpage>
<lpage>609</lpage>
<history>
<date date-type="received"><day>24</day><month>12</month><year>2021</year></date>
<date date-type="accepted"><day>16</day><month>2</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Zhu et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhu et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_21052.pdf"></self-uri>
<abstract>
<p>In this paper, a model-free volt/var control (VVC) algorithm is developed using deep reinforcement learning (DRL). We cast the VVC problem of distribution networks into the network framework of the Proximal Policy Optimization (PPO) algorithm, which avoids directly solving a large-scale nonlinear optimization problem. Photovoltaic inverters are selected as agents that adjust system voltage in a distribution network, with the reactive power outputs of the inverters as the action variables. An appropriate reward function is designed to guide the interaction between the photovoltaic inverters and the distribution network environment, and OPENDSS is used to compute system node voltages and network losses. The method achieves optimal VVC of the distribution network. The IEEE 13-bus three-phase unbalanced distribution system is used to verify the effectiveness of the proposed algorithm, and simulation results demonstrate that the proposed method performs well in the voltage and reactive power regulation of a distribution network.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Deep reinforcement learning</kwd>
<kwd>voltage regulation</kwd>
<kwd>unbalanced distribution systems</kwd>
<kwd>high photovoltaic penetration</kwd>
<kwd>photovoltaic inverter</kwd>
<kwd>volt/var control</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>In recent years, with the heavy consumption of traditional energy sources, the energy crisis and environmental pollution have become increasingly serious. At the same time, to support China&#x0027;s energy strategy of &#x201C;carbon peaking&#x201D; and &#x201C;carbon neutralization&#x201D;, the energy structure dominated by fossil fuels is gradually transforming into one dominated by renewable energy, and the new energy industry has developed rapidly [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>]. New energy sources are clean, renewable, and require little operation and maintenance, but they impose new requirements on traditional volt/var control (VVC) [<xref ref-type="bibr" rid="ref-6">6</xref>]. To handle local voltage-limit violations caused by the intermittent and fluctuating photovoltaic output [<xref ref-type="bibr" rid="ref-7">7</xref>], traditional VVC uses the discrete tap/switch mechanisms of on-load tap changers (OLTCs) and capacitor banks (CBs) to control the voltage [<xref ref-type="bibr" rid="ref-8">8</xref>]. However, as photovoltaic penetration in the distribution network continues to increase, the burden on such voltage-regulating equipment grows sharply (e.g., frequent tap switching [<xref ref-type="bibr" rid="ref-9">9</xref>] and repeated charging and discharging of energy storage), which accelerates aging and even damage of the equipment and cannot cope with the voltage violations caused by high photovoltaic penetration [<xref ref-type="bibr" rid="ref-10">10</xref>].
Because the photovoltaic inverter responds almost instantaneously to system voltage changes and, under the revised IEEE 1547 standard [<xref ref-type="bibr" rid="ref-11">11</xref>], may participate in the voltage regulation of the distribution network, it is widely used for voltage management under high photovoltaic penetration [<xref ref-type="bibr" rid="ref-12">12</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>].</p>
<p>At the algorithm design level, early strategies for photovoltaic inverters participating in distribution network voltage control were mainly centralized solutions based on optimal power flow (OPF) algorithms [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. However, these methods generally suffer from heavy computation, a tendency to fall into local optima, strong dependence on forecast data, and difficulty in achieving online control. Considering that photovoltaic inverters can flexibly regulate reactive power and that deep reinforcement learning models can process massive, complex data in real time [<xref ref-type="bibr" rid="ref-21">21</xref>], this paper proposes a real-time voltage regulation method for distribution networks based on reinforcement learning. The VVC problem is cast into a Proximal Policy Optimization (PPO) network framework. Multiple inverters act as agents, and each agent&#x0027;s action is determined through interactive training between the inverter and the environment. The method achieves voltage management under high photovoltaic penetration. The main contributions of this paper are as follows:
<list list-type="simple">
<list-item><label>1)</label><p>We propose a data-driven real-time voltage control framework that quickly resolves the voltage violations caused by high photovoltaic penetration by controlling multiple photovoltaic inverters.</p></list-item>
<list-item><label>2)</label><p>We propose a multi-agent deep reinforcement learning (MADRL) algorithm based on photovoltaic inverters. During offline training, voltage-limit violations and the reactive power outputs of the photovoltaic inverters are modeled as penalty terms to ensure grid security.</p></list-item>
<list-item><label>3)</label><p>The loads and voltage values of all nodes are integrated through OPENDSS, and the MADRL problem is solved with the PPO algorithm. Compared with traditional methods, the voltage regulation efficiency of the three-phase distribution system is significantly improved.</p></list-item>
</list></p>
</sec>
<sec id="s2"><label>2</label><title>PPO Algorithm</title>
<p>The PPO algorithm is a deep reinforcement learning algorithm with an actor-critic structure, which obtains the optimal policy via the policy gradient. The critic network in the PPO algorithm approximates the state value function, and its parameters are updated by minimizing the estimation error of the value function. The calculation is given in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mi>k</mml:mi></mml:msup></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>V</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> denotes the critic network parameters and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>V</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the critic network output.</p>
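<p>As a minimal illustrative sketch (not part of the original implementation, with all names hypothetical), the critic target of Eq. (1) is the discounted sum of observed rewards minus the critic&#x0027;s current estimate:</p>

```python
import numpy as np

def critic_td_error(rewards, v_st, gamma):
    """Discounted sum of observed rewards minus the critic's current
    estimate V(s_t), i.e., the quantity the critic update in Eq. (1)
    drives toward zero."""
    discounts = gamma ** np.arange(len(rewards))   # 1, gamma, gamma^2, ...
    target = float(np.sum(discounts * np.asarray(rewards, dtype=float)))
    return target - v_st

# Example with three observed rewards and gamma = 0.9; the critic
# parameters phi would be updated to minimize the squared error.
delta = critic_td_error([1.0, 0.5, 0.25], v_st=1.2, gamma=0.9)
```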
<p>In the PPO algorithm, the actor network approximates the policy, and its parameters are updated by introducing importance sampling and continuously improving the objective function. Importance sampling not only improves the utilization of data samples but also speeds up model convergence. The specific procedure is given by <xref ref-type="disp-formula" rid="eqn-2">Eqs. (2)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-8">(8)</xref>. Assuming a random variable <italic>x</italic> with probability density function <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, the expectation of <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is computed as in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo largeop="false">&#x222B;</mml:mo></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo largeop="false">&#x222B;</mml:mo></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
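<p>A minimal numerical check of the importance-sampling identity in Eq. (2): samples drawn from a distribution <italic>q</italic>, reweighted by <italic>p</italic>(<italic>x</italic>)/<italic>q</italic>(<italic>x</italic>), recover an expectation under <italic>p</italic>. The Gaussian densities and <italic>f</italic>(<italic>x</italic>) = <italic>x</italic><sup>2</sup> below are chosen only for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density p = N(0, 1), sampling density q = N(0.5, 1), f(x) = x^2.
# The true value of E_{x~p}[f(x)] is 1 (the variance of a standard normal).
def p_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x - 0.5) ** 2) / np.sqrt(2.0 * np.pi)

x = rng.normal(0.5, 1.0, size=200_000)   # samples drawn from q, not p
weights = p_pdf(x) / q_pdf(x)            # importance weights p(x)/q(x)
estimate = float(np.mean(weights * x ** 2))
# estimate approximates E_{x~p}[x^2] = 1 using only samples from q
```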
<p>Applying the importance sampling method of <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> to the PPO algorithm, the objective function can be written as <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> [<xref ref-type="bibr" rid="ref-22">22</xref>].
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>V</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mi>V</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>T</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the advantage function estimated under the sampling policy with the <italic>T</italic>-step return method, and it plays the role of <italic>f</italic>(<italic>x</italic>) in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>. 
<inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the ratio between the probabilities with which the new policy and the old policy take the action in the current state, corresponding to <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>. The premise for applying <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> to the PPO algorithm is that the gap between the policy probability distributions <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> remains within a certain range [<xref ref-type="bibr" rid="ref-23">23</xref>]. Therefore, the <italic>KL</italic> divergence is introduced into the PPO algorithm, and the objective function becomes <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>.
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mi>K</mml:mi><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> represents the penalty for the difference between <inline-formula id="ieqn-11"><mml:math 
id="mml-ieqn-11"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> distributions. Because the <italic>KL</italic> divergence is difficult to compute, a clipping method is used in its place, which effectively limits the size of each update. The objective function of the PPO algorithm with the clip function is expressed in <xref ref-type="disp-formula" rid="eqn-7">Eqs. (7)</xref> and <xref ref-type="disp-formula" rid="eqn-8">(8)</xref>.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>&#x03B8;</mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
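<p>The clipped surrogate objective of Eq. (7) can be sketched per sample as follows; the function name and the numerical example are illustrative, not the paper&#x0027;s implementation:</p>

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate of Eq. (7):
    min(r_theta * A', clip(r_theta, 1 - eps, 1 + eps) * A')."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)

# With a positive advantage, pushing the ratio above 1 + eps brings no
# extra gain, so the incentive for large policy updates is removed.
surrogate = ppo_clip_objective(np.array([0.5, 1.0, 1.5]), np.array([2.0, 2.0, 2.0]))
```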
</sec>
<sec id="s3"><label>3</label><title>Proposed VVC Algorithm</title>
<p>The distribution network environment is modeled according to Markov decision theory and the PPO algorithm framework. Taking the reactive power output of each inverter in the distribution network as the control variable, after offline centralized training, the goal of keeping the distribution network voltage within limits under high photovoltaic penetration is achieved.</p>
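<p>The interaction described above can be sketched as a simple loop. In this sketch, <monospace>run_power_flow</monospace> is a crude stand-in for the OPENDSS power-flow solve, the random policy takes the role of the PPO actor, and all names, limits, and sensitivities are illustrative assumptions rather than the paper&#x0027;s implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def run_power_flow(q_injections):
    """Stand-in for the OPENDSS power-flow solve: maps inverter reactive
    outputs to per-node voltage magnitudes (p.u.). A real setup would
    drive OpenDSS here instead of this crude sensitivity model."""
    base = 1.0 + 0.03 * rng.standard_normal(3)       # illustrative node voltages
    return base + 0.01 * np.tanh(q_injections)

def reward(voltages, v_min=0.95, v_max=1.05):
    # Penalize voltage-limit violations, mirroring the penalty terms
    # used in the reward design of Section 3.1.
    violation = np.maximum(voltages - v_max, 0.0) + np.maximum(v_min - voltages, 0.0)
    return float(-np.sum(violation))

for step in range(5):
    action = rng.uniform(-1.0, 1.0, size=3)   # placeholder policy; the PPO actor in practice
    state = run_power_flow(action)
    r = reward(state)
    # During training, (state, action, r) samples are stored and used
    # for the PPO actor and critic updates.
```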
<sec id="s3_1"><label>3.1</label><title>Environmental Modeling</title>
<p>A Markov decision process is defined by a five-tuple (<italic>s</italic>, <italic>a</italic>, <italic>P</italic>, <italic>R</italic>, <italic>&#x03B3;</italic>). The power system environment is modeled mainly in terms of the state <italic>s</italic>, the action <italic>a</italic>, and the reward <italic>R</italic>. Under the framework of this paper, the main task of the agent is to select an appropriate reactive power output and pass it to OPENDSS, so that the power flow calculation converges and no node voltage exceeds its limit.
<list list-type="simple">
<list-item><label>1)</label><p>State:</p></list-item>
</list></p>
<p>The state must guide the agent toward appropriate actions [<xref ref-type="bibr" rid="ref-24">24</xref>]. The state used in this paper is given in <xref ref-type="disp-formula" rid="eqn-9">Eqs. (9)</xref> and <xref ref-type="disp-formula" rid="eqn-10">(10)</xref> and comprises the three-phase voltage of each node in the three-phase distribution network:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mi>U</mml:mi><mml:mn>1</mml:mn><mml:mi>&#x03C6;</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mn>2</mml:mn><mml:mi>&#x03C6;</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mi>k</mml:mi><mml:mi>&#x03C6;</mml:mi></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>&#x03C6;</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>U</mml:mi><mml:mi>k</mml:mi><mml:mi>&#x03C6;</mml:mi></mml:msubsup></mml:math></inline-formula> represents the voltage magnitude on phase <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>&#x03C6;</mml:mi></mml:math></inline-formula> at node <italic>k</italic>.
<list list-type="simple">
<list-item><label>2)</label><p>Action:</p></list-item>
</list></p>
<p>The action must drive the agent from the current state to the next state. In this paper, the reactive power output of the inverter is selected as the action. Because the PPO algorithm used here outputs a probability distribution over action values, the action value is confined to a fixed range. The action in this paper is therefore expressed as <xref ref-type="disp-formula" rid="eqn-11">Eqs. (11)</xref> and <xref ref-type="disp-formula" rid="eqn-12">(12)</xref>.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mi mathvariant="bold-italic">a</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represents the reactive output of the three-phase inverter, i.e., <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>&#x03C6;</mml:mi></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represents the action space of the ith agent, where <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> represent the lower and upper limits of the action value space. During training, the action value is mapped onto the reactive power output range of the inverter.
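The clip-and-map step just described can be sketched as follows. The action bounds a_min = −15 and a_max = 15 follow Table 2; the inverter reactive-power limits q_min and q_max are hypothetical placeholders, not values from the paper:

```python
def map_action(a, a_min=-15.0, a_max=15.0, q_min=-100.0, q_max=100.0):
    """Clip a raw policy output into the action space [a_min, a_max],
    then linearly map it onto the inverter reactive-power range
    [q_min, q_max] (kvar). q_min/q_max are illustrative values."""
    a = max(a_min, min(a_max, a))            # keep the action inside A_i
    frac = (a - a_min) / (a_max - a_min)     # normalize to [0, 1]
    return q_min + frac * (q_max - q_min)    # scale to inverter limits
```

For example, an action at the lower bound a_min maps to the inverter's minimum reactive output, and the midpoint of the action space maps to zero reactive power.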
<list list-type="simple">
<list-item><label>3)</label><p>Reward:</p></list-item>
</list></p>
<p>The reward must steer the agent in the right direction so that it reaches the target. To keep the distribution network voltage within limits under high photovoltaic penetration, i.e., constraint <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>, the rewards used to construct the PPO algorithm in this paper are given in <xref ref-type="disp-formula" rid="eqn-14">Eqs. (14)</xref> and <xref ref-type="disp-formula" rid="eqn-15">(15)</xref>.
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mn>0.95</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi mathvariant="bold-italic">U</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">k</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">&#x03C6;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mn>1.05</mml:mn></mml:math></disp-formula></p>
<p>When a node voltage exceeds its limit after the agent acts, a large penalty <italic>M</italic> is applied to the out-of-limit portion; the reward function at the current time is expressed as <xref ref-type="disp-formula" rid="eqn-14">Eq. (14)</xref>.
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>M</mml:mi><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:munder><mml:mo>&#x2061;</mml:mo><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mi>&#x03C6;</mml:mi></mml:munder><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>u</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msubsup><mml:mi>U</mml:mi><mml:mi>k</mml:mi><mml:mi>&#x03C6;</mml:mi></mml:msubsup></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1.05</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>u</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>0.95</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msubsup><mml:mi>U</mml:mi><mml:mi>k</mml:mi><mml:mi>&#x03C6;</mml:mi></mml:msubsup></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where the relu function is a piecewise function that maps all negative values to 0 and leaves positive values unchanged. Therefore, when a node voltage exceeds its limit, the reward function <xref ref-type="disp-formula" rid="eqn-14">Eq. (14)</xref> drives the voltage back into the normal range.</p>
<p>When the node voltage does not exceed the limit after the agent acts, we set the reward function at the current time as follows [<xref ref-type="bibr" rid="ref-25">25</xref>]:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:math></inline-formula> denote the system network loss before and after the inverter acts, respectively. If the inverter's action reduces the network loss, the agent receives a positive reward; otherwise, it receives a corresponding negative reward.</p>
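The two reward branches of Eqs. (14) and (15) can be sketched as below. The penalty M = 10000 follows Table 2; the function name and argument layout are illustrative, not the authors' implementation:

```python
def reward(voltages, p_loss_0, p_loss, M=10000.0, v_lo=0.95, v_hi=1.05):
    """Reward of Eqs. (14)-(15): a large penalty M on any out-of-limit
    voltage, otherwise the reduction in network loss. `voltages` holds
    the magnitudes |U_k^phi| over all nodes k and phases phi."""
    relu = lambda x: max(x, 0.0)             # relu: negatives -> 0
    violation = sum(relu(abs(u) - v_hi) + relu(v_lo - abs(u))
                    for u in voltages)
    if violation > 0:                        # Eq. (14): voltage violated
        return -M * violation
    return p_loss_0 - p_loss                 # Eq. (15): loss reduction
```

With all voltages inside 0.95∼1.05 the reward is simply the loss reduction; any violation dominates it through the M-scaled penalty.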
</sec>
<sec id="s3_2"><label>3.2</label><title>Model Training Process</title>
<p>The training flow chart of PPO-based real-time voltage regulation of the distribution network is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Voltage real-time control training framework</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-1.png"/></fig>
<p>First, the parameters of the actor and critic networks are initialized, and the replay buffer capacity and related training parameters are set. An initial state <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> is randomly selected from the environment; the inverter action <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref> is selected according to the actor network's policy; the state and action are input into OpenDSS to obtain the next-time state <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msup><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>; the reward <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is obtained according to <xref ref-type="disp-formula" rid="eqn-14">Eqs. (14)</xref> and <xref ref-type="disp-formula" rid="eqn-15">(15)</xref>; and (<inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) is stored in the replay buffer. 
Next, <italic>L</italic> sample values (<inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>), <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>, are taken from the replay buffer; (<inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) is input into the critic network; the critic network parameters are updated according to <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>; and the advantage function is calculated from the critic network's output and <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>. 
Finally, (<inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) is input into the actor network; the probability ratio between the new and old policies for taking action <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> in state <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) is calculated according to <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>; the actor network's objective function is then computed from <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>; and its parameters are updated through <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref> to obtain the new policy.</p>
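The probability-ratio and clipped-objective step can be sketched as follows. Since Eqs. (5), (7) and (8) are not reproduced in this section, this is the standard PPO clipped surrogate with ε = 0.2 from Table 2, not necessarily the authors' exact formulation:

```python
import numpy as np

def ppo_actor_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate: the probability ratio pi_new/pi_old is
    clipped to [1 - eps, 1 + eps], so samples that would move the policy
    too far contribute no gradient (eps = 0.2, as in Table 2)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    advantage = np.asarray(advantage)
    # take the pessimistic (minimum) of unclipped and clipped terms
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))
```

When the new and old log-probabilities coincide, the ratio is 1 and the objective reduces to the mean advantage; a ratio far outside [1 − ε, 1 + ε] is flattened by the clip and yields no gradient.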
</sec>
</sec>
<sec id="s4"><label>4</label><title>Case Studies and Analysis</title>
<sec id="s4_1"><label>4.1</label><title>Case Design</title>
<p>In this paper, the IEEE 13-bus three-phase unbalanced distribution system [<xref ref-type="bibr" rid="ref-26">26</xref>] is used to test whether the PPO algorithm can realize voltage management. Four three-phase inverters are placed at nodes 645, 671, 652 and 692. The load at each node fluctuates randomly within 80&#x0025;&#x223C;120&#x0025; of its base value, and 1000 groups of randomly fluctuating training data are generated with OpenDSS, a comprehensive simulation tool for power distribution systems. The neural network determines the reactive power output of the inverters from the node voltages and network loss provided by the training data. The specific implementation process of the algorithm is shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
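The 80%∼120% fluctuation scheme above can be sketched as in this minimal data generator. The function name and base-load values are illustrative; in the paper the actual training states come from OpenDSS power-flow runs on the fluctuated loads:

```python
import numpy as np

def make_training_loads(base_loads, n_groups=1000, lo=0.8, hi=1.2, seed=0):
    """Scale each base load by an independent uniform factor in [lo, hi]
    to build n_groups fluctuation scenarios (80%-120%, 1000 groups as in
    the case study). Returns an (n_groups, n_loads) array."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_loads, dtype=float)
    factors = rng.uniform(lo, hi, size=(n_groups, base.size))
    return factors * base
```

Each row is one training scenario; feeding a row to the power-flow solver yields the node voltages and network loss that form the corresponding state.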
<table-wrap id="table-1"><label>Table 1</label><caption><title>Algorithm training process</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Algorithm 1 PPO Regulation Voltage Training Process</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">01: <bold>Input:</bold> IEEE 13-bus distribution system model and the action space <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> of agent <italic>i</italic>, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>.<break/>02: <bold>Initialization:</bold> randomly initialize critic network and actor network with parameters <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula>, and set training parameters <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>&#x03B3;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B7;</mml:mi><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>.<break/>03: <bold>for</bold> epi&#x2009;&#x003D;&#x2009;1 to <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> do<break/>04: &#x2002;&#x2002;&#x2002;&#x2002;Initialize state s, and obtain action <italic>a</italic> according to <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>, and get reward according to <xref ref-type="disp-formula" rid="eqn-14">Eqs. 
(14)</xref> and <xref ref-type="disp-formula" rid="eqn-15">(15)</xref>, and obtain the next state by OPENDSS then store (<inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) in replay buffer.<break/>05: &#x2002;&#x2002;&#x2002;&#x2002;Take <italic>l</italic> sample values (<inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) from the replay buffer, and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi></mml:math></inline-formula>.<break/>06: &#x2002;&#x2002;&#x2002;&#x2002;Update the critic network parameters <inline-formula id="ieqn-45"><mml:math 
id="mml-ieqn-45"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> according to (<inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) and <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.<break/>07: &#x2002;&#x2002;&#x2002;&#x2002;Compute advantage function according to (<inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>) and <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>.<break/>08: &#x2002;&#x2002;&#x2002;&#x2002;Compute objective function according to <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>.<break/>09: &#x2002;&#x2002;&#x2002;&#x2002;Update the actor network parameters <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> according to <xref ref-type="disp-formula" rid="eqn-8">Eq. 
(8)</xref>.<break/>10: <bold>end for</bold><break/>11: <bold>Output:</bold> critic network and actor network get new parameters <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mrow><mml:msup><mml:mi>&#x03D5;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mrow><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Specific description of the PPO neural networks: in the model designed in this paper, both the actor network and the critic network are fully connected. Taking the actor network as an example, the specific model is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The number of input-layer neurons is determined by the node voltages in the power system model; in this case it is 35, and batch normalization is applied at the output to enhance the robustness of the model. The number of hidden layers and the nodes per layer are closely related to the power grid structure; here the actor and critic networks adopt the same hidden-layer structure, each using two layers of 256 neurons with the relu activation function to enhance the nonlinear mapping ability of the whole network. The output layers of the actor and critic networks have 4 and 1 neurons, respectively, and both networks are trained with the Adam optimization algorithm.</p>
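A minimal sketch of the fully connected layout just described (35 inputs, two hidden layers of 256 relu units, 4 actor outputs and 1 critic output), written in plain NumPy and omitting the batch normalization and Adam training for brevity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(sizes, seed=0):
    """He-initialized weights and zero biases for a fully connected net."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: relu on the hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# 35 node-voltage inputs and two hidden layers of 256, as described above
actor_params = init_mlp([35, 256, 256, 4])    # 4 inverter actions
critic_params = init_mlp([35, 256, 256, 1])   # scalar state value
```

The actor's 4 outputs parameterize the action distribution for the four inverters, while the critic's single output estimates the state value.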
<fig id="fig-2"><label>Figure 2</label><caption><title>The neural network of actor network</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-2.png"/></fig>
<p>The specific algorithm and model training parameters are shown in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Algorithm and model training parameters</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Symbol</th>
<th align="left">Parameters value</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula></td>
<td align="left">0.2</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></td>
<td align="left">&#x02212;15</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></td>
<td align="left">15</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula></td>
<td align="left">0.99</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula></td>
<td align="left">0.0003</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>L</mml:mi></mml:math></inline-formula></td>
<td align="left">10</td>
</tr>
<tr>
<td align="left"><inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></td>
<td align="left">1000</td>
</tr>
<tr>
<td align="left"><italic>M</italic></td>
<td align="left">10000</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2"><label>4.2</label><title>Result Analysis</title>
<p>According to the neural network model and algorithm training process designed in the previous section, the training reward curve in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> and the number of actions taken by the agent in each episode in <xref ref-type="fig" rid="fig-4">Fig. 4</xref> are obtained. The reward curve in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows that, early in training, the agent has not yet learned an effective action strategy; the node voltages after the inverters act therefore fail to satisfy constraint <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>, and a negative reward is obtained according to <xref ref-type="disp-formula" rid="eqn-14">Eqs. (14)</xref> and <xref ref-type="disp-formula" rid="eqn-15">(15)</xref>. As training continues, the agent gradually moves in the correct direction and keeps obtaining positive rewards. After about 8000 training episodes, the algorithm essentially converges, and the action strategy selected by the agent consistently obtains a positive reward.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>PPO training process in the IEEE 13-bus system</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-3.png"/></fig>
<fig id="fig-4"><label>Figure 4</label><caption><title>Number of steps taken for 10000 training episode</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-4.png"/></fig>
<p>The algorithm in this paper allows the agent to act at most 10 times per episode; if the voltage exceeds its limit, the episode ends early and the next one begins. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows that, as training progresses, the number of actions per episode gradually increases and finally converges to 10.</p>
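The 10-action cap with early termination can be sketched as below; `step_fn` is a hypothetical callback standing in for one agent action plus power flow, returning True when a voltage violation ends the episode:

```python
def run_episode(step_fn, max_steps=10):
    """Let the agent act at most max_steps times per episode; a step that
    reports a voltage violation (step_fn() returns True) ends it early."""
    steps = 0
    for _ in range(max_steps):
        steps += 1
        if step_fn():                # True -> out-of-limit, stop early
            break
    return steps
```

A fully trained agent never triggers the early exit, so its step count converges to the cap of 10, matching the trend in Fig. 4.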
<p>The voltage fluctuation curves of the IEEE 13-bus three-phase unbalanced system before reactive power regulation are shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. The voltage fluctuates over a relatively wide range between 10:00&#x223C;15:00 and falls outside the safe operating limits. The voltage curves after reactive power regulation with the VVC method proposed in this paper are shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>: the agent's actions keep the voltage within 0.95&#x223C;1.05. <xref ref-type="fig" rid="fig-3 fig-4 fig-5 fig-6">Figs. 3&#x2013;6</xref> together illustrate that the algorithm designed in this paper achieves the intended voltage regulation.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Voltage value before system reactive power regulation</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-5.png"/></fig>
<fig id="fig-6"><label>Figure 6</label><caption><title>Voltage value after system reactive power regulation</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMES_21052-fig-6.png"/></fig>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>In this paper, a voltage regulation method based on PPO is proposed and verified on the IEEE 13-bus three-phase unbalanced distribution network. The node loads, photovoltaic units and inverters in the model form the DRL environment; through continuous interaction between the environment and the agent, the model automatically selects control actions and thereby realizes automatic voltage regulation in the distribution network. On the one hand, compared with traditional voltage regulation based on analytical optimization, the PPO algorithm avoids the loss of accuracy caused by approximating a nonlinear model with a linear one, and can quickly adjust the inverters even for complex distribution network models, speeding up voltage regulation. On the other hand, PPO's clipping operation removes the updates that would change the network parameters too violently, effectively screening the data: the clipped samples produce no gradient. Compared with the policy gradient algorithm, PPO therefore has higher stability and data efficiency.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work is supported by the Science and Technology Project of State Grid Zhejiang Electric Power Co., Ltd. under Grant B311JY21000A.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>1.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Kang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>L. C.</given-names></string-name>, <string-name><surname>Duan</surname>, <given-names>C. X.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>Integrated energy storage system based on triboelectric nanogenerator in electronic devices</article-title>. <source>Frontiers of Chemical Science and Engineering</source><italic>,</italic> <volume>15</volume><issue>(2)</issue><italic>,</italic> <fpage>238</fpage>&#x2013;<lpage>250</lpage>. DOI <pub-id pub-id-type="doi">10.1007/s11705-020-1956-3</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>2.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname>, <given-names>C. L.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>J. R.</given-names></string-name>, <string-name><surname>Cui</surname>, <given-names>Z. H.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Stacked bidirectional LSTM RNN to evaluate the remaining useful life of supercapacitor</article-title>. <source>International Journal of Energy Research</source><italic>,</italic> <volume>46</volume><issue>(3)</issue><italic>,</italic> <fpage>3034</fpage>&#x2013;<lpage>3043</lpage>. DOI <pub-id pub-id-type="doi">10.1002/er.736</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>3.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2020</year>). <article-title>Waste plastic triboelectric nanogenerators using recycled plastic bags for power generation</article-title>. <source>ACS Applied Materials &#x0026; Interfaces</source><italic>,</italic> <volume>13</volume><issue>(1)</issue><italic>,</italic> <fpage>400</fpage>&#x2013;<lpage>410</lpage>. DOI <pub-id pub-id-type="doi">10.1021/acsami.0c16489</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>4.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Sun</surname>, <given-names>J. R.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>L. C.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2021</year>). <article-title>State of charge estimation of composite energy storage systems with supercapacitors and lithium batteries</article-title>. <source>Complexity</source><italic>,</italic> <volume>2021</volume><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>15</lpage>. DOI <pub-id pub-id-type="doi">10.1155/2021/8816250</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>5.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cui</surname>, <given-names>Z. H.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>L. C.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2021</year>). <article-title>A comprehensive review on the state of charge estimation for lithium-ion battery based on neural network</article-title>. <source>International Journal of Energy Research</source><italic>,</italic> <volume>2021</volume><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>18</lpage>. DOI <pub-id pub-id-type="doi">10.1002/er.7545</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>6.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Qiu</surname>, <given-names>J.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Two-stage volt/var control in active distribution networks with multi-agent deep reinforcement learning method</article-title>. <source>IEEE Transactions on Smart Grid</source><italic>,</italic> <volume>12</volume><issue>(4)</issue><italic>,</italic> <fpage>2903</fpage>&#x2013;<lpage>2912</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TSG.2021.3052998</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>7.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hossain</surname>, <given-names>M. I.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Saha</surname>, <given-names>T. K.</given-names></string-name></person-group> (<year>2016</year>). <article-title>Investigation of the interaction between step voltage regulators and large-scale photovoltaic systems regarding voltage regulation and unbalance</article-title>. <source>IET Renewable Power Generation</source><italic>,</italic> <volume>10</volume><issue>(3)</issue><italic>,</italic> <fpage>299</fpage>&#x2013;<lpage>309</lpage>. DOI <pub-id pub-id-type="doi">10.1049/iet-rpg.2015.0086</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>8.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Viawan</surname>, <given-names>F. A.</given-names></string-name>, <string-name><surname>Karlsson</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2008</year>). <article-title>Voltage and reactive power control in systems with synchronous machine based distributed generation</article-title>. <source>IEEE Transactions on Power Delivery</source><italic>,</italic> <volume>23</volume><issue>(2)</issue><italic>,</italic> <fpage>1079</fpage>&#x2013;<lpage>1087</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TPWRD.2007.915870</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>9.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Saha</surname>, <given-names>T. K.</given-names></string-name></person-group> (<year>2018</year>). <article-title>Voltage management for large scale PV integration into weak distribution systems</article-title>. <source>IEEE Transactions on Smart Grid</source><italic>,</italic> <volume>9</volume><issue>(5)</issue><italic>,</italic> <fpage>4128</fpage>&#x2013;<lpage>4139</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TSG.5165411</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>10.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Taylor</surname>, <given-names>G. A.</given-names></string-name></person-group> (<year>2003</year>). <article-title>Volt/VAr control in distribution systems using a time-interval based approach</article-title>. <source>IEE Proceedings-Generation, Transmission and Distribution</source><italic>,</italic> <volume>150</volume><issue>(5)</issue><italic>,</italic> <fpage>548</fpage>&#x2013;<lpage>554</lpage>. DOI <pub-id pub-id-type="doi">10.1049/ip-gtd:20030562</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>11.</label><mixed-citation publication-type="other">IEEE Standard for Interconnection and Interoperability of Distributed Energy Resources with Associated Electric Power Systems Interfaces (2018). In: <source>IEEE Std 1547-2018 (Revision of IEEE Std 1547-2003)</source>, pp. <fpage>1</fpage>&#x2013;<lpage>138</lpage>. DOI <pub-id pub-id-type="doi">10.1109/IEEESTD.2018.8332112</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>12.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Farivar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Neal</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Clarke</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Low</surname>, <given-names>S.</given-names></string-name></person-group> (<year>2012</year>). <article-title>Optimal inverter VAR control in distribution systems with high PV penetration</article-title>. <conf-name>2012 IEEE Power and Energy Society General Meeting</conf-name>, pp. <fpage>1</fpage>&#x2013;<lpage>7</lpage>. <conf-loc>San Diego, CA, USA</conf-loc>.</mixed-citation></ref>
<ref id="ref-13"><label>13.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dall&#x0027;Anese</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Dhople</surname>, <given-names>S. V.</given-names></string-name>, <string-name><surname>Giannakis</surname>, <given-names>G. B.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Optimal dispatch of photovoltaic inverters in residential distribution systems</article-title>. <source>IEEE Transactions on Sustainable Energy</source><italic>,</italic> <volume>5</volume><issue>(2)</issue><italic>,</italic> <fpage>487</fpage>&#x2013;<lpage>497</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>14.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dall&#x0027;Anese</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Giannakis</surname>, <given-names>G. B.</given-names></string-name>, <string-name><surname>Wollenberg</surname>, <given-names>B. F.</given-names></string-name></person-group> (<year>2012</year>). <article-title>Optimization of unbalanced power distribution networks via semidefinite relaxation</article-title>. <conf-name>2012 North American Power Symposium (NAPS)</conf-name>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <conf-loc>Champaign, IL, USA</conf-loc>.</mixed-citation></ref>
<ref id="ref-15"><label>15.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sulc</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Backhaus</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Chertkov</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Optimal distributed control of reactive power via the alternating direction method of multipliers</article-title>. <source>IEEE Transactions on Energy Conversion</source><italic>,</italic> <volume>29</volume><issue>(4)</issue><italic>,</italic> <fpage>968</fpage>&#x2013;<lpage>977</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TEC.2014.2363196</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>16.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Demirok</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Gonz&#x00E1;lez</surname>, <given-names>P. C.</given-names></string-name>, <string-name><surname>Frederiksen</surname>, <given-names>K. H. B.</given-names></string-name>, <string-name><surname>Sera</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Rodriguez</surname>, <given-names>P.</given-names></string-name> <etal>et al.</etal></person-group> (<year>2011</year>). <article-title>Local reactive power control methods for overvoltage prevention of distributed solar inverters in low-voltage grids</article-title>. <source>IEEE Journal of Photovoltaics</source><italic>,</italic> <volume>1</volume><issue>(2)</issue><italic>,</italic> <fpage>174</fpage>&#x2013;<lpage>182</lpage>. DOI <pub-id pub-id-type="doi">10.1109/JPHOTOV.2011.2174821</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>17.</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Aghatehrani</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Golnas</surname>, <given-names>A.</given-names></string-name></person-group> (<year>2012</year>). <article-title>Reactive power control of photovoltaic systems based on the voltage sensitivity analysis</article-title>. <conf-name>2012 IEEE Power and Energy Society General Meeting</conf-name>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <conf-loc>San Diego, CA, USA</conf-loc>.</mixed-citation></ref>
<ref id="ref-18"><label>18.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jahangiri</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Aliprantis</surname>, <given-names>D.</given-names></string-name></person-group> (<year>2014</year>). <article-title>Distributed Volt/VAr control by PV inverters</article-title>. <source>IEEE Transactions on Power Systems</source><italic>,</italic> <volume>28</volume><issue>(3)</issue><italic>,</italic> <fpage>3429</fpage>&#x2013;<lpage>3439</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>19.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yuryevich</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wong</surname>, <given-names>K. P.</given-names></string-name></person-group> (<year>1999</year>). <article-title>Evolutionary programming based optimal power flow algorithm</article-title>. <source>IEEE Transactions on Power Systems</source><italic>,</italic> <volume>14</volume><issue>(4)</issue><italic>,</italic> <fpage>1245</fpage>&#x2013;<lpage>1250</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>20.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname>, <given-names>C. L.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2021</year>). <article-title>State-of-charge estimation and remaining useful life prediction of supercapacitors</article-title>. <source>Renewable and Sustainable Energy Reviews</source><italic>,</italic> <volume>150</volume><issue>(2)</issue><italic>,</italic> <fpage>111408</fpage>. DOI <pub-id pub-id-type="doi">10.1016/j.rser.2021.111408</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>21.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gan</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Thomas</surname>, <given-names>R. J.</given-names></string-name>, <string-name><surname>Zimmerman</surname>, <given-names>R. D.</given-names></string-name></person-group> (<year>2000</year>). <article-title>Stability-constrained optimal power flow</article-title>. <source>IEEE Transactions on Power Systems</source><italic>,</italic> <volume>15</volume><issue>(2)</issue><italic>,</italic> <fpage>535</fpage>&#x2013;<lpage>540</lpage>. DOI <pub-id pub-id-type="doi">10.1109/59.867137</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>22.</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Schulman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wolski</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Dhariwal</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Radford</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Klimov</surname>, <given-names>O.</given-names></string-name></person-group> (<year>2017</year>). <article-title>Proximal policy optimization algorithms</article-title>. arXiv preprint arXiv:1707.06347.</mixed-citation></ref>
<ref id="ref-23"><label>23.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hua</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>K.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Simultaneous unknown input and state estimation for the linear system with a rank-deficient distribution matrix</article-title>. <source>Mathematical Problems in Engineering</source><italic>,</italic> <volume>2021</volume><italic>,</italic> <fpage>1</fpage>&#x2013;<lpage>11</lpage>. DOI <pub-id pub-id-type="doi">10.1155/2021/6693690</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>24.</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lovric</surname>, <given-names>M.</given-names></string-name></person-group> (<year>2011</year>). <source>International encyclopedia of statistical science</source><italic>.</italic> <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>.</mixed-citation></ref>
<ref id="ref-25"><label>25.</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Y.</given-names></string-name></person-group> (<year>2021</year>). <article-title>Deep reinforcement learning based volt-VAR optimization in smart distribution systems</article-title>. <source>IEEE Transactions on Smart Grid</source><italic>,</italic> <volume>12</volume><issue>(1)</issue><italic>,</italic> <fpage>361</fpage>&#x2013;<lpage>371</lpage>. DOI <pub-id pub-id-type="doi">10.1109/TSG.5165411</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>26.</label><mixed-citation publication-type="web">IEEE Test Feeder Specifications (<year>2017</year>). <uri xlink:href="http://sites.ieee.org/pes-testfeeders/resources">http://sites.ieee.org/pes-testfeeders/resources</uri>.</mixed-citation></ref>
</ref-list>
</back>
</article>