<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">65465</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.065465</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Improved PPO-Based Task Offloading Strategies for Smart Grids</article-title>
<alt-title alt-title-type="left-running-head">Improved PPO-Based Task Offloading Strategies for Smart Grids</alt-title>
<alt-title alt-title-type="right-running-head">Improved PPO-Based Task Offloading Strategies for Smart Grids</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Qian</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhou</surname><given-names>Ya</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><email>12004042@xcu.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>College of Electrical Engineering, North China University of Water Resources and Electric Power</institution>, <addr-line>Zhengzhou, 450045</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Electrical Engineering, Xuchang University</institution>, <addr-line>Xuchang, 461000</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Ya Zhou. Email: <email>12004042@xcu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>03</day><month>07</month><year>2025</year>
</pub-date>
<volume>84</volume>
<issue>2</issue>
<fpage>3835</fpage>
<lpage>3856</lpage>
<history>
<date date-type="received">
<day>13</day>
<month>3</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>5</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_65465.pdf"></self-uri>
<abstract>
<p>Edge computing has transformed smart grids by lowering latency, reducing network congestion, and enabling real-time decision-making. Nevertheless, devising an optimal task-offloading strategy remains challenging, as it must jointly minimise energy consumption and response time under fluctuating workloads and volatile network conditions. We cast the offloading problem as a Markov Decision Process (MDP) and solve it with Deep Reinforcement Learning (DRL). Specifically, we present a three-tier architecture&#x2014;end devices, edge nodes, and a cloud server&#x2014;and enhance Proximal Policy Optimization (PPO) to learn adaptive, energy-aware policies. A Convolutional Neural Network (CNN) extracts high-level features from system states, enabling the agent to adapt continually to changing conditions. Extensive simulations show that the proposed method achieves substantially lower task latency and energy consumption than several baseline algorithms, thereby improving overall system performance. These results demonstrate the effectiveness and robustness of the framework for real-time task offloading in dynamic smart-grid environments.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Smart grid</kwd>
<kwd>task offloading</kwd>
<kwd>deep reinforcement learning</kwd>
<kwd>improved PPO algorithm</kwd>
<kwd>edge computing</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62103349</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Henan Province Science and Technology Research Project</funding-source>
<award-id>232102210104</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The large-scale integration of renewable energy sources and the digital transformation of power grids are imposing new stresses on traditional centralized infrastructures, including increased latency, congestion, and energy consumption [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>]. Modern information and communication technologies (ICT) enable smart grids to offer real-time monitoring and fine-grained dispatching; however, the high-frequency data streams generated by distributed energy resources (DERs) and massive Internet-of-Things (IoT) devices quickly overwhelm cloud-centric computing [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>].</p>
<p>Edge computing mitigates these bottlenecks by processing data in close proximity to its source, thereby sharply reducing round-trip latency and easing the load on core networks [<xref ref-type="bibr" rid="ref-8">8</xref>&#x2013;<xref ref-type="bibr" rid="ref-11">11</xref>]. Yet static or heuristic task-offloading schemes are ill-suited to the smart-grid context, where load fluctuations, link-quality variations, and privacy constraints are the norm. Deep reinforcement learning (DRL), with its ability to learn optimal policies through interaction, has therefore become a popular choice for dynamic offloading [<xref ref-type="bibr" rid="ref-12">12</xref>&#x2013;<xref ref-type="bibr" rid="ref-15">15</xref>]. Early methods&#x2014;such as Deep Q-Networks (DQN) and their derivatives&#x2014;lower mean latency but suffer from slow convergence and limited scalability in highly coupled, multi-variable settings.</p>
<p>To address these challenges, we cast task offloading in smart grids as a Markov Decision Process (MDP) and introduce a DRL-based framework that couples a convolutional neural network (CNN) for feature extraction with an enhanced Proximal Policy Optimization (PPO) algorithm. The proposed approach delivers adaptive, energy-aware scheduling, significantly reducing task latency and energy use in dynamic grid environments while improving overall system performance.</p>
<p>Major contributions:
<list list-type="bullet">
<list-item>
<p>MDP-based modeling across a three-tier architecture. We formulate task characteristics, system dynamics, and energy expenditure for end devices, edge nodes, and cloud servers under a unified MDP, enabling efficient task offloading.</p></list-item>
<list-item>
<p>CNN-enhanced PPO. By integrating a lightweight CNN encoder with an improved PPO scheme, we accelerate training and bolster adaptability to non-stationary conditions.</p></list-item>
<list-item>
<p>Comprehensive simulations. Extensive experiments under dynamic, multi-task scenarios demonstrate substantial gains in latency, energy savings, and resource utilization.</p></list-item>
</list></p>
<p>The remainder of this paper is organized as follows: <xref ref-type="sec" rid="s2">Section 2</xref> surveys related research; <xref ref-type="sec" rid="s3">Section 3</xref> details the system model; <xref ref-type="sec" rid="s4">Section 4</xref> reviews foundational concepts in deep reinforcement learning; <xref ref-type="sec" rid="s5">Section 5</xref> presents the MDP formulation; <xref ref-type="sec" rid="s6">Section 6</xref> describes the DRL-based offloading and scheduling strategy; <xref ref-type="sec" rid="s7">Section 7</xref> evaluates performance via simulations; and <xref ref-type="sec" rid="s8">Section 8</xref> concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Early studies relied on linear/non-linear and mixed-integer programming to solve offloading and resource-allocation problems [<xref ref-type="bibr" rid="ref-16">16</xref>]. While these methods can approximate globally optimal offline solutions, their computational complexity grows exponentially with network size, and they assume static links and loads, limiting real-time applicability.</p>
<p>To reduce complexity, subsequent work adopted greedy, threshold-based, or genetic heuristics [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>]. More recently, DRL has gained prominence for its model-free, online adaptability. For instance, the Task Prediction and Multi-Objective Optimization Algorithm (TPMOA) minimizes user wait and rendering delay in wireless virtual-reality offloading [<xref ref-type="bibr" rid="ref-19">19</xref>]. Hybrid-PPO, a customized PPO variant with parameterized discrete&#x2013;continuous action spaces, improves offloading efficiency [<xref ref-type="bibr" rid="ref-20">20</xref>]. Combining a Slime-Mould Algorithm (SMA) with an optimized Harris Hawks Optimizer (HHO), HS-HHO clusters tasks for edge&#x2013;cloud collaboration, reducing energy consumption alongside delay [<xref ref-type="bibr" rid="ref-21">21</xref>].</p>
<p>In power-IoT (PIoT) scenarios [<xref ref-type="bibr" rid="ref-22">22</xref>,<xref ref-type="bibr" rid="ref-23">23</xref>], offloading must honor the stringent real-time and reliability requirements of power-system operations. Prior art includes quota-matching offloading in wireless sensor networks [<xref ref-type="bibr" rid="ref-24">24</xref>], joint optimization of service caching [<xref ref-type="bibr" rid="ref-25">25</xref>], and Q-learning-driven hydro&#x2013;power co-scheduling [<xref ref-type="bibr" rid="ref-26">26</xref>]; these methods typically introduce grid-specific priorities or stochastic models to capture pulse-load characteristics. PPO, favored for its stability and implementation ease, has been applied in multi-agent form to cooperative offloading and resource allocation in small-cell MEC [<xref ref-type="bibr" rid="ref-27">27</xref>], vehicular networks [<xref ref-type="bibr" rid="ref-28">28</xref>], and fog&#x2013;edge hybrids [<xref ref-type="bibr" rid="ref-29">29</xref>], consistently improving delay and energy efficiency with good distributed scalability.</p>
<p>Most existing studies assume stable link bandwidth and homogeneous computing capacity, overlooking the peak-load spikes and link disturbances that frequently occur in smart grids. In addition, when faced with high-dimensional, coupled state variables&#x2014;such as link rate, task size, and residual central processing unit (CPU) cycles&#x2014;current models rarely employ lightweight feature extractors, resulting in significant inference delays. To address these shortcomings, we design a Convolutional Neural Network&#x2013;Proximal Policy Optimization (CNN&#x2013;PPO) framework: the CNN first distils salient features from the high-dimensional state space, and the resulting embeddings are fed into a shared-parameter actor&#x2013;critic network that estimates both the policy and the value function. This architecture enables real-time inference while substantially improving training stability and scalability.</p>
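As a rough illustration of the CNN&#x2013;PPO architecture sketched above (not the authors' implementation), the forward pass can be written in a few lines of NumPy: a 1-D convolutional encoder distils a feature vector from the high-dimensional state, which a shared embedding then feeds into separate policy and value heads. All dimensions and weights here are hypothetical placeholders for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, bias):
    """Valid 1-D convolution with ReLU: x is (C_in, L), kernels is (C_out, C_in, K)."""
    c_out, c_in, k = kernels.shape
    L = x.shape[1] - k + 1
    out = np.zeros((c_out, L))
    for o in range(c_out):
        for t in range(L):
            out[o, t] = np.sum(kernels[o] * x[:, t:t + k]) + bias[o]
    return np.maximum(out, 0.0)

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical state layout: 4 channels (link rate, task size, queue length,
# residual CPU cycles) observed over a window of 16 time slots.
C_IN, WIN, C_OUT, K, N_ACTIONS = 4, 16, 8, 3, 3   # actions: local / edge / cloud

# Randomly initialised parameters stand in for trained weights.
W_conv = rng.normal(0, 0.1, (C_OUT, C_IN, K))
b_conv = np.zeros(C_OUT)
W_pi = rng.normal(0, 0.1, (N_ACTIONS, C_OUT))     # policy (actor) head
W_v = rng.normal(0, 0.1, (1, C_OUT))              # value (critic) head

def forward(state):
    """Shared CNN embedding feeds both actor and critic heads."""
    feat = conv1d(state, W_conv, b_conv).mean(axis=1)   # global average pooling
    return softmax(W_pi @ feat), (W_v @ feat).item()    # (action probs, value)

state = rng.normal(0, 1, (C_IN, WIN))
policy, value = forward(state)
print(policy, value)
```

In PPO training, the action probabilities would drive the clipped-surrogate policy update while the value head supplies the advantage baseline; sharing the convolutional embedding between the two heads is what keeps inference lightweight.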
</sec>
<sec id="s3">
<label>3</label>
<title>System Model and Related Mathematical Formulation</title>
<sec id="s3_1">
<label>3.1</label>
<title>System Architecture</title>
<p>The smart grid integrates distributed energy resources, smart meters, electric vehicles, and other intelligent devices via advanced wireless networks and edge-computing infrastructure, forming a highly interactive, computation-intensive cyber-physical system. In this context, computing resources must not only satisfy the terminal devices&#x2019; stringent real-time requirements but also preserve overall grid stability.</p>
<p><xref ref-type="fig" rid="fig-1">Fig. 1</xref> presents a three-tier edge-computing architecture for smart grids comprising: (i) the terminal layer, populated by local processing units (LPUs); (ii) the edge layer, implemented via mobile-edge computing (MEC) nodes; and (iii) the cloud layer, represented by a distribution cloud center (CC). The cloud layer undertakes centralized processing and global coordination. The terminal layer encompasses heterogeneous electrical equipment, while the edge layer supplies intermediate computation and storage through edge nodes and micro-data servers. Each edge node aggregates sampled data from differential-protection terminals together with operational metrics from the distribution network, thereby enabling automated load monitoring, anomaly detection, power-quality assessment, and consumption analytics. The processed insights are then translated into control commands that regulate field devices in real time.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Hierarchical task offloading and execution framework</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-1.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Task Queue Model</title>
<p>In a smart-grid environment, every intelligent terminal generates a stream of application-driven tasks&#x2014;ranging from periodic data acquisition and anomaly detection to device-state monitoring, load forecasting, and advanced analytics. We model the aggregate arrival process as a Poisson process with intensity <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula>, which denotes the expected number of task arrivals within a given time interval. Each individual task <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is described by the tuple:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the task generation time, representing the moment a task is triggered or determined by the data sampling period. <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the size of the input data for the task, typically measured in bits. The size of the task is determined by the data volume to be processed, such as power consumption data collected by smart meters or real-time information obtained from sensors. <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the task&#x2019;s computation-to-data ratio (CVR), expressed in CPU cycles per bit. This parameter quantifies the computational complexity of the task. For instance, complex forecasting algorithms might have a higher CVR compared to simple state monitoring tasks.</p>
<p>Tasks are serviced under a finite&#x2013;buffer, first-come&#x2013;first-served (FCFS) discipline. When the buffer is full, additional arrivals are dropped, producing overflow events. Representing the queue as a fixed-size matrix&#x2014;each row corresponding to a single task&#x2014;facilitates efficient, dynamic updates as tasks are admitted or completed, thereby providing a tractable abstraction for subsequent scheduling and off-loading analysis.</p>
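The arrival and queueing dynamics described above can be sketched as a short simulation: Poisson task arrivals (sampled with Knuth's method), each task represented by the tuple of Eq. (1), admitted FCFS into a finite buffer with overflow drops. The numeric parameters are illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(42)

LAMBDA = 3.0        # arrival intensity λ (expected tasks per slot) -- assumed
BUFFER = 8          # finite buffer capacity (rows of the queue matrix) -- assumed
SLOTS = 100         # number of simulated time slots

def poisson(lam):
    """Sample a Poisson-distributed arrival count (Knuth's method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

queue, admitted, dropped = [], 0, 0
for t in range(SLOTS):
    for _ in range(poisson(LAMBDA)):
        d_i = random.randint(100_000, 1_000_000)   # input data size d_i (bits)
        k_i = random.uniform(50, 500)              # computation density k_i (cycles/bit)
        task = (t, d_i, k_i)                       # tuple J_i = (t_i^g, d_i, k_i), Eq. (1)
        if len(queue) < BUFFER:
            queue.append(task)                     # FCFS admission at the tail
            admitted += 1
        else:
            dropped += 1                           # buffer full: overflow event
    if queue:
        queue.pop(0)                               # serve the head-of-line task

print(admitted, dropped)
```

With λ larger than the one-task-per-slot service rate used here, the buffer saturates and overflow drops accumulate, which is exactly the congestion signal an offloading policy must learn to avoid.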
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Communication Model</title>
<p>In smart grids and edge-computing scenarios, the communication module is pivotal. Wireless channels, influenced by fading, interference, and device mobility, evolve dynamically. To capture these fluctuations, we employ a sinusoidal time-varying channel model that reflects the periodic changes in transmission rate commonly caused by traffic congestion or multipath propagation. Time is discretised into fixed-length slots; the channel state is assumed to remain constant within each slot but may differ from one slot to the next, thereby affecting both the achievable data rate and task-offloading decisions.</p>
<p>Let <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denote the instantaneous communication rate between smart terminal devices and the edge/cloud server. To reproduce the temporal dynamics outlined above, we model <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> at an arbitrary time t as:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mtext>avg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>R</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the transmission rate at time <italic>t</italic>, <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the average transmission rate, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>R</mml:mi></mml:math></inline-formula> is the amplitude of the rate fluctuation, representing the maximum deviation from the average rate, <italic>T</italic> is the period, indicating the duration of one cycle of fluctuation.</p>
<p>In this model, periodic fluctuations in transmission rates are suitable for two types of communication scenarios. For edge-to-edge communication, the transmission rate <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> (the data rate between devices and edge servers) with periodic variations can be expressed as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mtext>t</mml:mtext></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> denote the maximum and minimum edge transmission rates, respectively.</p>
<p>For cloud communication, the transmission rate <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> (the data rate between devices and the cloud server) includes a phase offset of 180 degrees, ensuring asynchronous dynamics with edge communication rates. <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the phase offset or time shift of the cloud communication system. This can be represented as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x22C5;</mml:mo><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo></mml:mrow></mml:msub></mml:math></inline-formula> denote the maximum and minimum cloud 
transmission rates, and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the phase offset introduced to simulate asynchronous fluctuations across different communication links.</p>
<p>We employ a sinusoidal model to capture the periodic fluctuations of wireless channels in smart-grid and edge-computing environments. Although this streamlined formulation omits complex phenomena such as multipath fading and sudden blockages, it nevertheless reflects the dominant variability observed in substation-level deployments with largely stationary nodes. Its low computational overhead makes the model well-suited to analysing task-offloading and resource-optimisation strategies. Future work could extend this framework by integrating more sophisticated channel models tailored to highly dynamic scenarios.</p>
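A minimal sketch of Eqs. (3) and (4) makes the asynchronous behaviour concrete: with phase offset <code>T2 = T/2</code> (a 180-degree shift), the edge link peaks exactly when the cloud link bottoms out. The rate bounds and period below are illustrative assumptions.

```python
import math

# Illustrative rate bounds (Mbit/s) and fluctuation period -- assumed values.
R1_MAX, R1_MIN = 20.0, 10.0   # edge link bounds R_{1,max}, R_{1,min}
R2_MAX, R2_MIN = 8.0, 4.0     # cloud link bounds R_{2,max}, R_{2,min}
T = 100.0                     # fluctuation period T (time slots)
T2 = T / 2                    # phase offset T2: half a period = 180 degrees

def r1(t):
    """Eq. (3): device-to-edge transmission rate R1(t)."""
    return (R1_MAX + R1_MIN) / 2 + (R1_MAX - R1_MIN) / 2 * math.sin(2 * math.pi * t / T)

def r2(t):
    """Eq. (4): device-to-cloud rate R2(t), shifted by the phase offset T2."""
    return (R2_MAX + R2_MIN) / 2 + (R2_MAX - R2_MIN) / 2 * math.sin(2 * math.pi * (t + T2) / T)

# At t = T/4 the sine term is +1 for the edge link and -1 for the cloud link,
# so r1 attains its maximum (20.0) while r2 attains its minimum (4.0).
print(r1(25.0), r2(25.0))
```

Per-slot samples of <code>r1</code> and <code>r2</code> are what the offloading agent would observe as the time-varying channel state.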
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Computational Model</title>
<p>In smart grids, computational tasks can be executed either locally on terminal devices via their on-board Local Processing Units (LPUs) or off-loaded to edge servers over wireless links. To characterise the resulting computation time and energy expenditure, we develop analytical models for local, edge, and cloud execution. These models quantify resource consumption and performance trade-offs among the three modes, thereby providing theoretical guidance for optimising task-offloading decisions.</p>
<p>(1) Local Execution Model. Under the local execution mode, tasks are processed by the LPU on terminal devices. LPUs typically have limited computational power but can efficiently handle latency-sensitive tasks. For a task <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> offloaded at time <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, the local execution time is given as:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2308;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mfrac><mml:mo>&#x2309;</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the CPU frequency of the LPU, measured in cycles per second, which determines the task execution efficiency. <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents computational demand of the task, expressed in CPU cycles.</p>
<p>The energy consumption of local computation depends on the power consumption model of the LPU. Generally, power <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> has a nonlinear relationship with CPU frequency, expressed as:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula>where <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>v</mml:mi></mml:math></inline-formula> are constants specific to the device. Thus, the total energy consumption for local execution is:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
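Eqs. (5)&#x2013;(7) compose directly into a local-cost calculation. The sketch below uses assumed LPU parameters (frequency, and the device constants &#x03B5; and v of the power model); v = 3 is a common choice for CMOS dynamic-power models, not a value taken from the paper.

```python
import math

# Illustrative LPU parameters -- assumptions, not values from the paper.
F_L = 1.0e9      # LPU CPU frequency f^l (cycles/s)
EPS = 1.0e-27    # device constant ε in p^l = ε (f^l)^v
V = 3.0          # exponent v (v = 3: typical CMOS dynamic-power model)

def local_time(d_i, k_i):
    """Eq. (5): t_i^l = ceil(d_i * k_i / f^l), in seconds."""
    return math.ceil(d_i * k_i / F_L)

def local_energy(d_i, k_i):
    """Eqs. (6)-(7): e_i^l = p^l * t_i^l with p^l = ε (f^l)^v."""
    p_l = EPS * F_L ** V      # Eq. (6): nonlinear power-frequency relation
    return p_l * local_time(d_i, k_i)

# Example task: 5 Mbit of input data at 300 cycles/bit -> 1.5e9 CPU cycles,
# i.e. ceil(1.5) = 2 s on a 1 GHz LPU at p^l = 1 W, hence 2 J.
d_i, k_i = 5_000_000, 300
print(local_time(d_i, k_i), local_energy(d_i, k_i))
```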
<p>(2) Edge Execution Model. In the edge execution mode, task data is first transmitted via wireless networks to an edge server and then processed at the server. For a task <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> offloaded at time <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, the total delay consists of two parts: data transmission delay and computation delay. It can be expressed as:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents data transmission delay, depending on the data size and current wireless channel state. 
<inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>: computation delay at the edge server, given as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2308;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:mo>&#x2309;</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the CPU frequency of the edge server. From an energy perspective, edge-execution energy consumption primarily occurs during data transmission. The energy consumption model is:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the power consumption rate for transmission, depending on the communication module and transmission distance.</p>
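The edge delay and energy model above, Eqs. (8)-(10), can be sketched in a few lines of Python. This is a minimal illustration: the helper names and all parameter values (data size, cycles per bit, CPU frequency, transmit power) are invented for the example and are not taken from the paper.

```python
import math

def edge_delay(d_i, k_i, f_s, t_tx1):
    """Total edge delay (Eq. 8): transmission delay plus computation delay (Eq. 9)."""
    t_exe = math.ceil(d_i * k_i / f_s)  # ceil(d_i * k_i / f_s), in time units
    return t_tx1 + t_exe

def edge_energy(p_tx1, t_tx1):
    """Device-side energy for edge execution (Eq. 10): transmit power times transmit time."""
    return p_tx1 * t_tx1

# Illustrative numbers: 4 Mbit task, 100 cycles/bit, 2 GHz edge CPU,
# 2 time units of transmission at 0.5 W.
delay = edge_delay(4e6, 100, 2e9, 2)   # 2 + ceil(0.2) = 3
energy = edge_energy(0.5, 2)           # 1.0
```

The ceiling in Eq. (9) models the fact that the server allocates whole time units to a task.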
<p>(3) Cloud Execution Model. In the cloud execution mode, tasks are offloaded to cloud servers for execution. Cloud servers have the highest computational power but incur higher transmission delays and energy costs due to the distance. The execution time for a task <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> offloaded to the cloud at time <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> includes both data transmission time and computation time:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents data transmission time from the terminal device to the cloud server. 
<inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> represents computation time on the cloud server, expressed as:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>c</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2308;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:mo>&#x2309;</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the CPU frequency of the cloud server.</p>
<p>(4) Cloud Energy Consumption. Cloud energy consumption includes the energy used for data transmission and the energy consumed by receiving results. Since cloud servers are not energy-constrained, the energy consumption of smart devices is primarily concentrated in the communication stage:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the transmission power rate, determined by the communication module and transmission distance.</p>
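Eqs. (11)-(13) mirror the edge model, differing only in the CPU frequency (<mml:math><mml:msub><mml:mi>f</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math> instead of <mml:math><mml:msub><mml:mi>f</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math>) and the transmission term. A hedged sketch with invented numbers shows the resulting trade-off: the cloud's faster CPU does not always compensate for its longer transmission time.

```python
import math

def venue_delay(d_i, k_i, f, t_tx):
    """Delay at a remote venue: transmission time plus ceil(d_i * k_i / f)
    (the shared form of Eqs. 8-9 and 11-12)."""
    return t_tx + math.ceil(d_i * k_i / f)

# Illustrative values only: the cloud CPU is 4x faster (f_c > f_s) but its
# transmission takes 3x longer (t_tx2 > t_tx1), so edge wins for this task.
edge_t  = venue_delay(4e6, 500, 2e9, 2)   # 2 + ceil(1.00) = 3
cloud_t = venue_delay(4e6, 500, 8e9, 6)   # 6 + ceil(0.25) = 7
```

Which venue is cheaper thus depends on the task's compute-to-data ratio <mml:math><mml:msub><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>, which is exactly the decision the scheduler must learn.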
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Objectives</title>
<p>In smart-grid environments, geographically distributed devices continually generate computational workloads&#x2014;including energy forecasting, real-time monitoring, and data analytics. Minimising latency and energy consumption therefore depends on selecting both an appropriate execution venue and an optimal execution schedule for each task. To address this challenge, we propose an optimisation framework that jointly allocates computational and communication resources while orchestrating task execution, thereby enhancing overall system performance.</p>
<p>For each task <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the system must first decide whether the task should be processed locally, offloaded to an edge server, or sent to a cloud server. We use a three-valued indicator <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to represent this choice: When <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, the task <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is executed locally on the device&#x2019;s LPU. This option is suitable for latency-sensitive tasks with relatively low computational demands, which can be processed quickly on device, thereby avoiding communication delays. When <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, the task <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is offloaded to an edge server for execution. Edge servers can significantly reduce task execution latency while avoiding the higher transmission latencies associated with cloud computing. 
When <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula>, the task <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is further offloaded to a cloud server for execution. Cloud servers are ideal for handling large-scale computationally intensive tasks but incur greater latency and energy consumption due to long-distance communication.</p>
<p>The total delay experienced by a task <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is defined as the time elapsed (in time units) from the task generation moment <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> until the task finishes executing, where the execution itself takes <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> time units. Therefore, the delay can be expressed as a function of the scheduling decision <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the task&#x2019;s start time <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, given by:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the time at which task execution begins, <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the time required to execute the task at the chosen location, which can be further defined as:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>exe</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
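The indicator decomposition used in Eqs. (15)-(16) can be sketched directly: the decision <mml:math><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math> &#x2208; {0, 1, 2} maps to the indicator pair (<mml:math><mml:msubsup><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>e</mml:mi></mml:msubsup></mml:math>, <mml:math><mml:msubsup><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi></mml:msubsup></mml:math>), and exactly one of the three terms survives. The helper names below are hypothetical.

```python
def indicators(a_i):
    """Map a_i in {0, 1, 2} to the indicator pair (a_i^e, a_i^c):
    (0, 0) local, (1, 0) edge, (0, 1) cloud."""
    return (1, 0) if a_i == 1 else (0, 1) if a_i == 2 else (0, 0)

def exec_time(a_i, t_local, t_edge, t_cloud):
    """Execution time per Eq. (15): one term is selected by the indicators."""
    a_e, a_c = indicators(a_i)
    return (1 - a_e - a_c) * t_local + a_e * t_edge + a_c * t_cloud

# With illustrative per-venue times (10, 3, 7), each decision picks one venue:
# exec_time(0, 10, 3, 7) -> 10, exec_time(1, 10, 3, 7) -> 3, exec_time(2, 10, 3, 7) -> 7
```

Eq. (16) applies the same selection to the three energy terms.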
<p>Similarly, combining the system scheduling model, the energy consumption for task execution can be expressed as a function of the scheduling decision and execution time:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where the three energy terms correspond to local, edge, and cloud execution, respectively, as defined above. Let <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> denote the weight coefficient for task delay cost, representing the importance of minimizing delay, and 
<inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> the weight coefficient for energy-consumption cost, indicating the priority of reducing energy usage. The system&#x2019;s optimization objective is to minimize the total cost of the scheduling strategy, defined as the weighted sum of task delay and energy consumption. The comprehensive cost function is:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>To optimize the execution efficiency of intelligent tasks in the smart grid, the system&#x2019;s objective is to minimize the average cost of all tasks generated within a specified time period <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>T</mml:mi></mml:math></inline-formula>. The average cost is defined as:
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:munder><mml:munder><mml:mo movablelimits="true" form="prefix">lim</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
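The per-task cost of Eq. (17) and the empirical average objective of Eq. (18) can be sketched as follows. The weights alpha and beta below are illustrative defaults, not values reported in the paper.

```python
def task_cost(delay, energy, alpha=1.0, beta=0.5):
    """Weighted per-task cost (Eq. 17): alpha * l_i + beta * e_i.
    alpha and beta are illustrative weights chosen for this example."""
    return alpha * delay + beta * energy

def average_cost(costs):
    """Empirical average cost over n completed tasks, the finite-horizon
    counterpart of the long-run objective in Eq. (18)."""
    return sum(costs) / len(costs)

# Three illustrative (delay, energy) pairs:
costs = [task_cost(d, e) for d, e in [(3, 1.0), (7, 2.0), (10, 4.0)]]
avg = average_cost(costs)  # (3.5 + 8.0 + 12.0) / 3
```

In practice the scheduler cannot evaluate this average in closed form, which motivates the learning-based approach of the next section.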
<p>Our model minimizes the average cost per task to optimize long-term system performance despite challenges from random task arrivals and unpredictable wireless conditions. In dynamic smart-grid scenarios, where task arrival rates, data volumes, and resource availability all vary over time, static scheduling methods fall short. We therefore propose a Deep Reinforcement Learning (DRL) approach that achieves efficient task offloading and scheduling through adaptive, continuous learning.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Background of Deep Reinforcement Learning (DRL)</title>
<p>Deep Reinforcement Learning (DRL) is an enhancement of traditional reinforcement learning (RL) that introduces deep neural networks (DNNs) to approximate state representations and functions. The core concept of RL is to enable an intelligent agent to interact with its environment and learn an optimal strategy through continuous exploration. In reinforcement learning, at each time step <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>n</mml:mi></mml:math></inline-formula>, the agent observes the environment state <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and selects an action <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mrow><mml:mtext mathvariant="italic">a</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">n</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> from the action space <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>A</mml:mi></mml:math></inline-formula>. The action is chosen based on a policy <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mspace width="0pt" /><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which defines the probability of executing action <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in state <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Upon executing the action, the environment transitions to a new state <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and returns a reward signal <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mrow><mml:mtext mathvariant="italic">r</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext mathvariant="italic">n</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, determined by the transition probability <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and the reward function <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. 
This process continues iteratively, starting from an initial state <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, and the cumulative reward expectation <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is expressed as:
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>&#x03B3;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is the discount factor, used to balance immediate and future rewards. By maximizing the expected value of <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the agent can learn an optimal strategy to achieve the highest long-term reward.</p>
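The discounted return of Eq. (19) can be computed for a finite reward sequence as a backward fold; truncating the infinite sum at the end of the observed trajectory is an assumption made here for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Finite-horizon estimate of G_m = sum_l gamma^l * r_{m+l} (Eq. 19),
    accumulated from the last reward backwards: G = r + gamma * G_next."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# discounted_return([1, 1, 1], gamma=0.5) = 1 + 0.5 + 0.25 = 1.75
```

The backward recursion makes the role of the discount factor explicit: each step further into the future is scaled by another factor of gamma.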
<p>In the mathematical framework of RL, the problem is typically defined as a Markov Decision Process (MDP):
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>S</mml:mi></mml:math></inline-formula> is the state space, comprising all possible environmental states; <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>A</mml:mi></mml:math></inline-formula> is the action space, comprising all actions available to the agent; <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>P</mml:mi></mml:math></inline-formula> is the state-transition probability, describing the likelihood of moving from state <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mi>s</mml:mi></mml:math></inline-formula> to state <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> after executing action <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>a</mml:mi></mml:math></inline-formula>; <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>R</mml:mi></mml:math></inline-formula> is the reward function, quantifying the immediate reward for performing an action in a given state; and <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is the discount factor, controlling the importance of future rewards. The goal of RL is to find a policy <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>&#x03C0;</mml:mi></mml:math></inline-formula> that maximizes the expected cumulative reward, defined as:
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, referred to as the state value function, represents the cumulative reward expectation for a specific state under policy <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi>&#x03C0;</mml:mi></mml:math></inline-formula>. For a specific state-action pair, the action-value function is defined as:
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The goal of DRL is to find an optimal policy <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> that maximizes the expected cumulative reward from any state:
<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>s</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi></mml:math></disp-formula></p>
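The optimality condition of Eq. (23), <mml:math><mml:msub><mml:mi>v</mml:mi><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:msub><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mo>max</mml:mo><mml:mi>a</mml:mi></mml:msub><mml:msub><mml:mi>q</mml:mi><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mo>&#x2217;</mml:mo></mml:msup></mml:msub><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:math>, is the backup applied by value iteration. The toy two-state MDP below is entirely invented for illustration; it is a sketch of the principle, not of the paper's method.

```python
# P[s][a] = list of (probability, next_state, reward) triples -- an invented MDP.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
v = {0: 0.0, 1: 0.0}
for _ in range(500):  # iterate the Bellman optimality backup to convergence
    v = {s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
# Fixed point: v[1] = 2 / (1 - 0.9) = 20, v[0] = 1 + 0.9 * 20 = 19.
```

Each sweep replaces v(s) with the best one-step lookahead value, so v converges to v* and the greedy action at each state recovers the optimal policy.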
<p>In practical applications, DRL uses deep neural networks (DNNs) to approximate policies and value functions. Leveraging the feature representation capabilities of DNNs, DRL can adapt to large-scale state spaces. Currently, DRL methods are categorized into two major approaches: value-based methods and policy-based methods.</p>
<p>Value-Based Methods. In value-based methods, DNNs are employed to approximate the value function, typified by the Deep Q-Network (DQN) and its variants. The core idea is to minimize the loss between the DNN-predicted value and the target value, formally expressed as:
<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the state at time step <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mi>n</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the value function parameterized by <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula>, representing 
the network&#x2019;s training weights.</p>
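As a concrete illustration of the value-regression loss in Eq. (24), the following sketch computes the mean squared error between predicted and target state values over a sampled batch. The batch values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def value_loss(v_pred, v_target):
    """Mean squared error between predicted and target state values,
    i.e. L^V(theta) = E_n[(v_target(s_n) - v(s_n; theta))^2]."""
    v_pred = np.asarray(v_pred, dtype=float)
    v_target = np.asarray(v_target, dtype=float)
    return float(np.mean((v_target - v_pred) ** 2))

# Toy batch: two states with predicted and target values.
loss = value_loss([1.0, 0.5], [1.5, 0.0])  # -> 0.25
```

In practice the targets would come from bootstrapped returns and the prediction from the Q-network; here both are fixed arrays to keep the example self-contained.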
<p>Policy-Based Methods. Policy-based methods use DNNs to approximate the parameterized policy directly, known as the policy network. Typical policy-based algorithms include REINFORCE and Actor-Critic; Actor-Critic methods additionally learn a value baseline, which improves sample efficiency and learning stability. A common policy gradient update equation is:
<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mi>P</mml:mi><mml:mi>G</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the probability of selecting action <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in state <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula 
id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> is the network weights, and <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the advantage function, which measures the relative quality of action <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
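The policy-gradient estimator of Eq. (25) can be sketched for a softmax policy over discrete actions. The softmax parameterization and the toy shapes below are assumptions made for illustration only; the paper does not prescribe a particular policy architecture.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    logits = np.asarray(logits, dtype=float)
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def log_policy_grad(logits, action):
    """Gradient of log pi(a|s) w.r.t. the logits of a softmax policy:
    one-hot(action) minus the action probabilities."""
    probs = softmax(logits)
    grad = -probs
    grad[action] += 1.0
    return grad

def policy_gradient(logits_batch, actions, advantages):
    """Sample average of grad log pi(a_n|s_n; theta) * A_hat_n, as in Eq. (25)."""
    grads = [log_policy_grad(l, a) * adv
             for l, a, adv in zip(logits_batch, actions, advantages)]
    return np.mean(grads, axis=0)
```

With uniform logits `[0, 0]`, taking action 0 with advantage 1 pushes probability mass toward that action; a negative advantage would push it away.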
<p>Proximal Policy Optimization (PPO). To obtain stable advantage estimates while maintaining effective exploration, the Generalized Advantage Estimation (GAE) method is introduced to balance bias and variance:
<disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:msubsup><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>GAE</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>&#x03D5;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mi>&#x03B7;</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msubsup><mml:mi>&#x03B7;</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the temporal difference (TD) error at step <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>n</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi>&#x03D5;</mml:mi></mml:math></inline-formula> adjusts the balance between bias and variance. Additionally, the PPO algorithm introduces a clipped objective function to limit policy updates, ensuring stability and improving model robustness:
<disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>CLIP</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mtext>clip</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mo>&#x03F5;</mml:mo><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mo>&#x03F5;</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-96"><mml:math 
id="mml-ieqn-96"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mtext>old</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>, <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mo>&#x03F5;</mml:mo></mml:math></inline-formula> controls the range of policy updates.</p>
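A minimal sketch of the GAE recursion (Eq. (26)) and the clipped surrogate objective (Eq. (27)), assuming a finite trajectory with no bootstrapped tail. Symbol names (gamma, phi, eps) mirror the paper's notation; the probability ratios are passed in directly rather than computed from a network.

```python
import numpy as np

def gae(td_errors, gamma, phi):
    """A_hat_n = sum_l (gamma*phi)^l * eta_{n+l}, computed by a
    backward recursion over a finite trajectory of TD errors."""
    td_errors = np.asarray(td_errors, dtype=float)
    adv = np.zeros_like(td_errors)
    running = 0.0
    for n in reversed(range(len(td_errors))):
        running = td_errors[n] + gamma * phi * running
        adv[n] = running
    return adv

def clipped_objective(ratios, advantages, eps=0.2):
    """L^CLIP = E_n[min(r_n * A_hat_n, clip(r_n, 1-eps, 1+eps) * A_hat_n)],
    where r_n = pi(a_n|s_n; theta) / pi(a_n|s_n; theta_old)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
```

The `min` in the objective removes any incentive to push the ratio beyond the clipping band: with a positive advantage, a ratio of 1.5 earns no more than the clipped 1 + eps.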
<p>Based on the above framework, this paper employs the PPO-based DRL method to design the task offloading strategy in smart grids. The clipped objective ensures stable policy updates while enhancing the model&#x2019;s adaptability, enabling the system to make efficient scheduling decisions in dynamic communication environments and achieve resource optimization.</p>
</sec>
<sec id="s5">
<label>5</label>
<title>MDP Formulation</title>
<p>We address the task-offloading problem with deep reinforcement learning (DRL). By casting dynamic task allocation as a Markov decision process (MDP), we leverage DRL to learn an offloading policy that maximizes efficiency.</p>
<sec id="s5_1">
<label>5.1</label>
<title> State Space</title>
<p>At each time step during smart grid task offloading and scheduling, the system orchestrator monitors the current system state and determines offloading decisions. To accurately represent tasks, computational resources, and communication link dynamics, we define the state space as a collection of relevant variables, formally expressed as:
<disp-formula id="eqn-28"><label>(28)</label><mml:math id="mml-eqn-28" display="block"><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where the state space <italic>S</italic> includes multiple key variables describing task execution states at the local, edge server, and cloud server levels, as well as data transmission conditions and network states. These components are detailed as follows:</p>
<p>(1) We define the task queue <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>Q</mml:mi></mml:math></inline-formula> as an <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>M</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> matrix, where each row stores a task&#x2019;s generation time, data size, and computational demand, with entry <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>Q</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> giving the <italic>j</italic>-th attribute of the <italic>t</italic>-th task. These attributes facilitate evaluating queue latency, prioritizing time-sensitive tasks, determining offloading bandwidth to edge or cloud servers, and estimating the CPU cycles required for execution.</p>
<p>(2) Local Processing Unit State <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>: The Local Processing Unit (LPU) is responsible for handling computational tasks at the terminal device level. The state <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> represents the remaining CPU cycles available for processing tasks. At time t, if the orchestrator schedules a task to be executed locally, <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is updated based on the computational demand of the task <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the task&#x2019;s data size and <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the computational intensity. As time progresses, the available computational resources gradually decrease, expressed as:
<disp-formula id="eqn-29"><label>(29)</label><mml:math id="mml-eqn-29" display="block"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the fixed computational capacity of the LPU in CPU cycles.</p>
<p>(3) Edge Server Transmission Queue State <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>: This state describes the remaining data to be transmitted from terminal devices to the edge server via the wireless network. At time t, if a task is offloaded to the edge server, <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is initialized with the task&#x2019;s data size <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The transmission process depends on the channel&#x2019;s transmission rate <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and the state is updated as:
<disp-formula id="eqn-30"><label>(30)</label><mml:math id="mml-eqn-30" display="block"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>when <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> &#x003D; 0, the task has been successfully transmitted to the edge server.</p>
<p>(4) Cloud Transmission Queue State <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>: This state represents the remaining data to be transmitted from terminal devices to the cloud server. If a task is offloaded to the cloud, <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is initialized with the task&#x2019;s data size <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The state is updated dynamically based on the cloud transmission rate <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>:
<disp-formula id="eqn-31"><label>(31)</label><mml:math id="mml-eqn-31" display="block"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>(5) Edge Server State <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>: The edge server is an extension of local processing that allows tasks to be offloaded for execution. <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> represents the remaining CPU cycles at the edge server. At time t, if a task is offloaded, <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is updated as:
<disp-formula id="eqn-32"><label>(32)</label><mml:math id="mml-eqn-32" display="block"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the computational capacity of the edge server.</p>
<p>(6) Cloud Server State <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>: The cloud center provides extensive computational power for large-scale tasks. <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> represents the remaining computational resources in the cloud. When a task is processed, the state is updated as:
<disp-formula id="eqn-33"><label>(33)</label><mml:math id="mml-eqn-33" display="block"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is the computational capacity of the cloud server.</p>
<p>(7) Transmission Rates R: Transmission rates <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represent the data rates between terminal devices, edge servers, and cloud servers. These rates depend on the current channel conditions and device locations, directly influencing the dynamics of <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>tq</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
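The state updates in Eqs. (29)–(33) all share one pattern: each resource or queue component decreases by its processing or transmission rate per slot, floored at zero. A minimal sketch of one slot of these updates, with purely illustrative capacities and rates:

```python
def step_states(state, f_l, f_s, f_cc, r1, r2):
    """Advance all state components by one time slot.
    state maps component name -> remaining cycles/data; each component
    decays by its rate and is floored at zero, as in Eqs. (29)-(33)."""
    decay = {"lpu": f_l, "tq1": r1, "tq2": r2, "mec": f_s, "cc": f_cc}
    return {k: max(state[k] - decay[k], 0.0) for k in state}

# Illustrative values only: remaining work per component and per-slot rates.
state = {"lpu": 5.0, "tq1": 3.0, "tq2": 0.0, "mec": 10.0, "cc": 100.0}
state = step_states(state, f_l=2.0, f_s=4.0, f_cc=50.0, r1=1.5, r2=1.0)
# -> {"lpu": 3.0, "tq1": 1.5, "tq2": 0.0, "mec": 6.0, "cc": 50.0}
```

A queue component reaching zero signals completion, e.g. `tq1 == 0` means the task's data has fully arrived at the edge server.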
</sec>
<sec id="s5_2">
<label>5.2</label>
<title> Action Space</title>
<p>Within the devised MDP framework, the action space comprises three principal task-scheduling options: Local Execution (LE), Edge Processing (EP), and Cloud Processing (CP). These alternatives are selected to optimize task offloading in smart-grid environments by striking an effective balance between execution latency and energy consumption. A detailed description of each action type follows.</p>
<p>1) Local Execution (LE): The Local Execution action assigns tasks from the queue to the Local Processing Unit (LPU) for execution. The complete set of actions can be expressed as:
<disp-formula id="eqn-34"><label>(34)</label><mml:math id="mml-eqn-34" display="block"><mml:mi>L</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mi>L</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mi>L</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>Q</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> denotes the selection of the <italic>i</italic>-th task in the task queue for processing by the LPU. This action is valid only when the LPU is currently idle and the <italic>i</italic>-th task exists in the task queue.</p>
<p>When <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>L</mml:mi><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is selected, the <italic>i</italic>-th task is removed from the queue, and the LPU state is updated dynamically based on the task&#x2019;s attributes. The task is then executed according to the defined time slices.</p>
<p>2) Edge Processing (EP): The Edge Processing action refers to offloading tasks from the queue to the Edge Server for execution. This set of actions is defined as:
<disp-formula id="eqn-35"><label>(35)</label><mml:math id="mml-eqn-35" display="block"><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mi>E</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>E</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>E</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>E</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the selection of the <italic>i</italic>-th task in the task queue for offloading to the edge server. This action is valid only when Data Transmission Unit 1 (DTU1) is idle and the <italic>i</italic>-th task exists in the task queue.</p>
<p>When <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>E</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is selected, the <italic>i</italic>-th task is removed from the queue, and DTU1 initiates the transmission of task data to the edge server. Upon successful data transmission, the edge server begins processing the task, and DTU1 returns to an idle state.</p>
<p>3) Cloud Processing (CP): The Cloud Processing action refers to offloading tasks from the queue to the Cloud Server for execution. This set of actions is defined as:
<disp-formula id="eqn-36"><label>(36)</label><mml:math id="mml-eqn-36" display="block"><mml:mi>C</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mi>C</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>C</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the selection of the <italic>i</italic>-th task in the task queue for offloading to the cloud server. This action is valid only when Data Transmission Unit 2 (DTU2) is idle and the <italic>i</italic>-th task exists in the task queue.</p>
<p>When <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>C</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is selected, the <italic>i</italic>-th task is removed from the queue, and DTU2 initiates the transmission of task data to the cloud server. Upon successful data reception, the cloud server processes the task with its higher computational capacity, returning the results to the local device, and DTU2 returns to an idle state.</p>
<p>4) We introduce a dynamic action selection mechanism that evaluates the system&#x2019;s state&#x2014;task queue, LPU, and DTUs&#x2014;at each time step. This mechanism adaptively chooses valid actions (Local Execution, Edge Processing, Cloud Processing) based on real-time conditions to optimize resource utilization and minimize delays. For instance, if the task queue is non-empty and the LPU is idle, local execution reduces communication latency. If tasks demand more resources, offloading to edge or cloud servers becomes available when DTUs are idle (<xref ref-type="table" rid="table-1">Table 1</xref>). This dynamic scheduling mechanism ensures effective resource allocation and improved system performance.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Task execution status and actions under different scenarios</title>
</caption>
<table>
<colgroup>
<col/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Case ID</th>
<th align="center">Task queue status</th>
<th align="center">LPU status</th>
<th align="center">DTU1 status</th>
<th align="center">DTU2 status</th>
<th align="center">Local execution (LE)</th>
<th align="center">Edge execution (EP)</th>
<th align="center">Cloud execution (CP)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Non-empty</td>
<td>Busy</td>
<td>Empty</td>
<td>Empty</td>
<td>Legal</td>
<td>Illegal</td>
<td>Illegal</td>
</tr>
<tr>
<td>2</td>
<td>Non-empty</td>
<td>Busy</td>
<td>Busy</td>
<td>Empty</td>
<td>Legal</td>
<td>Illegal</td>
<td>Illegal</td>
</tr>
<tr>
<td>3</td>
<td>Non-empty</td>
<td>Empty</td>
<td>Busy</td>
<td>Empty</td>
<td>Illegal</td>
<td>Legal</td>
<td>Illegal</td>
</tr>
<tr>
<td>4</td>
<td>Non-empty</td>
<td>Empty</td>
<td>Empty</td>
<td>Busy</td>
<td>Illegal</td>
<td>Legal</td>
<td>Illegal</td>
</tr>
<tr>
<td>5</td>
<td>Non-empty</td>
<td>Busy</td>
<td>Busy</td>
<td>Busy</td>
<td>Legal</td>
<td>Illegal</td>
<td>Illegal</td>
</tr>
<tr>
<td>6</td>
<td>Non-empty</td>
<td>Empty</td>
<td>Busy</td>
<td>Busy</td>
<td>Illegal</td>
<td>Legal</td>
<td>Illegal</td>
</tr>
<tr>
<td>7</td>
<td>Empty</td>
<td>Empty</td>
<td>Empty</td>
<td>Empty</td>
<td>Illegal</td>
<td>Illegal</td>
<td>Illegal</td>
</tr>
</tbody>
</table>
</table-wrap>
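<p>The validity conditions stated above (local execution requires an idle LPU, edge processing an idle DTU1, cloud processing an idle DTU2, and all three a non-empty queue) can be sketched as a small helper. This is an illustrative Python sketch, not the authors' implementation; the class and action labels are hypothetical.</p>

```python
# Hypothetical sketch of the dynamic action-selection rules; names are illustrative.
from dataclasses import dataclass

@dataclass
class SystemState:
    queue_len: int     # number of tasks waiting in the task queue
    lpu_busy: bool     # Local Processing Unit occupied?
    dtu1_busy: bool    # DTU1 (edge transmission unit) occupied?
    dtu2_busy: bool    # DTU2 (cloud transmission unit) occupied?

def valid_actions(s: SystemState) -> list:
    """Return the action types that are legal in the current state."""
    actions = []
    if s.queue_len > 0:                 # every action needs a task to act on
        if not s.lpu_busy:
            actions.append("LE")        # Local Execution needs an idle LPU
        if not s.dtu1_busy:
            actions.append("EP")        # Edge Processing needs an idle DTU1
        if not s.dtu2_busy:
            actions.append("CP")        # Cloud Processing needs an idle DTU2
    return actions
```

A scheduler would intersect this set with the policy's preferred action at every time step, which is exactly the role of the dynamic action selection mechanism described above.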
</sec>
<sec id="s5_3">
<label>5.3</label>
<title> Reward Function</title>
<p>In a three-layer smart grid framework (terminal devices, edge servers, and cloud servers), the reward function optimizes task offloading by balancing delay and energy consumption.</p>
<p>Delay and Energy Consumption Calculation: The orchestrator schedules task offloading decisions at discrete time intervals. Assume that at time <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, an action <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is selected, transitioning the system state from <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. For all tasks at time <italic>t</italic>, the total delay and energy consumption are quantified based on the task&#x2019;s execution or transmission conditions. At time <italic>t</italic>, the total delay <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> for all tasks can be expressed as:
<disp-formula id="eqn-37"><label>(37)</label><mml:math id="mml-eqn-37" display="block"><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>q</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>dtu</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>mec</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>dtu</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>cc</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>q</italic>[<italic>t</italic>] is the number of tasks in the task queue. <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> is an indicator function evaluating whether specific states <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>dtu</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> are active. The total delay from state <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is then aggregated as:
<disp-formula id="eqn-38"><label>(38)</label><mml:math id="mml-eqn-38" display="block"><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
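<p>As a numerical sanity check, Eqs. (37) and (38) reduce to a few lines of Python. This is an illustrative sketch with hypothetical helper names, not the authors' code; each slot is represented here as a tuple of the queue length and the five unit states.</p>

```python
def slot_delay(q_t, s_lpu, s_dtu1, s_mec, s_dtu2, s_cc):
    """Eq. (37): tasks waiting in the queue plus one delay unit
    for each busy processing or transmission stage."""
    busy_stages = (s_lpu, s_dtu1, s_mec, s_dtu2, s_cc)
    return q_t + sum(1 for s in busy_stages if s > 0)

def transition_delay(slots):
    """Eq. (38): accumulate per-slot delays over the slots
    elapsed between states s_n and s_{n+1}."""
    return sum(slot_delay(*slot) for slot in slots)
```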
<p>At time <italic>t</italic>, the total energy consumption <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is related to local computation and data transmission. It is given by:
<disp-formula id="eqn-39"><label>(39)</label><mml:math id="mml-eqn-39" display="block"><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>lpu</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>dtu</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>tx</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>dtu</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>t</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Thus, the total energy consumption during the transition is:
<disp-formula id="eqn-40"><label>(40)</label><mml:math id="mml-eqn-40" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
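<p>Eqs. (39) and (40) can likewise be sketched directly. The transmission powers below follow the simulation settings in Table 2; the local power value 0.832 W is an assumed default derived from the power model, and the function names are illustrative.</p>

```python
def slot_energy(s_lpu, s_dtu1, s_dtu2, p_l=0.832, p_tx1=2.5, p_tx2=3.5):
    """Eq. (39): per-slot energy from local computation (LPU)
    and the two transmission units (DTU1 to edge, DTU2 to cloud)."""
    return (p_l * (s_lpu > 0)
            + p_tx1 * (s_dtu1 > 0)
            + p_tx2 * (s_dtu2 > 0))

def transition_energy(slots):
    """Eq. (40): total energy over the slots of one state transition.
    Each slot is an (s_lpu, s_dtu1, s_dtu2) tuple."""
    return sum(slot_energy(*slot) for slot in slots)
```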
<p>Therefore, the overall cost is:
<disp-formula id="eqn-41"><label>(41)</label><mml:math id="mml-eqn-41" display="block"><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> are weighting factors balancing delay and energy consumption.</p>
<p>To optimize task offloading strategies, the reward function is defined as the negative of the total cost:
<disp-formula id="eqn-42"><label>(42)</label><mml:math id="mml-eqn-42" display="block"><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>cost</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a scaling constant that adjusts the range of reward values.</p>
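<p>Combining Eqs. (41) and (42) gives a one-step reward in a few lines. The default weights below are illustrative assumptions; the scaling factor 2.0 matches the training parameters in Table 3.</p>

```python
def step_reward(delta_t, delta_e, alpha=0.5, beta=0.5, k_s=2.0):
    """Eq. (41): cost = alpha * delay + beta * energy;
    Eq. (42): reward is the negated, scaled cost (lower cost -> higher reward)."""
    cost = alpha * delta_t + beta * delta_e
    return -k_s * cost
```

Because the reward is the negative cost, a policy that maximizes reward simultaneously minimizes the weighted sum of delay and energy consumption.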
<p>Cumulative Reward Function. In the smart grid task offloading problem, the cumulative reward function evaluates the long-term performance of a scheduling strategy. Starting from an initial state <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi></mml:math></inline-formula>, the system interacts with the environment iteratively, selecting actions <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>A</mml:mi></mml:math></inline-formula>. This forms a Markov Decision Process (MDP), and the cumulative reward function is:
<disp-formula id="eqn-43"><label>(43)</label><mml:math id="mml-eqn-43" display="block"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x221E;</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the total reward obtained starting from state <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>&#x03B3;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is the discount factor balancing immediate and future rewards.</p>
<p>By maximizing <inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the task offloading strategy can be optimized to minimize delays and energy consumption while maintaining system efficiency. For &#x03B3; &#x003D; 1, the cumulative reward represents the sum of all delays and energy costs. Thus, finding the optimal strategy <inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:msup><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> aligns with minimizing the original objective function. However, due to the vast state space, we plan to utilize Deep Reinforcement Learning (DRL) for training.</p>
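<p>For a finite episode, the cumulative reward of Eq. (43) is computed efficiently backwards, so each step reuses the discounted tail sum. A minimal sketch (illustrative function name, not the authors' code):</p>

```python
def discounted_return(rewards, gamma=0.99):
    """Eq. (43) over a finite episode: G_m = sum_n gamma^n * R_{m+n}.
    Iterating from the last reward backwards avoids recomputing powers of gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```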
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>DRL-Based Task Offloading and Scheduling</title>
<p>This section presents a novel deep neural network (DNN) architecture trained with Proximal Policy Optimization (PPO). The DNN is tailored to extract complex patterns from high-dimensional data, while PPO updates the policy parameters in a way that carefully balances exploration and exploitation, thereby ensuring stable convergence. Comprehensive experiments show that the proposed method yields significant performance improvements over prevailing approaches.</p>
<sec id="s6_1">
<label>6.1</label>
<title> Network Architecture</title>
<p>As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the proposed DRL framework is designed to simultaneously approximate the task offloading policy and estimate the value function. To achieve this, we develop a parameter-sharing deep neural network (DNN) that approximates two objectives: the task offloading policy <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which selects the optimal offloading action, and the value function <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mi>v</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mi>&#x03C9;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which evaluates the advantage function to optimize the policy.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Neural network architecture</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-2.tif"/>
</fig>
<p>Because the input state is large and the data stored in task queue Q are structured, we employ a Convolutional Neural Network (CNN) for feature extraction. The experiments in Section 7 demonstrate that this architecture significantly improves training performance compared with using fully connected layers alone.</p>
</sec>
<sec id="s6_2">
<label>6.2</label>
<title> Training Algorithm</title>
<p>We use a shared-parameter DNN where the objective function combines errors from both the policy and value networks. To enhance sample efficiency and stabilize policy updates, we employ Generalized Advantage Estimation (GAE).</p>
<p>The policy network is optimized using the PPO Clipped Objective, while the value network minimizes the state-value error. The overall optimization objective is expressed as:
<disp-formula id="eqn-44"><label>(44)</label><mml:math id="mml-eqn-44" display="block"><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>PPO</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>L</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>CLIP</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>c</mml:mi><mml:msubsup><mml:mi>L</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mi>c</mml:mi></mml:math></inline-formula> is a loss coefficient used to balance the loss between the policy and value networks.</p>
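<p>A minimal numerical sketch of Eq. (44), using the clipping range 0.1 from Table 3. The value-loss coefficient c = 0.5 is an assumed default, and the function name is illustrative; the quantity returned is to be maximized.</p>

```python
import numpy as np

def ppo_objective(ratio, adv, v_pred, v_target, clip_eps=0.1, c=0.5):
    """Eq. (44): clipped policy surrogate minus c times the value error.
    ratio = pi_theta(a|s) / pi_old(a|s), elementwise over a batch."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = np.minimum(ratio * adv, clipped * adv)   # pessimistic (clipped) surrogate
    l_value = (v_pred - v_target) ** 2                # squared state-value error
    return float(np.mean(l_clip - c * l_value))
```

The `minimum` of the unclipped and clipped terms removes the incentive to push the ratio outside the trust region, which is what stabilizes PPO's policy updates.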
<p>As shown in Algorithm 1, the training process alternates between sampling and optimization phases. During the sampling phase, the old policy <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mrow><mml:mtext>old</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is used to generate N trajectories, each containing data from multiple time steps. At each time step <inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mi>n</mml:mi></mml:math></inline-formula>, an action is selected based on the current environment state <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the corresponding reward and the next state are recorded.</p>
<fig id="fig-10">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-10.tif"/>
</fig>
<p>To improve training efficiency, the Generalized Advantage Estimate <inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:msubsup><mml:mrow><mml:mover><mml:mi>A</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>GAE</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is precomputed for each trajectory and stored in two sets: T (Trajectories) and A (Advantages).</p>
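<p>The precomputed advantage estimates follow the standard GAE recursion. This sketch assumes the conventional form (delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)); the function name is illustrative and `values` must contain one more entry than `rewards` (the bootstrap value of the final state).</p>

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    A_t = sum_k (gamma*lam)^k * delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running   # reuse the discounted tail
        adv[t] = running
    return adv
```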
<p>During the optimization phase, the collected trajectory data are used to update the policy network. The parameters <inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> are optimized over multiple epochs using stochastic gradient ascent. The objective is to maximize the PPO loss function <inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:msup><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>PPO</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. In each epoch, the Adam optimizer is used to update the policy parameters. After optimization is completed, the updated parameters replace the old policy <inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:msub><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the data sets T and A are cleared to prepare for the next iteration.</p>
<p>During sampling, the exploration policy may choose invalid actions&#x2014;for example, executing tasks locally when the LPU is saturated. To prevent errors, a validity constraint mechanism is implemented: invalid actions are ignored, the current state is maintained, and a valid action is reselected, ensuring that optimization proceeds correctly.</p>
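<p>One common way to realize such a validity constraint is to mask invalid actions before sampling, so they receive zero probability and never need to be re-drawn. This is a hedged sketch of that idea (hypothetical names; the paper's mechanism instead ignores the invalid action and reselects, which is behaviorally equivalent for the agent):</p>

```python
import numpy as np

def sample_valid_action(logits, valid_mask, rng):
    """Force invalid actions to probability zero, then sample.
    `valid_mask` is a boolean array with at least one True entry."""
    masked = np.where(valid_mask, logits, -np.inf)       # invalid -> -inf logit
    z = np.exp(masked - masked[valid_mask].max())        # numerically stable softmax
    probs = z / z.sum()
    return int(rng.choice(len(logits), p=probs))
```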
</sec>
</sec>
<sec id="s7">
<label>7</label>
<title>Performance Evaluation</title>
<p>This section provides a comprehensive evaluation of the proposed PPO-based Offloading Strategy Method (PPO-OSM) through extensive simulation experiments. The algorithm and its neural architecture are implemented in TensorFlow. Key simulation settings and training hyper-parameters are summarized in <xref ref-type="table" rid="table-2">Tables 2</xref> and <xref ref-type="table" rid="table-3">3</xref> [<xref ref-type="bibr" rid="ref-30">30</xref>], respectively.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Simulation settings</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Length of time slot</td>
<td>0.01 s</td>
</tr>
<tr>
<td>LPU&#x2019;s CPU frequency <inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td>0.4 GHz</td>
</tr>
<tr>
<td>LPU power linear parameter <inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mi>&#x03BE;</mml:mi></mml:math></inline-formula></td>
<td>1.3 &#x00D7; 10<sup>&#x2212;26</sup></td>
</tr>
<tr>
<td>LPU power exponential parameter <inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:mi>&#x03BD;</mml:mi></mml:math></inline-formula></td>
<td>3</td>
</tr>
<tr>
<td>Cloud server&#x2019;s CPU frequency <inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>5 GHz</td>
</tr>
<tr>
<td>Wireless transmission power <inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td>2.5 W</td>
</tr>
<tr>
<td>Wireless transmission power <inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td>3.5 W</td>
</tr>
<tr>
<td>Size of task input data <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>[0.3, 3.5] MB</td>
</tr>
<tr>
<td>Computation-to-volume ratio <inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:msub><mml:mi>&#x03BA;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>[130, 3400] cycles/byte</td>
</tr>
<tr>
<td>Size of task queue Q</td>
<td>25</td>
</tr>
<tr>
<td>Edge transmission rate range <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>[1, 50] Mbps</td>
</tr>
<tr>
<td>Cloud transmission rate range <inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>[0.5, 60] Mbps</td>
</tr>
<tr>
<td>Task arrival rate <inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula></td>
<td>Dynamic (0.5 &#x00B1; variation)</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Training parameters</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clipping range</td>
<td>0.1</td>
<td>Optimization method</td>
<td>Adam</td>
</tr>
<tr>
<td>Entropy coefficient</td>
<td>0.05</td>
<td>Adv. discount factor <inline-formula id="ieqn-187"><mml:math id="mml-ieqn-187"><mml:mi>&#x03C6;</mml:mi></mml:math></inline-formula></td>
<td>0.95</td>
</tr>
<tr>
<td>Learning rate</td>
<td>0.003</td>
<td>Discount factor <inline-formula id="ieqn-188"><mml:math id="mml-ieqn-188"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula></td>
<td>0.99</td>
</tr>
<tr>
<td>Clipping range</td>
<td>0.1</td>
<td>Reward scaling factor <inline-formula id="ieqn-189"><mml:math id="mml-ieqn-189"><mml:mi>k</mml:mi></mml:math></inline-formula></td>
<td>2.0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We set the time step duration to 0.01 s, during which the system updates task scheduling and status at each interval. Parameters for the Local Processing Unit (LPU) are configured as listed in <xref ref-type="table" rid="table-2">Table 2</xref>. Thus, the local computational power consumption <inline-formula id="ieqn-190"><mml:math id="mml-ieqn-190"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is determined as <inline-formula id="ieqn-191"><mml:math id="mml-ieqn-191"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x03BE;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>&#x03BD;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. The power consumption for edge and cloud transmission is set as <inline-formula id="ieqn-192"><mml:math id="mml-ieqn-192"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> &#x003D; 2.5 W, <inline-formula id="ieqn-193"><mml:math id="mml-ieqn-193"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> &#x003D; 3.5 W, respectively. Each task&#x2019;s data size <inline-formula id="ieqn-194"><mml:math id="mml-ieqn-194"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and computational complexity <inline-formula id="ieqn-195"><mml:math id="mml-ieqn-195"><mml:msub><mml:mi>&#x03BA;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are sampled from uniform distributions defined in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
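<p>Plugging the Table 2 values into the power model shows the numeric LPU power the simulation implies (frequency converted to Hz):</p>

```python
def lpu_power(f_l=0.4e9, xi=1.3e-26, nu=3):
    """p^l = xi * (f^l)^nu with f^l = 0.4 GHz, xi = 1.3e-26, nu = 3 (Table 2).
    Yields roughly 0.832 W of local computational power."""
    return xi * f_l ** nu
```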

<p>In smart grid environments, transmission rates change dynamically over time due to variations in node distance and other factors. We use a sinusoidal model to represent these periodic rate fluctuations. Specifically, <inline-formula id="ieqn-196"><mml:math id="mml-ieqn-196"><mml:mi>R</mml:mi><mml:mn>1</mml:mn></mml:math></inline-formula> simulates communication between users and nearby edge nodes, while <inline-formula id="ieqn-197"><mml:math id="mml-ieqn-197"><mml:mi>R</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> models communication between terminal devices and the cloud. Although <inline-formula id="ieqn-198"><mml:math id="mml-ieqn-198"><mml:mi>R</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> generally provides higher bandwidth, its variability can be more pronounced, reflecting trade-offs between edge and cloud offloading.</p>
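<p>A sinusoidal rate model of the kind described above can be sketched as follows. The period and phase are illustrative assumptions (the paper does not specify them); the rate ranges for the edge and cloud links come from Table 2.</p>

```python
import math

def link_rate(t, r_min, r_max, period=100.0):
    """Hypothetical sinusoidal rate model: oscillates between r_min and r_max (Mbps).
    Table 2 gives r1 in [1, 50] Mbps (edge) and r2 in [0.5, 60] Mbps (cloud)."""
    mid = (r_min + r_max) / 2.0        # center of the rate range
    amp = (r_max - r_min) / 2.0        # oscillation amplitude
    return mid + amp * math.sin(2.0 * math.pi * t / period)
```

Sampling `link_rate(t, 0.5, 60.0)` for the cloud link produces larger swings than the edge link, reproducing the trade-off noted above: higher peak bandwidth but greater variability.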
<p>Our simulation integrates system parameters, including LPU configurations, edge-server (MEC) resources, and cloud-computing (CC) resources, into a cohesive environment. As task complexity increases, DRL-based scheduling strategies such as PPO-OSM adapt to changing conditions, learn optimal offloading decisions, and enhance overall task scheduling performance.</p>
<sec id="s7_1">
<label>7.1</label>
<title> Convergence Performance</title>
<p>To assess the efficacy of the proposed PPO-based Offloading Strategy Method (PPO-OSM), we conducted experiments under the conditions illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. PPO-OSM was benchmarked against a baseline PPO implementation that uses only fully connected (FC) layers; both algorithms were trained with identical hyper-parameters, and their learning curves were logged. As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, PPO-OSM consistently achieves higher cumulative rewards and converges more rapidly than the baseline.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Comparison of average reward over training epochs for PPO and PPO-OSM</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-3.tif"/>
</fig>
<p>Our experiments show that PPO-OSM significantly accelerates training, with cumulative rewards stabilizing after about 100 epochs. In contrast, the FC-based PPO algorithm converges more slowly and erratically; its cumulative rewards even drop around 500 epochs. PPO-OSM also improves both task offloading and strategy optimization, effectively balancing task delay against energy consumption. Overall, these results demonstrate that PPO-OSM is more efficient, stable, and robust than conventional FC-based PPO for optimizing task offloading in dynamic environments.</p>
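<p>The training stability discussed above stems in part from PPO&#x2019;s clipped surrogate objective, which the baseline and PPO-OSM share. A minimal pure-Python sketch of that objective follows; the clipping range of 0.2 is the common default and is an assumption here.</p>

```python
def ppo_clip_loss(ratios, advantages, eps=0.2):
    """Clipped PPO surrogate: mean(min(r*A, clip(r, 1-eps, 1+eps)*A)),
    where r is the new/old policy probability ratio and A the advantage
    estimate for the same sample."""
    def clip(r):
        return min(max(r, 1.0 - eps), 1.0 + eps)
    terms = [min(r * a, clip(r) * a) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Clipping caps the incentive for large policy updates: with eps = 0.2,
# a ratio of 2.0 contributes no more than 1.2 * A to the objective.
surrogate = ppo_clip_loss([0.5, 1.0, 1.5], [1.0, 1.0, -1.0])
```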
</sec>
<sec id="s7_2">
<label>7.2</label>
<title> Analysis of Performance under Different Biases</title>
<p>We assessed the proposed PPO-based offloading strategy (PPO-OSM) under a range of latency&#x2013;energy preference settings and benchmarked it against five representative baselines:
<list list-type="bullet">
<list-item>
<p>All-Local Execution (AL): every task is processed entirely on the device&#x2019;s local CPU.</p></list-item>
<list-item>
<p>All-Edge Offloading (AE): all tasks are offloaded to an edge server, regardless of wireless-channel conditions.</p></list-item>
<list-item>
<p>All-Cloud Offloading (AC): all tasks are transmitted straight to the cloud, ignoring backhaul and fronthaul constraints.</p></list-item>
<list-item>
<p>PPO: the standard Proximal Policy Optimization algorithm directly applied to the offloading decision problem, with no additional multi-objective shaping.</p></list-item>
<list-item>
<p>Genetic Algorithm (GA): a heuristic implemented with the DEAP library that evolves offloading decisions through selection, crossover, and mutation.</p></list-item>
</list></p>
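<p>The static baselines, and the weighted objective all strategies are scored against, can be sketched as below. The per-target delay and energy numbers are placeholders for illustration only; GA and PPO are omitted since they require search or learning loops.</p>

```python
LOCAL, EDGE, CLOUD = 0, 1, 2

# Static baselines ignore the system state entirely.
def all_local(state):
    return LOCAL

def all_edge(state):
    return EDGE

def all_cloud(state):
    return CLOUD

# Placeholder (delay s, energy J) per target -- illustrative numbers only.
COST_TABLE = {LOCAL: (0.9, 1.2), EDGE: (0.4, 0.6), CLOUD: (0.5, 0.8)}

def weighted_cost(target, alpha, beta=1.0):
    """Composite objective alpha * delay + beta * energy used in the
    evaluation (beta is fixed at 1 in the experiments)."""
    delay, energy = COST_TABLE[target]
    return alpha * delay + beta * energy
```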
<p>As illustrated in <xref ref-type="fig" rid="fig-4">Figs. 4</xref> and <xref ref-type="fig" rid="fig-5">5</xref>, which decompose the overall cost into latency and energy components, the six evaluated strategies behave markedly differently as the latency-weighting factor &#x03B1; increases (&#x03B2; is fixed at 1). All-Local execution (AL), in which every task is processed on the resource-constrained device, consistently yields the highest delay and energy consumption. All-Edge (AE) and All-Cloud (AC) offloading shorten latency slightly relative to AL, yet they remain energy-intensive and cannot adapt to wireless-channel fluctuations, so their curves cluster in the upper regions of both plots. The heuristic Genetic Algorithm (GA) reduces delay appreciably, especially when &#x03B1; &#x2264; 0.3, but achieves only moderate energy savings. PPO lowers energy consumption further through policy-gradient updates, although its average delay remains marginally higher than GA&#x2019;s. By contrast, PPO-OSM adapts its offloading policy online and therefore attains the lowest energy usage across all settings; moreover, once &#x03B1; reaches roughly 0.3, it also achieves the smallest delay among all schemes. These results demonstrate that PPO-OSM offers the most favorable latency&#x2013;energy trade-off in realistic smart-grid scenarios.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Comparison of average delay</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-4.tif"/>
</fig><fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Comparison of average energy</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-5.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> shows that the static schemes (AL, AE, and AC) incur the highest overall cost because they lack the flexibility to cope with a dynamic environment. In contrast, PPO-OSM, GA, and PPO strike a more favorable balance between computation and transmission expenses. Notably, by embedding a convolutional neural network (CNN) within its policy network, PPO-OSM not only reduces the average cost most substantially but also delivers superior stability and adaptability compared with GA and PPO.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparison of average cost</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-6.tif"/>
</fig>
</sec>
<sec id="s7_3">
<label>7.3</label>
<title> Performance Analysis in Dynamic Queue Scenarios</title>
<p>In the Dynamic Queue Scenario (DQS), we evaluated each task-offloading strategy as the task load increased incrementally, focusing on variations in average delay, energy consumption, and overall cost. We set &#x03B1; &#x003D; 0.4 and &#x03B2; &#x003D; 1 and simulated each algorithm under load factors ranging from 0.1 (low load) to 1.0 (high load), making the performance advantages of each strategy at different load levels directly observable.</p>
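<p>The load sweep can be reproduced in miniature with a single-server FIFO queue whose arrival rate is the load factor times its service rate; the service rate, horizon, and exponential timing below are assumptions for illustration, not the paper&#x2019;s exact queue model.</p>

```python
import random

def simulate_queue(load, service_rate=10.0, n_tasks=2000, seed=0):
    """Toy FIFO queue: tasks arrive at rate load * service_rate with
    exponential inter-arrival and service times; returns the average
    sojourn time (waiting + service) per task."""
    rng = random.Random(seed)
    arrival_rate = load * service_rate
    t = 0.0              # arrival clock
    server_free_at = 0.0
    total_sojourn = 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(arrival_rate)
        start = max(t, server_free_at)          # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        total_sojourn += server_free_at - t
    return total_sojourn / n_tasks
```

As the load factor approaches 1.0, the average sojourn time grows sharply, mirroring the steep latency curves of the static policies at high load.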
<p>As the workload intensity &#x03BB; increases (<xref ref-type="fig" rid="fig-7">Figs. 7</xref> and <xref ref-type="fig" rid="fig-8">8</xref>), all schemes experience higher latency, but the growth rates diverge: AL climbs most steeply, while AC and AE deteriorate once network congestion sets in. PPO keeps delay low with an almost linear trend, and PPO-OSM flattens the curve even further, achieving the smallest latency across the entire range. Energy consumption follows the same ordering: the static policies (AL, AE, AC) remain high and nearly flat, PPO cuts energy appreciably, and PPO-OSM delivers the lowest and most stable profile. Overall, PPO-OSM offers the best latency&#x2013;energy trade-off, with PPO serving as a strong adaptive baseline that consistently outperforms all fixed strategies.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Average delay under different workloads</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-7.tif"/>
</fig><fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Average energy under different workloads</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-8.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-9">Fig. 9</xref> plots the composite cost (latency plus energy) against workload intensity &#x03BB; for the five schemes. The three static policies (AL, AE, AC) exhibit the highest and steepest cost growth because they cannot adapt to changing conditions. Vanilla PPO lowers the curve substantially by continuously refining its offloading policy, yet PPO-OSM remains dominant, yielding the lowest cost across the entire workload range. Equipped with a CNN-enhanced state encoder and an &#x03B1;&#x2013;&#x03B2;-weighted objective, PPO-OSM reallocates tasks dynamically in real time, achieving superior multi-objective optimization in non-stationary edge environments.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Average cost under different workloads</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_65465-fig-9.tif"/>
</fig>
</sec>
</sec>
<sec id="s8">
<label>8</label>
<title>Conclusion</title>
<p>This paper presents a Proximal Policy Optimization-based Offloading Strategy Method (PPO-OSM) that allocates computational resources efficiently for task offloading in smart-grid environments. By formulating the offloading problem as a Markov decision process (MDP), the framework integrates deep reinforcement learning through a shared convolutional neural network and a clipped objective function, markedly improving training stability. Extensive simulations demonstrate that, under dynamic offloading conditions, PPO-OSM reduces both latency and energy consumption, outperforming conventional baseline algorithms and heuristic methods. Relative to static allocation strategies, it achieves a more favorable latency&#x2013;energy trade-off and exhibits superior adaptability and robustness, particularly at high load. These findings confirm the viability of deep reinforcement learning for task-offloading decisions and provide an efficient, flexible solution for real-time scheduling in smart grids, underscoring its potential for practical engineering deployment and broad adoption.</p>
</sec>
</body>
<back>
<ack>
<p>We sincerely thank everyone who supported this work, as well as the reviewing committee for their valuable feedback.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the National Natural Science Foundation of China (Grant No. 62103349) and the Henan Province Science and Technology Research Project (Grant No. 232102210104).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Ya Zhou, Qian Wang; data collection: Qian Wang; analysis and interpretation of results: Qian Wang; draft manuscript preparation: Qian Wang, Ya Zhou. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The datasets generated or analyzed during the current study are not publicly available due to privacy and confidentiality concerns, but are available from the corresponding author on reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Acarali</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chugh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Rao</surname> <given-names>KR</given-names></string-name>, <string-name><surname>Rajarajan</surname> <given-names>M</given-names></string-name></person-group>. <chapter-title>IoT deployment and management in the smart grid</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Ranjan</surname> <given-names>R</given-names></string-name>, <string-name><surname>Mitra</surname> <given-names>K</given-names></string-name>, <string-name><surname>Jayaraman</surname> <given-names>PP</given-names></string-name>, <string-name><surname>Zomaya</surname> <given-names>AY</given-names></string-name></person-group>, editors. <source>Managing Internet of Things applications across edge and cloud data centres</source>. <publisher-loc>London, UK</publisher-loc>: <publisher-name>The Institution of Engineering and Technology</publisher-name>; <year>2024</year>. p. <fpage>255</fpage>&#x2013;<lpage>75</lpage>. doi:<pub-id pub-id-type="doi">10.1049/PBPC027E_ch11</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Al-Bossly</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Metaheuristic optimization with deep learning enabled smart grid stability prediction</article-title>. <source>Comput Mater Contin</source>. <year>2023</year>;<volume>75</volume>(<issue>3</issue>):<fpage>6395</fpage>&#x2013;<lpage>408</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2023.028433</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ahmed</surname> <given-names>RA</given-names></string-name>, <string-name><surname>Abdelraouf</surname> <given-names>M</given-names></string-name>, <string-name><surname>Elsaid</surname> <given-names>SA</given-names></string-name>, <string-name><surname>ElAffendi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Abd El-Latif</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Shaalan</surname> <given-names>AA</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Internet of Things-based robust green smart grid</article-title>. <source>Comput</source>. <year>2024</year>;<volume>13</volume>(<issue>7</issue>):<fpage>169</fpage>. doi:<pub-id pub-id-type="doi">10.3390/computers13070169</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Aminifar</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Evolution in computing paradigms for Internet of Things-enabled smart grid applications</article-title>. In: <conf-name>Proceedings of the 2024 5th CPSSI International Symposium on Cyber-Physical Systems (Applications and Theory) (CPSAT); 2024 Oct 16&#x2013;17; Tehran, Iran</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/CPSAT64082.2024.10745414</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Arcas</surname> <given-names>GI</given-names></string-name>, <string-name><surname>Cioara</surname> <given-names>T</given-names></string-name>, <string-name><surname>Anghel</surname> <given-names>I</given-names></string-name>, <string-name><surname>Lazea</surname> <given-names>D</given-names></string-name>, <string-name><surname>Hangan</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Edge offloading in smart grid</article-title>. <comment>arXiv:2402.01664. 2024</comment>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>K</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>G</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Fusion-communication MEC offloading strategy for smart grid</article-title>. <source>Dianli Xinxi Yu Tongxin Jishu</source>. <year>2024</year>;<volume>22</volume>(<issue>6</issue>):<fpage>10</fpage>&#x2013;<lpage>7</lpage>. <comment>(In Chinese)</comment>. doi:<pub-id pub-id-type="doi">10.16543/j.2095-641X.electric.power.ict.2024.06.02</pub-id>. </mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Research status of mobile cloud computing offloading technology and its application in the power grid</article-title>. <source>Dianli Xinxi Yu Tongxin Jishu</source>. <year>2021</year>;<volume>19</volume>(<issue>1</issue>):<fpage>49</fpage>&#x2013;<lpage>56</lpage>. <comment>(In Chinese)</comment>. doi:<pub-id pub-id-type="doi">10.16543/j.2095-641X.electric.power.ict.2021.01.007</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Li</surname> <given-names>WJ</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>YM</given-names></string-name>, <string-name><surname>Nahar</surname> <given-names>N</given-names></string-name></person-group>. <article-title>A new task scheduling scheme based on genetic algorithm for edge computing</article-title>. <source>Comput Mater Contin</source>. <year>2022</year>;<volume>71</volume>(<issue>1</issue>):<fpage>843</fpage>&#x2013;<lpage>54</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2022.017504</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>X</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Research on edge computing-oriented resource-aware access and intelligent gateway technology for power transmission, transformation and distribution</article-title>. In: <conf-name>Proceedings of the 2023 International Conference on Applied Intelligence and Sustainable Computing (ICAISC); 2023 Jun 16&#x2013;17; Dharwad, India</conf-name>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICAISC58445.2023.10199983</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name></person-group>. <article-title>A novel distributed computing resource operation mechanism for edge computing</article-title>. In: <conf-name>Proceedings of the 2023 9th International Conference on Computer and Communications (ICCC); 2023 Dec 8&#x2013;11; Chengdu, China</conf-name>. p. <fpage>2593</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCC59590.2023.10507521</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Abbas</surname> <given-names>K</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>R</given-names></string-name>, <string-name><surname>Kamruzzaman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Rutkowski</surname> <given-names>L</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Task offloading strategies for mobile edge computing: a survey</article-title>. <source>Comput Netw</source>. <year>2024</year>;<volume>254</volume>(<issue>6</issue>):<fpage>110791</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.comnet.2024.110791</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kwon</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>YK</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Adaptive real-time offloading decision-making for mobile edges: deep reinforcement learning framework and simulation results</article-title>. <source>Appl Sci</source>. <year>2020</year>;<volume>10</volume>(<issue>5</issue>):<fpage>1663</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app10051663</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Peng</surname> <given-names>P</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Q</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A survey on computation offloading in edge systems: from the perspective of deep reinforcement learning approaches</article-title>. <source>Comput Sci Rev</source>. <year>2024</year>;<volume>53</volume>(<issue>5</issue>):<fpage>100656</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cosrev.2024.100656</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>L</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>C</given-names></string-name></person-group>. <chapter-title>Research progress and prospects of deep reinforcement learning in the field of mobile edge computing</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Ning</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>Z</given-names></string-name></person-group>, editors. <source>Proceedings of the Fifth International Conference on Computer Communication and Network Security (CCNS 2024); 2024 May 3&#x2013;5; Guangzhou, China</source>. p. <fpage>1322813</fpage>. doi:<pub-id pub-id-type="doi">10.1117/12.3038174</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>DRL-based optimization of privacy protection and computation performance in MEC computation offloading</article-title>. In: <conf-name>IEEE INFOCOM 2022&#x2014;IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS); 2022 May 2&#x2013;5; Online</conf-name>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/INFOCOMWKSHPS54753.2022.9797993</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alfa</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Maharaj</surname> <given-names>BT</given-names></string-name>, <string-name><surname>Lall</surname> <given-names>S</given-names></string-name>, <string-name><surname>Pal</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Resource allocation techniques in underlay cognitive radio networks based on mixed-integer programming: a survey</article-title>. <source>J Commun Netw</source>. <year>2016</year>;<volume>18</volume>(<issue>5</issue>):<fpage>744</fpage>&#x2013;<lpage>61</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JCN.2016.000104</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>F</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zou</surname> <given-names>W</given-names></string-name></person-group>. <article-title>A greedy algorithm for task offloading in mobile edge computing system</article-title>. <source>China Commun</source>. <year>2018</year>;<volume>15</volume>(<issue>11</issue>):<fpage>149</fpage>&#x2013;<lpage>57</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CC.2018.8543056</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Umair</surname> <given-names>M</given-names></string-name>, <string-name><surname>Saeed</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Saeed</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ishtiaq</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zubair</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hameed</surname> <given-names>HA</given-names></string-name></person-group>. <article-title>Energy theft detection in smart grids with genetic algorithm-based feature selection</article-title>. <source>Comput Mater Contin</source>. <year>2023</year>;<volume>74</volume>(<issue>3</issue>):<fpage>5431</fpage>&#x2013;<lpage>46</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2023.033884</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>K</given-names></string-name></person-group>. <article-title>DRL-based latency-energy offloading optimization strategy in wireless VR networks with edge computing</article-title>. <source>Comput Netw</source>. <year>2025</year>;<volume>258</volume>:<fpage>111034</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.comnet.2025.111034</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Parameterized deep reinforcement learning with hybrid action space for edge task offloading</article-title>. <source>IEEE Internet Things J</source>. <year>2024</year>;<volume>11</volume>(<issue>6</issue>):<fpage>10754</fpage>&#x2013;<lpage>10767</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JIOT.2023.3327121</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>P</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Energy-efficient offloading based on hybrid bio-inspired algorithm for edge-cloud integrated computation</article-title>. <source>Sustain Comput Inform Syst</source>. <year>2024</year>;<volume>42</volume>(<issue>11</issue>):<fpage>100972</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.suscom.2024.100972</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Long</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Mobile edge computing task offloading method for the power Internet of Things</article-title>. In: <conf-name>Proceedings of the 2024 IEEE 7th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC); 2024 Sep 20&#x2013;22; Chongqing, China</conf-name>. p. <fpage>118</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ITNEC60942.2024.10733102</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cui</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>C</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Quota matching-based task offloading for WSN in smart grid</article-title>. In: <conf-name>Proceedings of the 2022 7th International Conference on Electronic Technology and Information Science (ICETIS 2022); 2022 Jan 21&#x2013;23; Harbin, China</conf-name>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Ni</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Deep reinforcement learning for task offloading in edge computing assisted power IoT</article-title>. <source>IEEE Access</source>. <year>2021</year>;<volume>9</volume>:<fpage>93892</fpage>&#x2013;<lpage>901</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2021.3092381</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Su</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Joint optimization of computing offloading and service caching in edge computing-based smart grid</article-title>. <source>IEEE Trans Cloud Comput</source>. <year>2023</year>;<volume>11</volume>(<issue>2</issue>):<fpage>1122</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCC.2022.3163750</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nimkar</surname> <given-names>S</given-names></string-name>, <string-name><surname>Khanapurkar</surname> <given-names>MM</given-names></string-name></person-group>. <article-title>Design of a Q-learning based smart grid and smart water scheduling model based on heterogeneous task specific offloading process</article-title>. In: <conf-name>Proceedings of the 2022 International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON); 2022 Dec 23&#x2013;25; Bangalore, India</conf-name>. p. <fpage>1</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/SMARTGENCON56628.2022.10084189</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>W</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>P</given-names></string-name>, <string-name><surname>Letaief</surname> <given-names>KB</given-names></string-name></person-group>. <article-title>Collaborative task offloading and resource allocation in small-cell MEC: a multi-agent PPO-based scheme</article-title>. <source>IEEE Trans Mob Comput</source>. <year>2025</year>;<volume>24</volume>(<issue>3</issue>):<fpage>2346</fpage>&#x2013;<lpage>59</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMC.2024.3496536</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustafa</surname> <given-names>E</given-names></string-name>, <string-name><surname>Shuja</surname> <given-names>J</given-names></string-name>, <string-name><surname>Rehman</surname> <given-names>F</given-names></string-name>, <string-name><surname>Namoun</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bilal</surname> <given-names>M</given-names></string-name>, <string-name><surname>Iqbal</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Computation offloading in vehicular communications using PPO-based deep reinforcement learning</article-title>. <source>J Supercomput</source>. <year>2025</year>;<volume>81</volume>(<issue>4</issue>):<fpage>547</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s11227-025-07009-z</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Goudarzi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Palaniswami</surname> <given-names>M</given-names></string-name>, <string-name><surname>Buyya</surname> <given-names>R</given-names></string-name></person-group>. <article-title>A distributed deep reinforcement learning technique for application placement in edge and fog computing environments</article-title>. <source>IEEE Trans Mob Comput</source>. <year>2023</year>;<volume>22</volume>(<issue>5</issue>):<fpage>2491</fpage>&#x2013;<lpage>505</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMC.2021.3123165</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dinh</surname> <given-names>TQ</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name>, <string-name><surname>La</surname> <given-names>QD</given-names></string-name>, <string-name><surname>Quek</surname> <given-names>TQS</given-names></string-name></person-group>. <article-title>Offloading in mobile edge computing: task allocation and computational frequency scaling</article-title>. <source>IEEE Trans Commun</source>. <year>2017</year>;<volume>65</volume>(<issue>8</issue>):<fpage>3571</fpage>&#x2013;<lpage>84</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCOMM.2017.2699660</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>