<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">74034</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.074034</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Unlocking Edge Fine-Tuning: A Sample-Efficient Language-Empowered Split Fine-Tuning Framework</article-title>
<alt-title alt-title-type="left-running-head">Unlocking Edge Fine-Tuning: A Sample-Efficient Language-Empowered Split Fine-Tuning Framework</alt-title>
<alt-title alt-title-type="right-running-head">Unlocking Edge Fine-Tuning: A Sample-Efficient Language-Empowered Split Fine-Tuning Framework</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Huang</surname><given-names>Zuyi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Yue</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Liu</surname><given-names>Jia</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Yi</surname><given-names>Haodong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Ai</surname><given-names>Lejun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Chen</surname><given-names>Min</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-3">3</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>minchen@ieee.org</email></contrib>
<contrib id="author-7" contrib-type="author">
<name name-style="western"><surname>AlQahtani</surname><given-names>Salman A.</given-names></name><xref ref-type="aff" rid="aff-4">4</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Computer Science and Engineering, South China University of Technology</institution>, <addr-line>Guangzhou, 510006</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Computer Science and Technology, Huazhong University of Science and Technology</institution>, <addr-line>Wuhan, 430074</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Pazhou Laboratory</institution>, <addr-line>Guangzhou, 510640</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University</institution>, <addr-line>Riyadh, 11574</addr-line>, <country>Saudi Arabia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Min Chen. Email: <email>minchen@ieee.org</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>66</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>09</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>12</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_74034.pdf"></self-uri>
<abstract>
<p>The personalized fine-tuning of large language models (LLMs) on edge devices is severely constrained by limited computational resources. Although split federated learning alleviates on-device burdens, its effectiveness diminishes in few-shot reasoning scenarios due to the low data efficiency of conventional supervised fine-tuning, which leads to excessive communication overhead. To address this, we propose Language-Empowered Split Fine-Tuning (LESFT), a framework that integrates split architectures with a contrastive-inspired fine-tuning paradigm. LESFT simultaneously learns from multiple logically equivalent but linguistically diverse reasoning chains, providing richer supervisory signals and improving data efficiency. This process-oriented training allows more effective reasoning adaptation with fewer samples. Extensive experiments demonstrate that LESFT consistently outperforms strong baselines such as SplitLoRA on GSM8K, CommonsenseQA, and AQUA_RAT, with the largest gains observed on Qwen2.5-3B. These results indicate that LESFT can effectively adapt large language models for reasoning tasks under the computational and communication constraints of edge environments.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Large language models</kwd>
<kwd>edge computing</kwd>
<kwd>efficient fine-tuning</kwd>
<kwd>few-shot fine-tuning</kwd>
<kwd>split federated learning</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China (NSFC)</funding-source>
<award-id>62276109</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Deanship of Scientific Research at King Saud University</funding-source>
<award-id>ORF-2025-585</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Recent breakthroughs in large language models (LLMs) are converging with the rapid advancement of edge computing. This convergence gives rise to a critical challenge: how to achieve personalized deployment and efficient fine-tuning of models on edge devices. These devices are severely constrained by computational power, storage, and energy consumption [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>]. Compared to traditional cloud-based inference, edge-side applications can significantly reduce communication overhead and service latency caused by remote calls, while demonstrating unique advantages in user privacy protection and continuously available intelligent services [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>]. Edge large models are especially valuable in complex reasoning scenarios, including autonomous driving, industrial inspection, and medical diagnosis [<xref ref-type="bibr" rid="ref-7">7</xref>&#x2013;<xref ref-type="bibr" rid="ref-10">10</xref>]. However, existing models often scale to billions or even trillions of parameters (e.g., GPT-3 [<xref ref-type="bibr" rid="ref-11">11</xref>], LLaMA [<xref ref-type="bibr" rid="ref-12">12</xref>]), far exceeding the computational and storage capacities of edge devices. This gap between model scale and available resources makes the effective migration and adaptation of large model inference capabilities to the edge a long-standing critical challenge for both academia and industry.</p>
<p>To address this challenge, researchers have explored multiple technical directions. Among them, Split Learning and its federated extension, Split Federated Learning (SFL), have been widely recognized as representative paradigms for overcoming the bottlenecks of edge deployment [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>]. More recently, approaches that combine the idea of model splitting with parameter-efficient fine-tuning (PEFT) have emerged. For example, SplitLoRA has been proposed to enable lightweight adaptation within split architectures [<xref ref-type="bibr" rid="ref-15">15</xref>]. These methods split the model into front and back segments. The edge device processes the front part, while the server handles the remaining layers. This design reduces the client-side computation and memory burden, making split architectures highly suitable for edge intelligence. However, in few-shot and complex reasoning tasks, the effectiveness of split architectures is limited because they still rely on conventional supervised fine-tuning (SFT). Prior studies have shown that SFT and instruction tuning often exhibit low data efficiency in acquiring complex logical induction and reasoning capabilities, typically depending on large-scale annotations or human feedback data [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>]. In split or federated settings, such low sample efficiency implies the need for more training samples and more frequent communication rounds to accomplish fine-tuning, thereby directly increasing the overall communication cost and undermining the advantages of split architectures for edge deployment. Consequently, the root cause of this limitation does not lie in the architecture itself, but rather in its incompatibility with the supervised fine-tuning paradigm under few-shot reasoning scenarios. 
Therefore, achieving efficient few-shot fine-tuning in edge environments remains a core open problem that requires novel methodological advances.</p>
<p>To address this problem, we propose Language-Empowered Split Fine-Tuning (LESFT). By coordinating an efficient split federated architecture with an advanced learning paradigm, LESFT provides a practical solution for personalized fine-tuning of LLMs on edge devices under the dual constraints of limited resources and scarce data. The proposed framework not only inherits the advantages of split federated architectures in alleviating computational burdens but also introduces paradigm-level innovations that significantly enhance overall resource efficiency by improving data utilization in complex and few-shot reasoning scenarios.</p>
<p>The main contributions of this work can be summarized as follows:
<list list-type="bullet">
<list-item>
<p><bold>Problem Identification.</bold> We systematically analyze existing SFL frameworks and point out a previously overlooked limitation: their reliance on standard supervised fine-tuning leads to poor data efficiency in few-shot reasoning tasks. This limitation inherently increases communication overhead and restricts the applicability of SFL in resource-constrained edge environments.</p></list-item>
<list-item>
<p><bold>Framework Design.</bold> We propose LESFT, a new split federated fine-tuning framework that integrates a dual-path contrastive learning paradigm with the split architecture. LESFT leverages logically consistent but linguistically diverse reasoning chains to guide the model toward learning generalizable reasoning patterns, rather than memorizing specific linguistic forms, thus significantly improving few-shot learning effectiveness.</p></list-item>
<list-item>
<p><bold>Comprehensive Empirical Validation.</bold> We conduct a thorough empirical study across multiple reasoning domains rather than relying solely on mathematical reasoning tasks. LESFT is evaluated on three representative benchmarks: GSM8K [<xref ref-type="bibr" rid="ref-19">19</xref>] for arithmetic reasoning, CommonsenseQA [<xref ref-type="bibr" rid="ref-20">20</xref>] for commonsense multiple-choice reasoning, and AQUA_RAT [<xref ref-type="bibr" rid="ref-21">21</xref>] for algebraic word-problem reasoning. Experiments on multiple LLM scales consistently show that LESFT achieves substantial improvements over advanced baselines such as SplitLoRA. Notably, on the Qwen2.5-3B model [<xref ref-type="bibr" rid="ref-22">22</xref>], LESFT reaches 76.04% accuracy on GSM8K and yields consistent gains on CommonsenseQA and AQUA_RAT, demonstrating strong generalization across heterogeneous reasoning tasks in edge deployment scenarios.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Parameter-Efficient Fine-Tuning of Large Language Models</title>
<p>PEFT has emerged as a mainstream paradigm for adapting pre-trained LLMs to new tasks with minimal cost, preserving their strong generalization capabilities [<xref ref-type="bibr" rid="ref-23">23</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>]. The core idea is to freeze the vast majority of the LLM&#x2019;s parameters and update only a small, manageable subset.</p>
<p>Prominent PEFT techniques achieve this in various ways. For instance, Adapter Tuning inserts small, trainable modules between existing Transformer layers [<xref ref-type="bibr" rid="ref-25">25</xref>,<xref ref-type="bibr" rid="ref-26">26</xref>], while Prefix-Tuning prepends trainable vectors to the input to steer the model&#x2019;s attention [<xref ref-type="bibr" rid="ref-27">27</xref>]. Other methods, like BitFit, fine-tune an extremely small fraction of existing parameters, such as the bias terms alone, yet achieve competitive performance [<xref ref-type="bibr" rid="ref-28">28</xref>]. Among the most widely adopted methods is Low-Rank Adaptation (LoRA) [<xref ref-type="bibr" rid="ref-29">29</xref>], which utilizes a low-rank approximation for weight updates to ensure high parameter efficiency.</p>
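<p>To make the parameter-selection idea concrete, the following minimal sketch (our illustration, not code from the cited works) applies a BitFit-style rule: given a model's parameter names, only the bias terms are kept trainable. The parameter names are hypothetical stand-ins for what a framework call such as named_parameters() would report.</p>

```python
def select_bitfit_params(param_names):
    """Return the subset of parameter names a BitFit-style rule would train."""
    return [name for name in param_names if name.endswith(".bias")]

# Hypothetical parameter names, standing in for a real model's parameter list.
params = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.q_proj.bias",
    "layers.0.mlp.fc1.weight",
    "layers.0.mlp.fc1.bias",
    "lm_head.weight",
]
trainable = select_bitfit_params(params)
print(trainable)  # only the two bias entries remain trainable
```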
<p>While these methods effectively reduce the number of trainable parameters, they fail to resolve the critical memory bottleneck during on-device training. This is because backpropagation still requires loading the entire model and storing intermediate activations, resulting in a memory footprint as high as 70% of full fine-tuning [<xref ref-type="bibr" rid="ref-30">30</xref>,<xref ref-type="bibr" rid="ref-31">31</xref>]. This memory requirement is prohibitive for most edge devices, revealing a critical gap: parameter efficiency does not equate to training efficiency. This fundamental limitation motivates architectural-level solutions. Therefore, our proposed LESFT framework leverages model splitting to overcome the memory barrier that traditional PEFT cannot, enabling efficient learning at the edge.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Federated and Split Learning for LLMs</title>
<p>To fine-tune LLMs on distributed edge data while preserving privacy, research has centered on two main approaches: Federated Learning (FL) and Split Learning (SL).</p>
<p>FL is a key paradigm for edge AI [<xref ref-type="bibr" rid="ref-32">32</xref>&#x2013;<xref ref-type="bibr" rid="ref-34">34</xref>], but applying it to LLMs faces prohibitive communication and computation costs. A solution is Federated PEFT [<xref ref-type="bibr" rid="ref-35">35</xref>], where clients train and aggregate only lightweight modules. For example, FedLoRA requires exchanging only the LoRA adapters [<xref ref-type="bibr" rid="ref-36">36</xref>]. While this reduces communication, it fails to solve the local training bottleneck, as clients must still load the entire LLM. In contrast, SL directly tackles the local resource issue by partitioning the model. This property makes SL an attractive option for deploying LLMs on resource-constrained devices [<xref ref-type="bibr" rid="ref-37">37</xref>]. A natural extension is to combine SL with PEFT in methods like SplitLoRA, which mitigates the local bottleneck while keeping communication low.</p>
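<p>The adapter-only exchange underlying methods such as FedLoRA can be sketched as follows. This is a minimal illustration, not the papers' implementation: plain Python lists stand in for tensors and the adapter names are hypothetical. Each client uploads only its small low-rank matrices, and the aggregator averages them element-wise while the frozen base weights never move.</p>

```python
def average_adapters(client_adapters):
    """Element-wise average of per-client LoRA adapter dictionaries."""
    n = len(client_adapters)
    return {
        key: [sum(c[key][j] for c in client_adapters) / n
              for j in range(len(client_adapters[0][key]))]
        for key in client_adapters[0]
    }

# Two hypothetical clients; only these small adapter tensors are exchanged.
clients = [
    {"layer0.lora_U": [1.0, 2.0], "layer0.lora_V": [0.0, 4.0]},
    {"layer0.lora_U": [3.0, 2.0], "layer0.lora_V": [2.0, 0.0]},
]
global_update = average_adapters(clients)
print(global_update)
```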
<p>The evolution from FL to SplitLoRA has progressively solved system-level bottlenecks related to communication, memory, and computation. However, these methods share a deeper limitation: their universal reliance on the conventional SFT paradigm. SFT is notoriously inefficient for complex reasoning tasks, often requiring extensive training samples and communication rounds for convergence. With system-level hurdles now largely addressed, this paradigm-level inefficiency emerges as the next critical barrier. Our work, the LESFT framework, is motivated by the need to overcome this very challenge by introducing a more data-efficient learning paradigm for LLMs at the edge.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Token-Level Fine-Tuning</title>
<p>Conventional response-level fine-tuning, or outcome-supervised learning, provides only sparse feedback from the final output, creating severe credit-assignment challenges in complex reasoning tasks [<xref ref-type="bibr" rid="ref-38">38</xref>]. The model often fails to identify which reasoning steps are correct, and may even reach the right answer through flawed reasoning, a weakness that outcome-level supervision reinforces. In contrast, process supervision offers feedback on intermediate steps, yielding more reliable reasoning capabilities [<xref ref-type="bibr" rid="ref-39">39</xref>].</p>
<p>Building on this idea, token-level fine-tuning further refines supervision granularity by directly optimizing generated tokens. This enables more precise and stable learning and has been widely applied in SFT. However, studies show that even in high-quality datasets, many tokens are redundant or detrimental [<xref ref-type="bibr" rid="ref-40">40</xref>], diluting gradients and hindering performance. To address this, selective token optimization strategies estimate token contributions or use perplexity-based weighting to mask or down-weight non-informative tokens, thereby improving the signal-to-noise ratio and robustness [<xref ref-type="bibr" rid="ref-40">40</xref>,<xref ref-type="bibr" rid="ref-41">41</xref>].</p>
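<p>As a minimal illustration of selective token optimization (our sketch, not the exact scheme of the cited works), the function below down-weights tokens whose per-token loss is far above the sequence average, treating them as likely noise. The twice-the-average threshold and the 0.1 down-weight are illustrative choices.</p>

```python
def weighted_token_loss(token_losses, down_weight=0.1):
    """Mean loss with outlier (likely noisy) tokens down-weighted."""
    avg = sum(token_losses) / len(token_losses)
    # Tokens whose loss exceeds twice the sequence average are treated as noise.
    weights = [down_weight if loss_value > 2 * avg else 1.0
               for loss_value in token_losses]
    total = sum(w * l for w, l in zip(weights, token_losses))
    return total / sum(weights)

losses = [0.5, 0.4, 0.6, 5.0]  # the last token is an outlier
print(weighted_token_loss(losses))
```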
<p>Token-level techniques have also been extended to preference alignment. While Direct Preference Optimization (DPO) simplifies training, it ignores token-level variation in preference signals [<xref ref-type="bibr" rid="ref-42">42</xref>,<xref ref-type="bibr" rid="ref-43">43</xref>]. Token-Level DPO (TDPO) addresses this by modeling alignment as a token-wise Markov Decision Process, improving credit assignment and better matching the autoregressive nature of LLMs [<xref ref-type="bibr" rid="ref-44">44</xref>]. Further, Reinforced Token Optimization (RTO) integrates DPO with PPO, using dense token-level rewards from preference data to enhance policy learning and performance [<xref ref-type="bibr" rid="ref-45">45</xref>]. These advances highlight the effectiveness of token-level methods in improving accuracy and consistency for complex reasoning tasks.</p>
<p>Complementary to weighting and alignment approaches, Natural Language Fine-Tuning (NLFT) [<xref ref-type="bibr" rid="ref-41">41</xref>] strengthens reasoning by contrastively training on correct and incorrect reasoning chains. In this work, we adopt and extend this paradigm for few-shot reasoning in edge environments, simplifying the process to rely only on correct reasoning chains. This positive-sample strategy allows more efficient modeling of high-quality reasoning paths under limited data, improving both robustness and adaptability of LLMs.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>The Proposed LESFT Framework</title>
<p>To address the computational and data constraints of fine-tuning LLMs on edge devices, we propose the LESFT framework. By integrating model splitting with PEFT methods, the framework reduces the computational burden on the edge side while enhancing training efficiency in few-shot scenarios. This section details the overall architecture, key designs, and training algorithm of LESFT.</p>
<sec id="s3_1">
<label>3.1</label>
<title>LLM Split Framework</title>
<p>As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, LESFT splits a traditional LLM architecture into three parts: an embedding layer, Transformer blocks, and a language modeling head. During inference, the input text is converted into a sequence of token IDs by the client-side tokenizer, and the embedding layer maps these IDs to continuous representations. The intermediate Transformer layers are deployed on the server to perform feature transformation, and finally, the language modeling head outputs a sequence of tokens in an auto-regressive manner.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Schematic of the LLM split architecture</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-1.tif"/>
</fig>
<p>To alleviate the memory bottleneck on edge devices, LESFT deploys the majority of its parameters in the cloud. The client retains only the embedding layer and some lightweight components. Specifically, the input text is tokenized and embedded locally, and the resulting representation is sent to the server as a high-dimensional tensor. The server then performs the forward pass to output tokens. Since the server does not have a tokenizer, it cannot access the raw data. This mechanism structurally prevents raw data from leaving the local device, thereby ensuring user privacy while achieving computational offloading.</p>
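<p>The client-server handoff described above can be sketched in a few lines. Everything here is a toy stand-in (a three-word vocabulary and a doubling function in place of the real embedding table and Transformer blocks); it only illustrates that the server receives continuous activations, never raw text or token IDs.</p>

```python
VOCAB = {"edge": 0, "fine": 1, "tuning": 2}   # toy tokenizer table (client-only)
EMBED = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 3x2 embedding table

def client_embed(text):
    """Client side: tokenizer plus embedding layer; raw text never leaves."""
    token_ids = [VOCAB[word] for word in text.split()]
    return [EMBED[t] for t in token_ids]      # continuous activations only

def server_forward(activations):
    """Server side: stand-in for the heavy Transformer blocks."""
    return [[2.0 * x for x in vec] for vec in activations]

activations = client_embed("edge fine tuning")  # computed on the device
output = server_forward(activations)            # server sees only tensors
print(output)
```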
<p>To further reduce the training overhead on the client, our framework introduces the LoRA technique. We partition the overall parameters into pre-trained base parameters, <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, and lightweight, trainable adaptation parameters, <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula>. LoRA assumes that the parameter update <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>W</mml:mi></mml:math></inline-formula> during fine-tuning has a low intrinsic rank and can be decomposed into the product of two low-rank matrices, <italic>U</italic> and <italic>V</italic>. Therefore, only these low-rank parameters need to be updated during training, significantly reducing the memory and computational resources required by the client.</p>
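<p>The low-rank update can be made concrete with a small worked example (our sketch; the symbols U and V follow the notation above, while the dimensions are illustrative). For a d x d weight matrix, LoRA trains U of shape d x r and V of shape r x d, so only 2dr values are updated instead of all d*d entries, and the frozen base weight is combined with the product U V.</p>

```python
def matmul(A, B):
    """Dense matrix product over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 4, 1                                   # illustrative dimensions
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
U = [[0.5], [0.0], [0.0], [0.0]]              # d x r, trainable
V = [[0.0, 1.0, 0.0, 0.0]]                    # r x d, trainable
delta = matmul(U, V)                          # rank-r update: Delta W = U V
W = [[W0[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

full_params = d * d                           # entries in a dense update
lora_params = 2 * d * r                       # entries LoRA actually trains
print(lora_params, full_params)               # 8 16; the gap grows with d
```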
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Problem Formulation</title>
<p>In this section, we formalize the problem setting of the LESFT framework, whose training procedure is detailed in <xref ref-type="sec" rid="s3_3">Section 3.3</xref>. The aim is to describe the system model of LESFT, laying the groundwork for the subsequent sections. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, we consider a typical scenario in which edge users with limited computational resources fine-tune a model for complex reasoning tasks. This system comprises three core components:</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The training workflow of LESFT</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-2.tif"/>
</fig>
<p><list list-type="bullet">
<list-item>
<p><bold>Client:</bold> In this framework, a client is abstracted as an edge node with independent computing capabilities. We denote the set of all participating clients as <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <italic>N</italic> is the total number of clients. For any client <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:math></inline-formula>, its local private dataset is denoted as <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo 
stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> is the total number of samples. In this context, <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the user&#x2019;s prompt input, while <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the corresponding desired output from the model, which contains the Chain-of-Thought (CoT) reasoning [<xref ref-type="bibr" rid="ref-46">46</xref>]. <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a reference input formed by combining the user&#x2019;s prompt with the ground-truth answer; its purpose is to provide a constraining signal in contrastive learning, thereby helping the model generate the target reasoning <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Meanwhile, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is identical to <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. We assume that each edge node can independently perform local model forward propagation and backward propagation operations. The local model weights on each client are denoted as <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, within which only a set of trainable LoRA adapter parameters is designated as the trainable part. This parameter set is denoted as <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <inline-formula 
id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of trainable LoRA adapters on the client side, and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mi>U</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> are the low-rank decomposition matrices of the <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>n</mml:mi></mml:math></inline-formula>-th LoRA adapter on the edge device <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>i</mml:mi></mml:math></inline-formula>.</p></list-item>
<list-item>
<p><bold>Central Server:</bold> The central server is typically a more powerful computing node responsible for managing and updating the parameters of the server-side sub-model. We represent the base model weights on the server as <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and its trainable part consists of a set of LoRA adapters, denoted as <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>. 
Here, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the number of adapters on the server side, and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> correspond to the low-rank decomposition matrices of the <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>m</mml:mi></mml:math></inline-formula>-th adapter. By adjusting <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:math></inline-formula>, the central server can efficiently adapt to different tasks without altering the underlying pre-trained model <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p></list-item>
<list-item>
<p><bold>Aggregation Server:</bold> During the training process, an aggregation entity is required to coordinate updates from all clients. This aggregation server periodically collects and aggregates the set of adapter parameters <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> uploaded by each client, enabling the sharing and transfer of global knowledge while preserving data privacy. For security considerations, the aggregation server and the central server are often maintained by different organizations to prevent the leakage of raw user data resulting from potential malicious attacks.</p></list-item>
</list></p>
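As a purely illustrative sketch (not part of the formal system model), the three parties described above can be represented as follows; all class and field names are our own shorthand, and adapter parameters are abstracted as plain lists:

```python
# Illustrative sketch of the LESFT system model. Each client holds frozen
# base weights W_c plus its own trainable LoRA adapters Theta_i; the central
# server holds frozen W_s plus shared adapters Phi; the aggregation server
# only ever receives adapter parameters, never raw user data.
from dataclasses import dataclass, field


@dataclass
class Client:
    W_c: list    # frozen client-side base weights
    Theta: list  # trainable LoRA adapters {(U_i^(n), V_i^(n))}, n = 1..M_c


@dataclass
class CentralServer:
    W_s: list    # frozen server-side base weights
    Phi: list    # trainable LoRA adapters {(U^(m), V^(m))}, m = 1..M_s


@dataclass
class AggregationServer:
    collected: list = field(default_factory=list)  # uploaded Theta_i sets

    def collect(self, theta):
        # Only adapter parameters are uploaded, preserving data privacy.
        self.collected.append(theta)
```

This separation mirrors the security consideration in the text: the aggregation server sees adapters but never the base model or raw data.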
<p>We represent the overall model parameters as <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>W</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, which consists of the server-side parameters <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>W</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> and the client-side parameters <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula>. The training objective is to learn the optimal client adapter parameters <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and server-side adapter parameters <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:math></inline-formula> by minimizing the weighted average of local losses across all clients. This optimization process can be formulated as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:munder><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:mrow></mml:munder><mml:mtext>&#x00A0;</mml:mtext><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mspace width="thinmathspace" /><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:mi>W</mml:mi><mml:mo>;</mml:mo><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-35"><mml:math 
id="mml-ieqn-35"><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x22C3;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the union of all local datasets. The term <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the loss for client <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>i</mml:mi></mml:math></inline-formula> on its local data <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, computed by the model configured with the base parameters <italic>W</italic> and the adapters <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi mathvariant="normal">&#x03A6;</mml:mi></mml:math></inline-formula>. The weighted aggregation provides a unified global objective that integrates updates from multiple clients.</p>
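The weighted objective in Eq. (1) can be sketched numerically as follows; `global_objective`, `client_losses`, and `dataset_sizes` are illustrative names, with each client weighted by its dataset fraction:

```python
# Sketch of the global objective in Eq. (1): the weighted average of
# per-client losses L_i, where client i contributes |D_i| / |D|.
def global_objective(client_losses, dataset_sizes):
    """Weighted average of local losses L_i with weights |D_i| / |D|."""
    total = sum(dataset_sizes)  # |D|, the union of all disjoint local datasets
    return sum(size / total * loss
               for loss, size in zip(client_losses, dataset_sizes))


# Example: two clients with losses 0.8 and 0.2 on 100 and 300 samples.
obj = global_objective([0.8, 0.2], [100, 300])  # 0.25*0.8 + 0.75*0.2 = 0.35
```

Minimizing this quantity over the client adapters and the server adapters is exactly the optimization stated in Eq. (1).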
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Training Workflow of LESFT</title>
<p>This section details the overall workflow of the proposed LESFT framework. LESFT&#x2019;s core innovation, which sets it apart from existing methods, is a fine-grained, token-level fine-tuning mechanism. We integrate this mechanism into a hierarchical split collaborative fine-tuning architecture. This approach not only enhances the model&#x2019;s expressive power but also effectively reduces resource overhead. Simultaneously, LESFT incorporates the PEFT technique, LoRA, to further enhance the adaptability and efficiency of the training process.</p>
<p>During the training initialization phase, the central server first processes the base weights of the large model to be fine-tuned and partitions them into a server-side sub-model and a client-side sub-model to accommodate the distributed computational resources. Subsequently, LESFT performs distributed fine-tuning over <italic>I</italic> consecutive local training rounds. In each round, the clients update only their local adapter parameters, while the server updates its assigned model parameters. After completing <italic>I</italic> iterations, the aggregation server integrates the sets of LoRA adapter parameters, <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, uploaded by each client to obtain the global adapter parameters, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula>, and distributes them to all clients to serve as the initialization for subsequent training. This process is repeated until the global model converges or the maximum number of training rounds, <italic>R</italic>, is reached.</p>
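The outer loop just described, local updates every round and adapter aggregation every <italic>I</italic> rounds, can be sketched as follows; the function names are illustrative, and adapters are flattened to plain lists of floats for brevity:

```python
# Sketch of the LESFT outer loop: clients train locally each round, and
# every I rounds the aggregation server computes a size-weighted average
# of the client LoRA adapters and redistributes it as the new start point.
def aggregate_adapters(client_adapters, dataset_sizes):
    """Size-weighted average of per-client adapter parameter vectors."""
    total = sum(dataset_sizes)
    dim = len(client_adapters[0])
    return [sum(w / total * theta[k]
                for theta, w in zip(client_adapters, dataset_sizes))
            for k in range(dim)]


def run_rounds(client_adapters, dataset_sizes, R, I, local_step):
    """R training rounds; aggregation is triggered every I rounds."""
    for r in range(1, R + 1):
        # Phase 1: each client updates its own adapters on local data.
        client_adapters = [local_step(theta) for theta in client_adapters]
        # Phase 2: every I rounds, average and redistribute as initialization.
        if r % I == 0:
            theta_bar = aggregate_adapters(client_adapters, dataset_sizes)
            client_adapters = [list(theta_bar) for _ in client_adapters]
    return client_adapters
```

In the actual framework `local_step` is the split fine-tuning step of Phase 1, and the loop terminates early once the global model converges.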
<p>Overall, the training process of LESFT consists of two main phases: (i) the split natural language fine-tuning phase, which is executed in every training round; and (ii) the client adapter aggregation phase, which is triggered every <italic>I</italic> rounds. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> illustrates the overall workflow of LESFT, where the training round index is <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>r</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x211B;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>R</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>.</p>
<p><bold>Phase 1. Split Natural Language Fine-Tuning Phase:</bold> The split natural language fine-tuning phase involves client-side and server-side fine-tuning performed by the participating clients and the central server in each training round. This phase consists of the following seven steps.</p>
<p><bold>1. Client-Side Forward Propagation:</bold> In this step, all participating clients <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:math></inline-formula> perform forward propagation on their local sub-models in parallel. Specifically, each client <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>i</mml:mi></mml:math></inline-formula> randomly samples a mini-batch of size <italic>B</italic>, denoted as <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo
stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, from its local dataset <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. This batch contains two sets of input data: the base input <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and a reference input for contrastive learning, <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msubsup><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo 
stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
<p>The client sub-model consists of fixed pre-trained weights <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and trainable LoRA adapters <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> specific to client <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>i</mml:mi></mml:math></inline-formula> in round <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>r</mml:mi></mml:math></inline-formula>. 
After feeding both sets of inputs, <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msubsup><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, to the client sub-model, activations are generated at the cut layer. The computation is as follows:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denote the mappings from the base input <italic>X</italic> and the reference input <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mrow><mml:mover><mml:mi>X</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, respectively, to the activations at the cut layer, given the model parameters <italic>W</italic> and the set of trainable LoRA adapters <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi mathvariant="normal">&#x0398;</mml:mi></mml:math></inline-formula>. 
Finally, these activations, <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, are transmitted to the central server for subsequent computations.</p>
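The cut-layer computation in Eqs. (2) and (3) can be sketched as follows; `phi` stands for the client sub-model, and the weight shapes and helper `matmul` are illustrative simplifications (a single LoRA-augmented linear layer rather than a full transformer stack):

```python
# Sketch of the client-side forward pass in Eqs. (2)-(3): the frozen
# weights W_c are combined with a trainable low-rank update U @ V (LoRA),
# and the same sub-model phi is applied to both the base batch X and the
# reference batch X_tilde to produce the cut-layer activations.
def matmul(A, B):
    """Plain-Python matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def phi(W_c, U, V, X):
    """Cut-layer activations: X @ (W_c + U @ V); W_c frozen, U and V trainable."""
    delta = matmul(U, V)  # low-rank LoRA update, rank = inner dimension
    W_eff = [[w + d for w, d in zip(w_row, d_row)]  # W_c + U V
             for w_row, d_row in zip(W_c, delta)]
    return matmul(X, W_eff)


# Both batches pass through the same adapter-augmented sub-model:
#   S_i^(r)       = phi(W_c, U, V, X_i^(r))
#   S_tilde_i^(r) = phi(W_c, U, V, X_tilde_i^(r))
```

Because only `U` and `V` receive gradients, the client's trainable footprint stays small, which is the point of attaching LoRA adapters to the frozen weights.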
<p><bold>2. Upload of Intermediate Activations and Supervision Signals:</bold> After the client completes its local forward propagation, each client <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>i</mml:mi></mml:math></inline-formula> uploads its two generated sets of activations, <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, along with the corresponding mini-batch labels <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msubsup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, to the central server via a communication link. The server uses these collected activations as input for its server-side sub-model to proceed with the subsequent training steps.</p>
<p><bold>3. Central Server Forward Computation:</bold> Upon receiving the activations and labels from all participating clients, the central server feeds these activations into its server-side model to perform the server-side forward pass. The concatenated activation matrices, <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, are represented as: <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>;</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-70"><mml:math 
id="mml-ieqn-70"><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. Subsequently, the server inputs these two aggregated activation matrices into its server-side sub-model. 
The server sub-model consists of fixed pre-trained weights <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and trainable LoRA adapters <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> for round <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>r</mml:mi></mml:math></inline-formula>. Using the server-side mapping function <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the final predictions are calculated:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-76"><mml:math 
id="mml-ieqn-76"><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are the model&#x2019;s reasoning results corresponding to the client base inputs and reference inputs, respectively.</p>
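The server-side forward passes in Eqs. (4) and (5) can be sketched as follows. This is a minimal illustration in which a single toy linear layer stands in for the frozen backbone Φ plus the server-side weights; the function and variable names (server_forward, S_base, S_ref) are ours, not the paper's, and the real ψ is of course a transformer sub-model rather than a dot product.

```python
def server_forward(weights, activations):
    """Run a toy 'server sub-model' on smashed activations from one path.

    weights:     per-feature weights (stand-in for W_s and the frozen Phi)
    activations: one activation vector per sample received from the clients
    """
    # A single linear layer stands in for the server-side transformer layers.
    return [sum(w * a for w, a in zip(weights, act)) for act in activations]

# Base-path and reference-path activations are processed with the SAME
# server parameters, mirroring Eqs. (4) and (5).
W_s = [0.5, -0.25, 1.0]
S_base = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.0]]
S_ref  = [[1.0, 2.0, 0.6], [0.1, 1.0, 1.0]]

Y_hat_base = server_forward(W_s, S_base)   # predictions for base inputs
Y_hat_ref  = server_forward(W_s, S_ref)    # predictions for reference inputs
```

Both paths share one set of server parameters, so the reference path adds only a second forward pass, not a second model.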
<p><bold>4. Token-Level Loss Computation:</bold> After the server-side model computes the reasoning results, the model base predictions <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and reference predictions <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and the ground-truth labels <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msup><mml:mi>Y</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are used to calculate an improved token-level loss function, <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. This loss function assigns different weights to each token to guide the model in learning reasoning logic. 
Its specific formulation will be detailed in <xref ref-type="sec" rid="s3_4">Section 3.4</xref>.</p>
<p><bold>5. Central Server Gradient Computation and Parameter Update:</bold> After computing the token-level loss <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, the central server performs backward propagation to calculate the gradients for the server-side LoRA adapter parameters. Specifically, for the <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mi>m</mml:mi></mml:math></inline-formula>-th server-side LoRA adapter, the gradients for its decomposition matrices <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> are denoted as <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, respectively. These parameters are then updated using a gradient descent algorithm:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the learning rate on the server side. This update process ensures that the server-side adapters can effectively learn from the aggregated information from all clients.</p>
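The updates in Eqs. (6) and (7) are plain gradient-descent steps on the low-rank factors. A minimal sketch, with illustrative names and toy 2×2 matrices (the same elementwise rule applies client-side with learning rate γ_c in Eqs. (8) and (9)):

```python
def sgd_update(param, grad, lr):
    """Elementwise gradient-descent step: param <- param - lr * grad (Eqs. 6-7)."""
    return [[p - lr * g for p, g in zip(prow, grow)]
            for prow, grow in zip(param, grad)]

gamma_s = 0.1                       # server-side learning rate gamma_s
U = [[1.0, 0.0], [0.0, 1.0]]        # low-rank factor U^(m, r-1)
G_U = [[0.2, -0.4], [0.0, 0.6]]     # its gradient G_{U,s}^{(m,r)}

U_new = sgd_update(U, G_U, gamma_s)  # U^(m, r)
```

V is updated identically with its own gradient; only the LoRA factors move, while the backbone weights stay frozen.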
<p><bold>6. Backward Transmission of Gradients to Clients:</bold> After the server completes its backward propagation and updates its LoRA adapter parameters, it computes the gradients of the loss function <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> with respect to the input activations <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which are <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi>&#x1D4AE;</mml:mi></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>. Subsequently, the server partitions these gradients according to their client origin and transmits them to the corresponding participating clients. Specifically, each client <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>i</mml:mi></mml:math></inline-formula> receives its corresponding gradient components <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, which will serve as the input for the continued backward propagation on the client&#x2019;s local model.</p>
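The gradient partitioning in step 6 amounts to slicing the batch-level activation gradient back into per-client segments. A minimal sketch, assuming per-sample gradient rows were concatenated in client order during the forward pass (the names partition_gradients and client_sizes are illustrative):

```python
def partition_gradients(grad_rows, client_sizes):
    """Split a batch-concatenated activation gradient into per-client slices.

    grad_rows:    per-sample gradient rows, in the order clients were concatenated
    client_sizes: number of samples each client contributed, in the same order
    """
    out, start = [], 0
    for size in client_sizes:
        out.append(grad_rows[start:start + size])  # slice for one client
        start += size
    return out

# Three clients contributed 2, 1, and 3 samples to the round's batch.
grads = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
per_client = partition_gradients(grads, [2, 1, 3])
# per_client[0] == [[0.1], [0.2]]
```

Each slice is then transmitted to its originating client, where backward propagation continues through the local sub-model.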
<p><bold>7. Client-Side Local Parameter Update:</bold> In this step, each client, based on the received activation gradients, continues the backward propagation process on its local sub-model to update the client-side LoRA adapter parameters. For a client <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>i</mml:mi></mml:math></inline-formula>, the decomposition matrices <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of its <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>n</mml:mi></mml:math></inline-formula>-th LoRA adapter are updated via gradient descent:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo stretchy="false">&#x2190;</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>U</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msubsup><mml:mi>G</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> 
are the gradients computed in the current training round <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>r</mml:mi></mml:math></inline-formula> for the matrices <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of the <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mi>n</mml:mi></mml:math></inline-formula>-th LoRA adapter of client <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mi>i</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> is the local learning rate for the client. This process allows clients to optimize their local adapter parameters under the guidance of the global loss signal, achieving a more collaborative and efficient distributed fine-tuning.</p>
<p><bold>Phase 2. Periodic Global Adapter Fusion Cycle:</bold> The client adapter aggregation phase is primarily executed by the aggregation server. Its core objective is to integrate and fuse the local LoRA adapter parameters uploaded by each client to achieve global knowledge sharing and model performance improvement. This phase is executed once every <italic>I</italic> training rounds and consists of the following three steps:</p>
<p><bold>8. Upload of Local Adapter Parameters:</bold> In this step, all participating clients <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow></mml:math></inline-formula> upload their current client-side LoRA adapter parameter sets, <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>, to the aggregation server via wireless or wired links. This process only transmits the adapter parameters and does not involve any raw data, thereby ensuring that user privacy is protected.</p>
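The communication savings of uploading only adapter parameters can be made concrete with a back-of-envelope count. The sketch below uses the standard LoRA shapes (U: d×r, V: r×k); the dimensions and rank are hypothetical round numbers, not the paper's configuration:

```python
def lora_param_count(d, k, rank):
    """Parameters in one LoRA adapter (U: d x rank, V: rank x k),
    versus the full d x k weight matrix it adapts."""
    return d * rank + rank * k

# Hypothetical example: a 4096 x 4096 attention projection, LoRA rank 8.
full_weight = 4096 * 4096
adapter = lora_param_count(4096, 4096, 8)
# adapter / full_weight ~= 0.004: well under 1% of the weights leave the device
```

Since only these small factors (and never raw data or backbone weights) are transmitted, the upload step stays lightweight even over constrained wireless links.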
<p><bold>9. Weighted Federated Aggregation:</bold> Upon receiving the adapter parameters from all clients, the aggregation server performs a weighted average of their LoRA parameters based on the size of each client&#x2019;s local dataset to generate a globally unified client LoRA adapter, <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mover><mml:mi>U</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mover><mml:mi>V</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>. Specifically, the decomposition matrices of the <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>n</mml:mi></mml:math></inline-formula>-th LoRA adapter are aggregated as follows:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mover><mml:mi>U</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:msubsup><mml:mi>U</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msup><mml:mover><mml:mi>V</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x1D49F;</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>This weighted aggregation strategy accounts for the differences in data distribution among clients, which helps to improve the generalization ability and convergence stability of the global model.</p>
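The dataset-size-weighted average of Eqs. (10) and (11) can be sketched as follows; fedavg_lora is an illustrative name, and the toy matrices stand in for one client-side factor U_i^(n,r):

```python
def fedavg_lora(factors, dataset_sizes):
    """Dataset-size-weighted average of one LoRA factor across clients (Eqs. 10-11)."""
    total = sum(dataset_sizes)                     # |D| = sum of |D_i|
    rows, cols = len(factors[0]), len(factors[0][0])
    avg = [[0.0] * cols for _ in range(rows)]
    for U_i, n_i in zip(factors, dataset_sizes):
        w = n_i / total                            # weight |D_i| / |D|
        for r in range(rows):
            for c in range(cols):
                avg[r][c] += w * U_i[r][c]
    return avg

U_clients = [[[1.0, 0.0]], [[3.0, 2.0]]]   # U_i^(n,r) from two clients
sizes = [1, 3]                              # local dataset sizes |D_i|
U_bar = fedavg_lora(U_clients, sizes)
# U_bar == [[2.5, 1.5]] -- the larger dataset dominates the average
```

The V factors are aggregated with the same weights, so clients holding more data pull the global adapter proportionally closer to their local optimum.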
<p><bold>10. Distribution of Global Adapter Parameters:</bold> After the aggregation is complete, the aggregation server distributes the global adapter parameters <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> back to each participating client. Upon receiving <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mover><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula>, each client uses it as the initial parameters for its local adapters in the next training phase. This step ensures that all clients proceed with fine-tuning from a consistent global starting point in subsequent training, thereby promoting the overall consistency and efficiency of the federated training.</p>
<p>The overall training process of LESFT is summarized in Algorithm 1.</p>
<fig id="fig-8">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-8.tif"/>
</fig>
<p>Through the algorithmic process outlined above, the LESFT framework effectively coordinates the computational resources of edge devices and cloud servers while preserving data privacy, providing a systematic solution for the efficient fine-tuning of LLMs in edge environments. The framework not only reduces the computational burden on the edge side but also significantly enhances training efficiency in few-shot scenarios through its natural language fine-tuning mechanism.</p>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Token Loss Calculation</title>
<p>Traditional SFT optimizes by minimizing the cross-entropy loss between predictions and ground-truth labels. It treats all tokens in a reasoning chain equally, which can easily lead the model to rely on pattern memorization while neglecting logical deduction. Natural Language Fine-Tuning (NLFT) builds upon this by introducing dynamic weighting and language feedback. On one hand, it identifies salient tokens from positive and negative samples&#x2019; reasoning chains to enhance the model&#x2019;s sensitivity to reasoning. On the other hand, it uses natural language feedback to locate critical steps and penalize erroneous ones, shifting the optimization objective from simple answer fitting to capability-guided learning.</p>
<p>NLFT assumes that answers generated by a large model can be categorized as either correct or incorrect, and it identifies salient tokens by comparing different prompts. In practical edge fine-tuning, however, model outputs are often partially correct and difficult to classify as entirely wrong. In a translation task, for example, multiple results may be semantically correct yet differ in fluency and style, making it hard to single out a clear winner. Furthermore, NLFT constructs one positive and two distinct negative reasoning paths, which effectively triples the processing load per sample and significantly increases computational requirements and latency.</p>
<p>To address this, we propose an improved scheme that constructs only two paths: a base input and a reference input. This design reduces redundant computation. The base input provides minimal context, while the reference input serves as a quality comparator to guide the model&#x2019;s learning. This introduces soft prior knowledge and forms a directional constraint on the original task, maintaining the effectiveness of contrastive learning while significantly reducing latency and computational overhead.</p>
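The intent of the two-path design can be sketched with a toy token-level loss: tokens where the reference path disagrees with the base path are treated as salient and weighted up. The weighting rule below is an illustrative stand-in of our own, not the paper's formulation, which is defined precisely from Eq. (14) onward:

```python
import math

def token_weighted_loss(p_base, p_ref):
    """Illustrative token-level loss for the two-path scheme.

    p_base, p_ref: per-token probabilities of the ground-truth token under
    the base input and the reference input, respectively. The weighting
    rule (1 + |gap|) is a hypothetical stand-in for the paper's weights.
    """
    loss = 0.0
    for pb, pr in zip(p_base, p_ref):
        weight = 1.0 + abs(pr - pb)          # salient tokens get extra weight
        loss += -weight * math.log(pb)       # weighted negative log-likelihood
    return loss / len(p_base)
```

When the two paths agree on every token, this reduces to plain cross-entropy; disagreement concentrates the gradient on the tokens where the reference context changes the model's belief, which is the contrastive signal the two-path construction is meant to preserve.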
<p>For a mini-batch of samples <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:msubsup><mml:mrow><mml:mi>&#x0212C;</mml:mi></mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> drawn by client <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mi>i</mml:mi></mml:math></inline-formula> in round <inline-formula id="ieqn-147"><mml:math 
id="mml-ieqn-147"><mml:mi>r</mml:mi></mml:math></inline-formula>, the server receives activations <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> from the client and generates two sets of predictions:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mrow><mml:mover><mml:mi>Y</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C8;</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="1.2em" minsize="1.2em">|</mml:mo></mml:mrow></mml:mstyle><mml:mspace width="thinmathspace" /><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Let the target output sequence be <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. For a token at any position <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>t</mml:mi></mml:math></inline-formula>, we define the conditional probabilities under the two paths:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x003C;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>x</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x003C;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant="normal">&#x0398;</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi mathvariant="normal">&#x03A6;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>When analyzing the reasoning chain generated by the model, we first identify Salient Tokens based on the conditional probability of the reference path, <inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. Specifically, if the probability <inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of a token <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> exceeds a predefined threshold <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:msup></mml:math></inline-formula>, it is classified into the salient token set, <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mtext>sal</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. Next, using these salient tokens as cluster centers, we apply cosine similarity&#x2013;based semantic clustering to group semantically related tokens, which are then labeled as Sub-salient Tokens and placed in the set <inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mtext>subsal</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. Finally, tokens not assigned to either category are regarded as Irrelevant Tokens and are collected in the set <inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mtext>irrel</mml:mtext></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
<p>For these three categories of tokens, we define a scaling weight <inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> based on the reference path probability:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mi>S</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mspace width="negativethinmathspace" /><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="negativethinmathspace" /><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>subsal</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="negativethinmathspace" /><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mfrac><mml:msubsup><mml:mrow><mml:mover><mml:mi>P</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>sal</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mi>Y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>irrel</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> are tunable hyperparameters, typically set with <inline-formula id="ieqn-161"><mml:math 
id="mml-ieqn-161"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x003C;</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> so that sub-salient tokens receive higher weights than irrelevant tokens. For salient tokens, the formulation includes an additive constant <inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mn>1</mml:mn></mml:math></inline-formula>, which ensures that their scaling value is always greater than <inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mn>1</mml:mn></mml:math></inline-formula>, thereby amplifying their influence in the gradient computation during loss optimization.</p>
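<p>The piecewise scaling weight of Eq. (16) translates directly into a small function; a sketch is given below, where the gamma and <monospace>p_sal</monospace> defaults are the values reported in Section 4.3 and the string category labels are illustrative.</p>

```python
def scaling_weight(p_ref, category, p_sal=0.95, g1=5.0, g2=0.3, g3=0.6):
    """Token scaling weight S(y) from Eq. (16).

    p_ref is the reference-path probability P~ of the token; category is
    one of "sal", "subsal", "irrel".
    """
    if category == "sal":
        # Additive constant 1 keeps salient weights strictly above 1,
        # amplifying their gradient contribution.
        return 1.0 + ((p_ref - p_sal) / (1.0 - p_sal)) ** g1
    if category == "subsal":
        return (p_ref / p_sal) ** g2
    # Irrelevant tokens: since p_ref / p_sal < 1 and g3 > g2, these
    # receive smaller weights than sub-salient tokens.
    return (p_ref / p_sal) ** g3
```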
<p>After obtaining the scaling weight for each token, we use a weighted cross-entropy on the predicted probabilities <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> of the model&#x2019;s reasoning chain as the final loss:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:msubsup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mtext>&#x00A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x00A0;</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:munderover><mml:mi>S</mml:mi><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:math></disp-formula></p>
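<p>As a minimal sketch, the weighted cross-entropy of Eq. (17) can be computed as follows, assuming per-token predicted probabilities from the user path and precomputed scaling weights:</p>

```python
import math

def weighted_ce_loss(pred_probs, weights):
    """Weighted cross-entropy of Eq. (17): the negative mean over the
    target sequence of S(y_t) * log P_t, where P_t is the user-path
    predicted probability of the target token at position t."""
    assert len(pred_probs) == len(weights)
    T = len(pred_probs)
    return -sum(w * math.log(p) for p, w in zip(pred_probs, weights)) / T
```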
<p>This weighted design preserves the emphasis on logically critical tokens from NLFT while using only two input paths, thereby significantly reducing the additional computational and communication overhead in edge scenarios.</p>
<p>The loss calculation process of LESFT is summarized in Algorithm 2.</p>
<fig id="fig-9">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-9.tif"/>
</fig>
<p>This section has detailed the overall architecture, training process, and core token-level loss calculation mechanism of the LESFT framework. By introducing natural language fine-tuning into a split federated fine-tuning architecture, the framework provides a systematic solution for the efficient fine-tuning of large models in edge environments. To validate the practical performance and effectiveness of the proposed framework, the next section presents a series of comprehensive comparative experiments and analyses conducted on public benchmark datasets.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets and Data Preparation</title>
<p>To evaluate the generalizability and performance of the proposed framework, this study utilizes three distinct reasoning benchmarks. To simulate data-constrained edge scenarios and test few-shot learning efficiency, we uniformly and randomly sampled 800 instances from the official training set of each dataset.</p>
<p><bold>GSM8K dataset.</bold> This dataset contains high-quality, grade-school-level math word problems. The official training split comprises 7473 instances and the official test split comprises 1319 instances.</p>
<p><bold>CommonsenseQA dataset.</bold> This dataset is designed for commonsense question answering and consists of multiple-choice questions. The official training split comprises 9741 instances and the official test split comprises 1140 instances.</p>
<p><bold>AQUA_RAT dataset.</bold> This dataset targets algebraic question answering with rationale annotations and provides both programmatic and textual solutions. The official training split comprises 97,467 instances and the official test split comprises 254 instances.</p>
<p>Considering that the base models used in our experiments, such as Qwen2.5-3B, do not possess robust mathematical reasoning capabilities before fine-tuning, it is not feasible to have them directly generate the reasoning chains required for training. Therefore, we adopt a data distillation strategy, using a more powerful teacher model to generate high-quality fine-tuning data. Specifically, we utilize the Llama-3-8B-Instruct model, which achieves an accuracy of 77% on GSM8K, 79% on CommonsenseQA, and 81% on AQUA_RAT, to ensure the reliability of the generated supervision signals. For each dataset, we randomly select 800 samples correctly answered by the teacher model to construct the fine-tuning subset. The selected responses, combined with Chain-of-Thought prompting, provide detailed reasoning chains that serve as the ideal user outputs the student model needs to learn. Concurrently, the original solution steps from each dataset are used as the reference input to provide the constraint signal, thereby satisfying the dual-path input requirement of the LESFT framework.</p>
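<p>The distillation and selection procedure above can be sketched as follows. The record fields and the <monospace>teacher_answer</monospace> callable, which stands in for Llama-3-8B-Instruct inference with Chain-of-Thought prompting, are hypothetical names introduced for illustration.</p>

```python
import random

def build_finetune_subset(dataset, teacher_answer, n_samples=800, seed=0):
    """Sketch of the data-distillation step: keep only questions the
    teacher model answers correctly, then randomly sample up to
    n_samples of them. Each record pairs the teacher's reasoning chain
    (ideal user output) with the dataset's original solution steps
    (reference input), matching LESFT's dual-path requirement."""
    correct = [
        {"question": ex["question"],
         "user_output": ans["reasoning_chain"],
         "reference_input": ex["solution"]}
        for ex in dataset
        if (ans := teacher_answer(ex["question"]))["final"] == ex["answer"]
    ]
    rng = random.Random(seed)
    return rng.sample(correct, min(n_samples, len(correct)))
```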
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Baseline Algorithms</title>
<p>To comprehensively evaluate our framework, we compare it against four representative fine-tuning methods, spanning centralized to distributed paradigms.</p>
<p><bold>Centralized Supervised Fine-Tuning (Centralized SFT).</bold> This baseline reflects the upper-bound performance in a non-privacy-preserving setting, where a central server directly fine-tunes the LLM with LoRA on the entire training set, comprising 800 GSM8K samples in our study.</p>
<p><bold>Federated Averaging (FedAvg).</bold> The canonical algorithm in federated learning, adapted here for LoRA-based fine-tuning. In each round, clients train local adapters on their data and upload updates to the server, which aggregates them via weighted averaging. Raw data remain local, ensuring privacy.</p>
<p><bold>Split Learning (SL).</bold> A vertical partitioning strategy in which clients hold the front layers and the server holds the remaining ones. Clients send intermediate activations to the server, which completes the forward pass, computes the loss, and returns gradients. No aggregation across clients is performed, making SL suitable for evaluating performance without horizontal knowledge sharing.</p>
<p><bold>SplitLoRA.</bold> An advanced variant combining split learning with federated aggregation. Adapter parameters are decomposed into shared and private parts: shared layers are collaboratively trained across clients, while private layers are updated locally to preserve personalization. This design balances knowledge sharing and data heterogeneity.</p>
<p><bold>SplitFrozen [<xref ref-type="bibr" rid="ref-47">47</xref>].</bold> This method is a split learning framework where the initial model layers deployed on client devices are frozen. Clients execute only a forward pass and transmit the resulting activations to a central server. The server holds the remaining layers and manages all training updates, applying parameter-efficient fine-tuning via LoRA exclusively to its portion of the model. This configuration avoids client-side backward propagation and can be combined with pipeline parallelism to reduce device idle time.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Experimental Setups</title>
<p>We conducted experiments on three models from the Qwen2.5 series with different parameter scales: Qwen2.5-0.5B, Qwen2.5-1.5B, and Qwen2.5-3B. These cover lightweight to medium-size configurations and enable validation under different resource constraints. All experiments used the original pre-trained versions to avoid potential bias introduced by instruction tuning.</p>
<p>A vertical splitting strategy was adopted to simulate edge devices with limited resources. The client was assigned the tokenizer, the embedding layer, and the first quarter of the Transformer layers, while the server processed the remaining layers and the language model head. This allocation reduced the computational burden on the client and enhanced data privacy. The same splitting strategy was applied to all baselines to ensure fairness in comparison.</p>
<p>All experiments were performed on a unified high-performance computing platform equipped with an NVIDIA RTX 4090 GPU, PyTorch 2.3.0, and CUDA 12.1. To guarantee consistency, we used AdamW as the optimizer with a learning rate of 5e&#x2212;5. The batch size per client was set to 1 with gradient accumulation of 4, resulting in an effective batch size of 4. Local training was conducted for 10 epochs on a dataset of 800 samples. Four clients were simulated, and evaluation was performed using identical prompts and validation protocols.</p>
<p>The LoRA configuration was kept uniform across all experiments. The rank was set to 8, lora_alpha to 16, and the dropout rate to 0.2. LoRA was applied to the gate_proj, down_proj, and up_proj linear layers. The salient token threshold was fixed at 0.95, with hyperparameters <inline-formula id="ieqn-184"><mml:math id="mml-ieqn-184"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-185"><mml:math id="mml-ieqn-185"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.3</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-186"><mml:math id="mml-ieqn-186"><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.6</mml:mn></mml:math></inline-formula>.</p>
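<p>For concreteness, the LoRA configuration above might be expressed as in the following sketch, assuming the Hugging Face <monospace>peft</monospace> library; the paper does not name its implementation, so the use of <monospace>peft</monospace> is an assumption.</p>

```python
# Sketch of the LoRA setup described above, assuming the Hugging Face
# `peft` library (an assumption; the implementation is not named in the paper).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                  # LoRA rank
    lora_alpha=16,
    lora_dropout=0.2,
    target_modules=["gate_proj", "down_proj", "up_proj"],
    task_type="CAUSAL_LM",
)
# The adapted model would then be created with, e.g.:
#   from peft import get_peft_model
#   model = get_peft_model(base_model, lora_config)
```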
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Comparative Analysis of Accuracy and System Overhead</title>
<p>The empirical results are summarized in <xref ref-type="table" rid="table-1">Table 1</xref> and demonstrate that the proposed LESFT framework consistently outperforms all baseline methods across all three reasoning benchmarks and model scales, validating the generalizability of our approach. On the Qwen2.5-3B model, LESFT achieves 76.04% accuracy on GSM8K, 78.13% on CommonsenseQA, and 71.26% on AQUA_RAT, representing significant relative improvements over the strongest baselines: on GSM8K, for example, LESFT achieves a 34.4% relative gain over SplitFrozen (56.56%) and a 36.1% relative gain over FedAvg (55.88%).</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison of accuracy on GSM8K, CommonsenseQA, and AQUA_RAT datasets across different fine-tuning methods and model scales. The best result for each model scale is marked in bold</title>
</caption>
<table>
<colgroup>
<col align="center" width="25mm"/>
<col align="center" width="23mm"/>
<col align="center" width="11mm"/>
<col align="center" width="15mm"/>
<col align="center" width="11mm"/>
<col align="center" width="15mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/> </colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Model</th>
<th>Base</th>
<th>Centralized SFT</th>
<th>SL</th>
<th>SplitLoRA</th>
<th>SplitFrozen</th>
<th>FedAvg</th>
<th>LESFT (Ours)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><bold>GSM8K</bold></td>
<td>Qwen2.5-0.5B</td>
<td>18.57</td>
<td>17.44</td>
<td>15.85</td>
<td>21.91</td>
<td>29.26</td>
<td>23.12</td>
<td><bold>40.94</bold></td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td>17.36</td>
<td>42.76</td>
<td>43.21</td>
<td>44.73</td>
<td>47.54</td>
<td>46.10</td>
<td><bold>61.49</bold></td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>18.27</td>
<td>50.87</td>
<td>54.28</td>
<td>55.65</td>
<td>56.56</td>
<td>55.88</td>
<td><bold>76.04</bold></td>
</tr>
<tr>
<td rowspan="3"><bold>CommonsenseQA</bold></td>
<td>Qwen2.5-0.5B</td>
<td>23.59</td>
<td>41.44</td>
<td>32.76</td>
<td>40.21</td>
<td>36.53</td>
<td>44.06</td>
<td><bold>54.46</bold></td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td>24.24</td>
<td>73.38</td>
<td>62.82</td>
<td>60.20</td>
<td>67.24</td>
<td>62.82</td>
<td><bold>75.27</bold></td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>45.21</td>
<td>76.41</td>
<td>66.99</td>
<td>71.99</td>
<td>74.88</td>
<td>71.33</td>
<td><bold>78.13</bold></td>
</tr>
<tr>
<td rowspan="3"><bold>AQUA_RAT</bold></td>
<td>Qwen2.5-0.5B</td>
<td>22.44</td>
<td>27.95</td>
<td>21.65</td>
<td>26.77</td>
<td>27.17</td>
<td>23.62</td>
<td><bold>38.19</bold></td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td>33.46</td>
<td>37.01</td>
<td>37.01</td>
<td>40.16</td>
<td>39.37</td>
<td>38.58</td>
<td><bold>53.15</bold></td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td>43.31</td>
<td>37.40</td>
<td>44.88</td>
<td>42.50</td>
<td>52.36</td>
<td>41.73</td>
<td><bold>71.26</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The Base models without fine-tuning perform poorly on all tasks, underscoring the necessity of fine-tuning. However, conventional Centralized SFT and SL show inconsistent results, even degrading performance on the 0.5B model for GSM8K. This suggests that standard SFT struggles with catastrophic forgetting in low-resource, low-capacity settings.</p>
<p>The more advanced methods, namely FedAvg, SplitLoRA, and SplitFrozen, yield more stable and consistent gains. SplitFrozen in particular performs strongly on GSM8K, reaching 56.56% on the 3B model, which demonstrates the effectiveness of its client-side freezing approach. Nevertheless, the absolute performance of these methods remains limited, indicating that architectural optimizations alone are insufficient to solve the core data-efficiency problem in few-shot reasoning.</p>
<p>LESFT&#x2019;s superior performance stems from its learning mechanism: whereas the baselines rely solely on architectural optimization, LESFT introduces a token-level weighting strategy driven by a dual-path contrastive signal, which guides the model to focus on logically critical tokens rather than optimizing all tokens equally. This semantic-level supervision mitigates gradient noise in few-shot scenarios, enabling the model to stably activate and calibrate its pre-trained reasoning capabilities from limited data.</p>
<p>Furthermore, the experimental results reveal the strong scalability of LESFT. As the model scale increases from 0.5B to 3B, the performance gap between LESFT and every baseline widens across all three datasets, indicating that on larger models LESFT can more fully exploit its token selection and weighting mechanism, thereby making more efficient use of the parameter capacity and pre-trained knowledge of large models.</p>
<p><xref ref-type="table" rid="table-2">Table 2</xref> quantifies the system overhead; the reported metrics are the average latency and total communication volume per epoch. The N/A values for Centralized SFT and FedAvg confirm their infeasibility for client-side training, while all split-based methods operate within a manageable GPU footprint. Our LESFT framework, however, incurs notably higher client latency and communication volume than baselines such as SplitFrozen. This increased cost is an inherent consequence of the dual-path mechanism, which is essential for generating the contrastive signal. The design therefore represents a deliberate trade-off: a moderate, acceptable system overhead is exchanged for the significant task accuracy improvements demonstrated in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Quantitative comparison of computational and communication overhead. We report average GPU memory footprint (MB), total communication volume (MB), and average training latency (s) for both client and server across all frameworks. N/A denotes metrics that are Not Applicable for a given framework</title>
</caption>
<table>
<colgroup>
<col align="center" width="30mm"/>
<col align="center" width="14mm"/>
<col align="center" width="14mm"/>
<col align="center" width="14mm"/>
<col align="center" width="14mm"/>
<col align="center" width="14mm"/>
<col align="center" width="14mm"/> </colgroup>
<thead>
<tr>
<th rowspan="2">Framework</th>
<th align="center" colspan="2">Avg. GPU memory (MB)</th>
<th align="center" colspan="2">Total comm. (MB)</th>
<th align="center" colspan="2">Avg. latency (s)</th>
</tr>
<tr>
<th>Client</th>
<th>Server</th>
<th>Client</th>
<th>Server</th>
<th>Client</th>
<th>Server</th>
</tr>
</thead>
<tbody>
<tr>
<td>Centralized SFT</td>
<td>N/A</td>
<td>3576</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>72.16</td>
</tr>
<tr>
<td>SL</td>
<td>906</td>
<td>2718</td>
<td>2151</td>
<td>2151</td>
<td>10.02</td>
<td>30.07</td>
</tr>
<tr>
<td>SplitLoRA</td>
<td>985</td>
<td>2954</td>
<td>2354</td>
<td>2354</td>
<td>7.28</td>
<td>21.85</td>
</tr>
<tr>
<td>SplitFrozen</td>
<td>709</td>
<td>2836</td>
<td>1076</td>
<td>1076</td>
<td>4.49</td>
<td>13.47</td>
</tr>
<tr>
<td>FedAvg</td>
<td>N/A</td>
<td>3643</td>
<td>N/A</td>
<td>202</td>
<td>N/A</td>
<td>19.73</td>
</tr>
<tr>
<td>LESFT (Ours)</td>
<td>1054</td>
<td>3161</td>
<td>3027</td>
<td>3027</td>
<td>10.74</td>
<td>32.21</td>
</tr>
</tbody>
</table>
</table-wrap>
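The roughly doubled communication volume of LESFT relative to single-path split baselines in Table 2 follows directly from transmitting activations and gradients for both paths. A back-of-the-envelope estimate, with purely hypothetical tensor shapes and precision:

```python
def comm_volume_mb(batches_per_epoch, batch_size, seq_len, hidden_dim,
                   bytes_per_value=2, num_paths=1):
    """Rough per-epoch client-server traffic for split training: each batch
    sends one activation tensor up and receives one gradient tensor of the
    same shape back, and a dual-path design doubles both. (Illustrative
    estimate with hypothetical shapes; protocol and adapter-synchronization
    overheads are ignored.)"""
    tensor_mb = batch_size * seq_len * hidden_dim * bytes_per_value / (1024 ** 2)
    return batches_per_epoch * num_paths * 2 * tensor_mb  # forward + backward

single = comm_volume_mb(50, 4, 512, 2048)               # single-path baseline
dual = comm_volume_mb(50, 4, 512, 2048, num_paths=2)    # dual-path design
```

The estimate makes the trade-off explicit: the dual-path design doubles the per-epoch traffic of an otherwise identical single-path split, which is the order of difference seen between LESFT and SplitFrozen in Table 2.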
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Analysis of the Impact of Varying the Model Split Point</title>
<p>To investigate the robustness of our framework, we conducted an ablation study to evaluate the sensitivity of model performance to the network split point, which is a key hyperparameter. In a distributed learning setting, the choice of the split point directly impacts the client&#x2019;s computational load and the final performance; therefore, verifying the model&#x2019;s stability with respect to this choice is of significant importance. We established four different split ratios, deploying the first 1/2, 1/3, 1/4, and 1/5 of the network layers on the client, and recorded the accuracy changes of our method, as well as SL and SplitLoRA, at these different split points.</p>
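The split configurations above amount to a simple partition of the decoder layers. The sketch below is illustrative, assuming a 36-block stack (the Qwen2.5-3B configuration); a real deployment would also keep the embedding layer client-side so raw tokens never leave the device:

```python
def split_layers(num_layers, client_fraction):
    """Place the first `client_fraction` of transformer blocks on the client
    and the remainder on the server. (Illustrative sketch of the split-point
    hyperparameter studied in the ablation.)"""
    cut = max(1, round(num_layers * client_fraction))
    return list(range(cut)), list(range(cut, num_layers))

# The four ratios from the ablation, applied to a 36-block stack.
splits = {f: split_layers(36, f) for f in (1/2, 1/3, 1/4, 1/5)}
```

Smaller client fractions reduce the on-device compute and memory load, which is why sensitivity to this choice matters for edge deployment.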
<p>The experimental results are shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Our method exhibits high stability across all tested model scales and split configurations: its performance curve is considerably flatter than those of the baseline methods, and its accuracy consistently surpasses that of SL, SplitLoRA, and SplitFrozen. Specifically, on the largest Qwen2.5-3B model, the results in <xref ref-type="fig" rid="fig-3">Fig. 3c</xref> show that as the split point is adjusted from 1/2 to 1/5, our method&#x2019;s accuracy remains stable within the range of 75.59% to 76.65%, with minimal fluctuation. In contrast, the baseline methods are more sensitive to the choice of split point. The performance of SL and SplitLoRA fluctuates significantly as the split point changes; on the Qwen2.5-1.5B model, for example, the accuracy of the SL method varies by more than 10 percentage points. SplitFrozen is relatively more stable than SL and SplitLoRA, but its accuracy remains consistently lower than ours across all split configurations. This indicates that while SplitFrozen partially alleviates the sensitivity issue, it does not match the overall robustness and performance of our framework.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Impact of different model split layers on the accuracy of fine-tuning methods on the GSM8K dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-3.tif"/>
</fig>
<p>The results of this ablation study demonstrate that our proposed framework possesses good robustness, as its performance is not sensitive to the choice of the network split point. This offers significant convenience for its fine-tuning application in real-world scenarios.</p>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Analysis of the Impact of Varying the Number of Samples</title>
<p>To evaluate the data efficiency of our framework under few-shot conditions, we conducted an experiment where we systematically varied the number of samples used for fine-tuning. In this experimental setup, we simulated a distributed fine-tuning environment with 4 clients and adjusted the size of each client&#x2019;s local dataset to 10, 20, 30, 40, and 50 samples, respectively. For comparison, the Centralized SFT baseline was trained on the corresponding aggregated sample sets, ranging from 40 to 200 total samples.</p>
<p>The experimental results are shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. As can be seen, the performance of our proposed method is consistently higher than that of all baseline methods across all tested model scales and sample size configurations. This advantage is particularly pronounced in scenarios where data is extremely limited. For instance, on the Qwen2.5-3B model with only 10 samples per client, our method achieves 71.19% accuracy, while SplitLoRA and FedAvg reach 63.61% and 64.90%, respectively. SplitFrozen, despite its improved stability, achieves 63.88%, and Centralized SFT trained on 40 samples yields only 56.10%. As the number of samples increases, the performance of all methods shows an upward trend, but the gap between our method and the baselines remains clear. This indicates that our framework achieves higher data utilization efficiency. This advantage stems from the natural language fine-tuning paradigm, which uses high-level instructions to guide the model in leveraging pre-trained knowledge for task understanding, reducing reliance on large-scale labeled data. These results demonstrate that the proposed framework is well-suited for edge scenarios with constrained data resources.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Impact of different numbers of training samples on the accuracy of fine-tuning methods on the GSM8K dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-4.tif"/>
</fig>
</sec>
<sec id="s4_3_4">
<label>4.3.4</label>
<title>Analysis of the Impact of Varying the Number of Clients</title>
<p>To evaluate the scalability and robustness of our framework under different distributed configurations, we conducted an experiment by varying the number of clients from 1 to 5 while keeping the total training data constant. This simulates realistic federated learning scenarios where data becomes increasingly decentralized.</p>
<p>The results in <xref ref-type="fig" rid="fig-5">Fig. 5</xref> show that our method consistently outperforms all baselines across model scales and client counts. On the Qwen2.5-3B model, our accuracy remains above 71% across all settings, reaching up to 76.36% with 5 clients. In contrast, methods such as SplitLoRA and SL exhibit noticeable performance drops as the number of clients increases. SplitFrozen shows better stability but still trails behind our method in overall accuracy.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Impact of different numbers of clients on the accuracy of fine-tuning methods on the GSM8K dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-5.tif"/>
</fig>
<p>This robustness can be attributed to our framework&#x2019;s dual-path supervision and periodic adapter aggregation. By learning from both base and reference reasoning chains, the model captures core reasoning patterns and reduces reliance on any single client&#x2019;s data distribution. The aggregation mechanism further enables mutual learning among clients, enhancing generalization in decentralized environments.</p>
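The periodic adapter aggregation step referred to above can be sketched as a FedAvg-style weighted average over client adapters. The flat dict-of-lists parameter layout is a deliberate simplification of real LoRA weight tensors:

```python
def aggregate_adapters(client_adapters, sample_counts=None):
    """FedAvg-style merge of client-side adapter parameters: a per-parameter
    average across clients, optionally weighted by local sample count.
    (Sketch of the periodic aggregation step; the parameter layout is a
    hypothetical simplification.)"""
    counts = sample_counts or [1] * len(client_adapters)
    total = float(sum(counts))
    return {
        name: [
            sum(c * adapter[name][i]
                for c, adapter in zip(counts, client_adapters)) / total
            for i in range(len(client_adapters[0][name]))
        ]
        for name in client_adapters[0]
    }

# Two clients with equal sample counts: simple element-wise mean.
merged = aggregate_adapters(
    [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 4.0]}]
)
```

Because only these small adapter parameters are exchanged, each client benefits from the others' progress without ever sharing raw data.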
</sec>
<sec id="s4_3_5">
<label>4.3.5</label>
<title>Ablation Study</title>
<p>An ablation study was first conducted to validate the architectural components of LESFT. Three configurations were compared, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>: Ours is the complete LESFT SFL framework; the SL Framework applies the paradigm within a pure Split Learning architecture without aggregation; and the FL Framework uses a conventional, non-split Federated Learning architecture. The results show that the full SFL framework consistently outperforms both variants across all model scales. The performance gap over the FL Framework highlights the benefit of the split-based design, while the gap over the SL Framework confirms the critical importance of federated aggregation. Together, these results validate that the synergy of split learning and federation is a key factor in LESFT&#x2019;s effectiveness.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Ablation study on the architectural framework. LESFT (SFL) consistently outperforms both SL framework and FL framework variants across different model scales</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-6.tif"/>
</fig>
<p>The analysis next isolates the contribution of the core dual-path mechanism. As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the full LESFT model (Ours) is compared against two single-path variants: Base Input Only, which trains on the standard input alone, and Reference Input Only, which trains on the high-quality reference input alone. The results on the Qwen2.5-3B model are particularly illustrative: Base Input Only achieves 60.16% accuracy, and using the high-quality reference input improves this to 68.46%, yet the full dual-path model significantly outperforms both, reaching 76.04%. This finding is critical. It demonstrates that the performance gain does not come from merely using a higher-quality input; rather, the contrastive signal generated by the interaction between the two logically equivalent paths is essential, guiding the model to learn generalizable reasoning patterns and achieve superior data efficiency.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Ablation study on the dual-path mechanism. LESFT consistently outperforms both single-path variants using only the base input or the reference input across different model scales</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_74034-fig-7.tif"/>
</fig>
<p>The analysis next examines the effect of the token-weighting mechanism. As shown in <xref ref-type="table" rid="table-3">Table 3</xref>, the proposed adaptive weighting strategy (Ours) consistently achieves the highest accuracy across all model scales, whereas assigning a uniform value such as 0 leads to a clear performance drop: accuracy decreases from 40.94% to 32.07% for Qwen2.5-0.5B, from 61.49% to 45.64% for Qwen2.5-1.5B, and from 76.04% to 51.86% for Qwen2.5-3B. These results indicate that removing token-level differentiation severely limits reasoning adaptation. The sensitivity analysis further shows that moderate weighting values between 0.25 and 0.75 yield more stable results than extreme values of 1.5 or greater, yet none of these fixed settings surpasses the adaptive mechanism. This confirms that the proposed token-weighting design is essential for maximizing data efficiency and reasoning accuracy in LESFT.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Ablation study on the token-weighting mechanism and hyperparameter sensitivity. We compare LESFT against a variant with no weighting (0) and other weighting hyperparameter values on the GSM8K dataset. The best result is marked in bold</title>
</caption>
<table>
<colgroup>
<col align="center" width="30mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/>
<col align="center" width="11mm"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>Ours</th>
<th>0</th>
<th>0.25</th>
<th>0.5</th>
<th>0.75</th>
<th>1</th>
<th>1.5</th>
<th>2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen2.5-0.5B</td>
<td><bold>40.94</bold></td>
<td>32.07</td>
<td>36.16</td>
<td>35.63</td>
<td>35.63</td>
<td>34.95</td>
<td>35.10</td>
<td>34.80</td>
</tr>
<tr>
<td>Qwen2.5-1.5B</td>
<td><bold>61.49</bold></td>
<td>45.64</td>
<td>54.97</td>
<td>55.04</td>
<td>55.04</td>
<td>54.97</td>
<td>55.68</td>
<td>54.44</td>
</tr>
<tr>
<td>Qwen2.5-3B</td>
<td><bold>76.04</bold></td>
<td>51.86</td>
<td>68.39</td>
<td>68.61</td>
<td>67.10</td>
<td>67.93</td>
<td>67.78</td>
<td>68.01</td>
</tr>
</tbody>
</table>
</table-wrap>
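One possible reading of the fixed-weight variants in Table 3 is that critical tokens receive a constant weight from the 0 to 2 grid while the remaining tokens keep unit weight; the mask and values below are hypothetical illustrations, not the paper's exact implementation:

```python
def ablation_token_loss(per_token_losses, critical_mask, critical_weight):
    """Fixed-weight variant of the token-weighting ablation: tokens flagged
    as critical get one constant weight from the 0-2 grid instead of an
    adaptive value, while other tokens keep unit weight. With weight 0 the
    critical logical tokens are silenced entirely, which mirrors the sharp
    accuracy drop in Table 3. (Hypothetical interpretation for illustration.)"""
    weights = [critical_weight if m else 1.0 for m in critical_mask]
    return sum(w * l for w, l in zip(weights, per_token_losses)) / len(per_token_losses)
```

Under this reading, weight 0 removes the contribution of exactly the tokens the mechanism deems most informative, while large constants over-amplify them; only adaptive, per-token values avoid both failure modes.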
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>LESFT emphasizes the token as the fundamental unit of large language models and leverages the dual-path mechanism to highlight differences between base and reference reasoning chains. By contrasting these two paths, the framework guides the model to identify and learn the most informative tokens, thereby improving credit assignment and enabling generalizable reasoning patterns under few-shot conditions. This fine-grained supervision directly addresses the data inefficiency of conventional approaches and explains the consistent gains observed across diverse tasks and model scales.</p>
<p>In addition, the split federated design adapts naturally to edge environments. The model is partitioned so that computation and storage burdens are minimized on clients, while privacy is preserved by keeping raw data local. Periodic aggregation of lightweight adapters allows edge models to exchange knowledge without exposing sensitive information. This mechanism enables each client to benefit from the progress of others, achieving collective improvement across heterogeneous and potentially non-IID data distributions.</p>
<p>The proposed framework is particularly well-suited for reasoning-intensive tasks, where intermediate steps and token-level supervision are critical. Examples include mathematical problem solving, commonsense reasoning, and multi-step question answering. In these domains, the dual-path contrast provides richer signals than outcome-only supervision, allowing the model to capture logical dependencies more effectively. Beyond reasoning tasks, LESFT can also adapt to structured prediction problems such as code generation or symbolic manipulation, where token-level granularity plays a decisive role. Its reliance on modular adapters and federated aggregation further ensures applicability across diverse model scales and heterogeneous client data, making it a general solution for edge deployment in both reasoning-centric and structured learning scenarios.</p>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>This paper addresses the incompatibility between efficient SFL architectures and data-inefficient SFT, a key challenge that creates prohibitive communication bottlenecks for LLMs on edge devices. We introduce the LESFT framework, a novel paradigm designed to co-optimize computation, communication, and data efficiency. The core of LESFT is a contrastive-inspired fine-tuning method that uses logically consistent yet diversely expressed reasoning chains to provide a robust supervision signal. This design compels the model to shift from memorization toward generalizable reasoning, which significantly improves data efficiency and directly translates to reduced communication overhead.</p>
<p>Our extensive experiments across three diverse reasoning benchmarks, GSM8K, CommonsenseQA, and AQUA_RAT, validate that LESFT substantially outperforms all state-of-the-art baselines, including SplitLoRA and SplitFrozen. The framework&#x2019;s superiority stems from its ability to focus on critical logical tokens, allowing it to stably activate and calibrate pre-trained reasoning capabilities with limited data. Our ablation studies further demonstrate the framework&#x2019;s robustness, showing its performance is stable against variations in network split points and data distribution, highlighting its practical applicability in diverse edge environments.</p>
<p>Future work can extend LESFT in several directions. Enhancing its robustness to non-IID data is critical for complex federated scenarios. Furthermore, developing adaptive split-point mechanisms and integrating lightweight privacy-preserving techniques could enable more intelligent and secure deployments. Combining LESFT with retrieval-augmented generation also offers a path to overcome static knowledge limitations in specialized domains. In conclusion, this work not only provides a validated framework for efficient edge fine-tuning but also demonstrates a promising direction for overcoming resource constraints through the co-design of learning paradigms and system architectures.</p>
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62276109. The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through the Research Group Project number (ORF-2025-585).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: Conceptualization, Zuyi Huang, Yue Wang; Methodology, Zuyi Huang; Software, Zuyi Huang, Yue Wang; Validation, Zuyi Huang, Jia Liu, Haodong Yi; Formal analysis, Zuyi Huang; Investigation, Zuyi Huang, Haodong Yi, Lejun Ai; Writing&#x2014;original draft preparation, Zuyi Huang, Yue Wang, Haodong Yi; Data curation, Yue Wang; Visualization, Zuyi Huang; Resources, Jia Liu, Salman A. AlQahtani, Min Chen; Writing&#x2014;review and editing, Zuyi Huang, Yue Wang, Haodong Yi, Min Chen; Supervision, Min Chen; Project administration, Min Chen; Funding acquisition, Salman A. AlQahtani, Min Chen. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Data supporting the results presented in this article are available from the corresponding author upon reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>B</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name></person-group>. <article-title>A review on edge large language models: design, execution, and applications</article-title>. <source>ACM Comput Surv</source>. <year>2025</year>;<volume>57</volume>(<issue>8</issue>):<fpage>1</fpage>&#x2013;<lpage>35</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3719664</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Sui</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Feasibility study of edge computing empowered by artificial intelligence&#x2014;a quantitative analysis based on large models</article-title>. <source>Big Data Cogn Comput</source>. <year>2024</year>;<volume>8</volume>(<issue>8</issue>):<fpage>94</fpage>. doi:<pub-id pub-id-type="doi">10.3390/bdcc8080094</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Adaptive joint configuration optimization for collaborative inference in edge-cloud systems</article-title>. <source>Sci China Inf Sci</source>. <year>2024</year>;<volume>67</volume>(<issue>4</issue>):<fpage>149103</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s11432-023-3957-4</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xiao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Ling</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Alqahtani</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Mvpoa: a learning-based vehicle proposal offloading for cloud-edge-vehicle networks</article-title>. <source>IEEE Internet Things J</source>. <year>2025</year>;<volume>12</volume>(<issue>5</issue>):<fpage>4738</fpage>&#x2013;<lpage>49</lpage>. doi:<pub-id pub-id-type="doi">10.1109/JIOT.2024.3524469</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Large language models empowered autonomous edge AI for connected intelligence</article-title>. <source>IEEE Commun Mag</source>. <year>2024</year>;<volume>62</volume>(<issue>10</issue>):<fpage>140</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/MCOM.001.2300550</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xiao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Song</surname> <given-names>HH</given-names></string-name></person-group>. <article-title>GraphEdge: dynamic graph partition and task scheduling for GNNs computing in edge network</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>124</volume>(<issue>5</issue>):<fpage>103329</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2025.103329</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Han</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Edge-cloud collaborative motion planning for autonomous driving with large language models</article-title>. In: <conf-name>2024 IEEE 24th International Conference on Communication Technology (ICCT); 2024 Oct 18&#x2013;20</conf-name>; <publisher-loc>Chengdu, China</publisher-loc>. p. <fpage>185</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCT62411.2024.10946488</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nguyen</surname> <given-names>HTT</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>LPT</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>H</given-names></string-name></person-group>. <article-title>XEdgeAI: a human-centered industrial inspection framework with data-centric Explainable Edge AI approach</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>116</volume>(<issue>1</issue>):<fpage>102782</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2024.102782</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shakhadri</surname> <given-names>SAG</given-names></string-name>, <string-name><surname>Kruthika</surname> <given-names>KR</given-names></string-name>, <string-name><surname>Aralimatti</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Shakti: a 2.5 billion parameter small language model optimized for edge ai and low-resource environments</article-title>. In: <conf-name>IFIP International Conference on Artificial Intelligence Applications and Innovations</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2025</year>. p. <fpage>434</fpage>&#x2013;<lpage>47</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-96231-8_32</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Lightweight multiattention recursive residual CNN-based in-loop filter driven by neuron diversity</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2023</year>;<volume>33</volume>(<issue>11</issue>):<fpage>6996</fpage>&#x2013;<lpage>7008</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TCSVT.2023.3270729</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Brown</surname> <given-names>T</given-names></string-name>, <string-name><surname>Mann</surname> <given-names>B</given-names></string-name>, <string-name><surname>Ryder</surname> <given-names>N</given-names></string-name>, <string-name><surname>Subbiah</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kaplan</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Dhariwal</surname> <given-names>P</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Language models are few-shot learners</article-title>. <source>Adv Neur Inform Process Syst</source>. <year>2020</year>;<volume>33</volume>:<fpage>1877</fpage>&#x2013;<lpage>901</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lavril</surname> <given-names>T</given-names></string-name>, <string-name><surname>Izacard</surname> <given-names>G</given-names></string-name>, <string-name><surname>Martinet</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lachaux</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Lacroix</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Llama: open and efficient foundation language models</article-title>. <comment>arXiv:2302.13971. 2023</comment>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Thapa</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chamikara</surname> <given-names>MAP</given-names></string-name>, <string-name><surname>Camtepe</surname> <given-names>SA</given-names></string-name></person-group>. <article-title>Advancements of federated learning towards privacy preservation: from federated learning to split learning</article-title>. In: <conf-name>Federated learning systems: towards next-generation AI</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2021</year>. p. <fpage>79</fpage>&#x2013;<lpage>109</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-70604-3_4</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Qu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Mobile edge intelligence for large language models: a contemporary survey</article-title>. <source>IEEE Commun Surv Tutorials</source>. <year>2025</year>;<volume>27</volume>(<issue>6</issue>):<fpage>3820</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/COMST.2025.3527641</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Splitlora: a split parameter-efficient fine-tuning framework for large language models</article-title>. <comment>arXiv:2407.00952. 2024</comment>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ouyang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Almeida</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wainwright</surname> <given-names>C</given-names></string-name>, <string-name><surname>Mishkin</surname> <given-names>P</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Training language models to follow instructions with human feedback</article-title>. <source>Adv Neural Inform Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>27730</fpage>&#x2013;<lpage>44</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bosma</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>V</given-names></string-name>, <string-name><surname>Guu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>AW</given-names></string-name>, <string-name><surname>Lester</surname> <given-names>B</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Finetuned language models are zero-shot learners</article-title>. In: <conf-name>Proceedings of the International Conference on Learning Representations; 2022 Apr 25&#x2013;29</conf-name>; <publisher-loc>Virtual</publisher-loc>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>XK</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Llm fine-tuning: concepts, opportunities, and challenges</article-title>. <source>Big Data Cogn Comput</source>. <year>2025</year>;<volume>9</volume>(<issue>4</issue>):<fpage>87</fpage>. doi:<pub-id pub-id-type="doi">10.3390/bdcc9040087</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhong</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Du</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Achieving &#x003E;97% on gsm8k: deeply understanding the problems makes llms better solvers for math word problems</article-title>. <source>Front Comput Sci</source>. <year>2026</year>;<volume>20</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>3</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11704-025-41102-z</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Talmor</surname> <given-names>A</given-names></string-name>, <string-name><surname>Herzig</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lourie</surname> <given-names>N</given-names></string-name>, <string-name><surname>Berant</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Commonsenseqa: a question answering challenge targeting commonsense knowledge</article-title>. In: <conf-name>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019 Jun 2&#x2013;7</conf-name>; <publisher-loc>Minneapolis, MN, USA</publisher-loc>. p. <fpage>4149</fpage>&#x2013;<lpage>58</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/N19-1421</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ling</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yogatama</surname> <given-names>D</given-names></string-name>, <string-name><surname>Dyer</surname> <given-names>C</given-names></string-name>, <string-name><surname>Blunsom</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Program induction by rationale generation: learning to solve and explain algebraic word problems</article-title>. In: <conf-name>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2017 Jul 30&#x2013;Aug 4</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>158</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/P17-1015</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>BS</given-names></string-name>, <string-name><surname>Hui</surname> <given-names>BY</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>B</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>BW</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Qwen2 technical report</article-title>. <comment>arXiv:2407.10671. 2024</comment>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>SQ</given-names></string-name></person-group>. <article-title>Parameter-efficient fine-tuning for large models: a comprehensive survey</article-title>. <source>Trans Mach Learn Res</source>. <comment>arXiv:2403.14608. 2024</comment>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ding</surname> <given-names>N</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Su</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Parameter-efficient fine-tuning of large-scale pre-trained language models</article-title>. <source>Nat Mach Intell</source>. <year>2023</year>;<volume>5</volume>(<issue>3</issue>):<fpage>220</fpage>&#x2013;<lpage>35</lpage>. doi:<pub-id pub-id-type="doi">10.1038/s42256-023-00626-4</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lim</surname> <given-names>EP</given-names></string-name>, <string-name><surname>Bing</surname> <given-names>L</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Llm-adapters: an adapter family for parameter-efficient fine-tuning of large language models</article-title>. In: <conf-name>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023 Dec 6&#x2013;10</conf-name>; <publisher-loc>Singapore</publisher-loc>. p. <fpage>5254</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2023.emnlp-main.319</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tolba</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Evolution and prospects of foundation models: from large language models to large multimodal models</article-title>. <source>Comput Mater Contin</source>. <year>2024</year>;<volume>80</volume>(<issue>2</issue>):<fpage>1753</fpage>&#x2013;<lpage>808</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.052618</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>XL</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Prefix-tuning: optimizing continuous prompts for generation</article-title>. In: <conf-name>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); 2021 Aug 1&#x2013;6</conf-name>; <publisher-loc>Virtual</publisher-loc>. p. <fpage>4582</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2021.acl-long.353</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zaken</surname> <given-names>EB</given-names></string-name>, <string-name><surname>Goldberg</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ravfogel</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Bitfit: simple parameter-efficient fine-tuning for transformer-based masked language-models</article-title>. In: <conf-name>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 2022 May 22&#x2013;27</conf-name>; <publisher-loc>Dublin, Ireland</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2022.acl-short.1</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>EJ</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wallis</surname> <given-names>P</given-names></string-name>, <string-name><surname>Allen-Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Lora: low-rank adaptation of large language models</article-title>. In: <conf-name>Proceedings of the Tenth International Conference on Learning Representations; 2022 Apr 25&#x2013;29</conf-name>; <publisher-loc>Virtual</publisher-loc>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Sung</surname> <given-names>YL</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>J</given-names></string-name>, <string-name><surname>Bansal</surname> <given-names>M</given-names></string-name></person-group>. <chapter-title>Lst: ladder side-tuning for parameter and memory efficient transfer learning</chapter-title>. In: <source>Advances in neural information processing systems</source>. Vol. <volume>35</volume>. <publisher-loc>Cambridge, MA, USA</publisher-loc>: <publisher-name>MIT Press</publisher-name>; <year>2022</year>. p. <fpage>12991</fpage>&#x2013;<lpage>3005</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jin</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zong</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Parameter-efficient tuning for large language model without calculating its gradients</article-title>. In: <conf-name>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023 Dec 6&#x2013;10</conf-name>; <publisher-loc>Singapore</publisher-loc>. p. <fpage>321</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2023.emnlp-main.22</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kuang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>B</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>D</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>D</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Federatedscope-llm: a comprehensive package for fine-tuning large language models in federated learning</article-title>. In: <conf-name>Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; 2024 Aug 25&#x2013;29</conf-name>; <publisher-loc>Barcelona, Spain</publisher-loc>. p. <fpage>5260</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3637528.3671573</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lyu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>The internet of things under federated learning: a review of the latest advances and applications</article-title>. <source>Comput Mater Contin</source>. <year>2025</year>;<volume>82</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>39</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.058926</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Guizani</surname> <given-names>M</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>CPFedAvg: enhancing hierarchical federated learning via optimized local aggregation and parameter mixing</article-title>. <source>IEEE Trans Netw</source>. <year>2025</year>;<volume>33</volume>(<issue>3</issue>):<fpage>1160</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TON.2025.3526866</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Federated large language model: solutions, challenges and future directions</article-title>. <source>IEEE Wirel Commun</source>. <year>2025</year>;<volume>32</volume>(<issue>4</issue>):<fpage>82</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1109/MWC.009.2400244</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Long</surname> <given-names>G</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Federated low-rank adaptation for foundation models: a survey</article-title>. In: <conf-name>Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI &#x2019;25; 2025 Aug 18&#x2013;22</conf-name>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. doi:<pub-id pub-id-type="doi">10.24963/ijcai.2025/1196</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Basubeit</surname> <given-names>O</given-names></string-name>, <string-name><surname>Alenazi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Radovi&#x010D;</surname> <given-names>B</given-names></string-name>, <string-name><surname>Milibari</surname> <given-names>A</given-names></string-name>, <string-name><surname>Canini</surname> <given-names>M</given-names></string-name>, <string-name><surname>Khayyat</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>LLM optimization without data sharing: a split learning paradigm</article-title>. In: <conf-name>Proceedings of the 2025 International Conference on Metaverse Computing, Networking and Applications (MetaCom); 2025 Aug 27&#x2013;29; Seoul, Republic of Korea</conf-name>. p. <fpage>75</fpage>&#x2013;<lpage>81</lpage>. doi:<pub-id pub-id-type="doi">10.1109/MetaCom65502.2025.00019</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ghosh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Evuru</surname> <given-names>CKR</given-names></string-name>, <string-name><surname>Kumar</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ramaneswaran</surname> <given-names>S</given-names></string-name>, <string-name><surname>Aneja</surname> <given-names>D</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A closer look at the limitations of instruction tuning</article-title>. In: <conf-name>Proceedings of the Forty-first International Conference on Machine Learning; 2024 Jul 21&#x2013;27</conf-name>; <publisher-loc>Vienna, Austria</publisher-loc>. p. <fpage>15559</fpage>&#x2013;<lpage>89</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lightman</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kosaraju</surname> <given-names>V</given-names></string-name>, <string-name><surname>Burda</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Edwards</surname> <given-names>H</given-names></string-name>, <string-name><surname>Baker</surname> <given-names>B</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Let&#x2019;s verify step by step</article-title>. In: <conf-name>Proceedings of the Twelfth International Conference on Learning Representations; 2024 May 7&#x2013;11</conf-name>; <publisher-loc>Vienna, Austria</publisher-loc>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Di</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Token cleaning: fine-grained data selection for LLM supervised fine-tuning</article-title>. In: <conf-name>Proceedings of the Forty-second International Conference on Machine Learning; 2025 Jul 13&#x2013;19</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Natural language fine-tuning</article-title>. <comment>arXiv:2412.20382. 2024</comment>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rafailov</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mitchell</surname> <given-names>E</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>CD</given-names></string-name>, <string-name><surname>Ermon</surname> <given-names>S</given-names></string-name>, <string-name><surname>Finn</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Direct preference optimization: your language model is secretly a reward model</article-title>. <source>Adv Neural Inform Process Syst</source>. <year>2023</year>;<volume>36</volume>:<fpage>53728</fpage>&#x2013;<lpage>41</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Moradi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Colwell</surname> <given-names>D</given-names></string-name>, <string-name><surname>Samwald</surname> <given-names>M</given-names></string-name>, <string-name><surname>Asgari</surname> <given-names>R</given-names></string-name></person-group>. <article-title>A critical review of methods and challenges in large language models</article-title>. <source>Comput Mater Contin</source>. <year>2025</year>;<volume>82</volume>(<issue>2</issue>):<fpage>1681</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2025.061263</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zeng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Token-level direct preference optimization</article-title>. In: <conf-name>Proceedings of the Forty-first International Conference on Machine Learning; 2024 Jul 21&#x2013;27</conf-name>; <publisher-loc>Vienna, Austria</publisher-loc>. p. <fpage>58348</fpage>&#x2013;<lpage>65</lpage>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhong</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>DPO meets PPO: reinforced token optimization for RLHF</article-title>. In: <conf-name>Proceedings of the Forty-second International Conference on Machine Learning; 2025 Jul 13&#x2013;19</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Schuurmans</surname> <given-names>D</given-names></string-name>, <string-name><surname>Bosma</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>F</given-names></string-name>, <string-name><surname>Chi</surname> <given-names>E</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>. <source>Adv Neural Inform Process Syst</source>. <year>2022</year>;<volume>35</volume>:<fpage>24824</fpage>&#x2013;<lpage>37</lpage>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lyu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>X</given-names></string-name></person-group>. <article-title>SplitFrozen: split learning with device-side model frozen for fine-tuning LLM on heterogeneous resource-constrained devices</article-title>. <comment>arXiv:2503.18986. 2025</comment>.</mixed-citation></ref>
</ref-list>
</back></article>