<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">66540</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.066540</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>VMHPE: Human Pose Estimation for Virtual Maintenance Tasks</article-title>
<alt-title alt-title-type="left-running-head">VMHPE: Human Pose Estimation for Virtual Maintenance Tasks</alt-title>
<alt-title alt-title-type="right-running-head">VMHPE: Human Pose Estimation for Virtual Maintenance Tasks</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Shuo</given-names></name></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>He</surname><given-names>Hanwu</given-names></name></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wu</surname><given-names>Yueming</given-names></name><email>wuyueming@gdut.edu.cn</email></contrib>
<aff id="aff-1"><institution>School of Electromechanical Engineering, Guangdong University of Technology, Guangzhou Higher Education Mega Center</institution>, <addr-line>Guangzhou, 510006</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yueming Wu. Email: <email>wuyueming@gdut.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>29</day><month>08</month><year>2025</year>
</pub-date>
<volume>85</volume>
<issue>1</issue>
<fpage>801</fpage>
<lpage>826</lpage>
<history>
<date date-type="received">
<day>10</day>
<month>4</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>6</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_66540.pdf"></self-uri>
<abstract>
<p>Virtual maintenance, as an important means of industrial training and education, places strict requirements on the accuracy of participant pose perception and the assessment of motion standardization. However, existing research mainly focuses on human pose estimation in general scenarios and lacks specialized solutions for maintenance scenarios. This paper proposes a virtual maintenance human pose estimation method based on multi-scale feature enhancement (VMHPE), which integrates adaptive input feature enhancement, multi-scale feature correction for improved expression of fine movements and complex poses, and multi-scale feature fusion to enhance keypoint localization accuracy. In addition, this study constructs the first virtual maintenance-specific human keypoint dataset (VMHKP), which records standard action sequences of professional maintenance personnel across five typical maintenance tasks and provides a reliable benchmark for evaluating operator motion standardization. The dataset is publicly available at <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.15525037">https://doi.org/10.5281/zenodo.15525037</ext-link>. Based on the high-precision keypoint predictions, we established an action assessment system that uses topological structure similarity. Experiments show that our method achieves significant performance improvements: average precision (AP) reaches 94.4%, an increase of 2.3 percentage points over baseline methods, and average recall (AR) reaches 95.6%, an increase of 1.3 percentage points. This research establishes a scientific four-level evaluation standard based on comparative motion analysis and provides a reliable solution for standardizing industrial maintenance training.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Virtual maintenance</kwd>
<kwd>human pose estimation</kwd>
<kwd>multi-scale feature fusion</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Multi-Person Collaborative Virtual Assembly/Disassembly Training</funding-source>
<award-id>23HK0101</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Virtual maintenance is an innovative approach that utilizes Virtual Reality (VR) or Augmented Reality (AR) technologies to simulate real maintenance environments and processes. It enables users to perform equipment maintenance, fault diagnosis, and operational training in virtual environments without interacting with actual equipment, thereby reducing training costs and safety risks [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>]. By providing immersive, interactive learning experiences, virtual maintenance technology significantly enhances maintenance personnel&#x2019;s skill levels and work efficiency [<xref ref-type="bibr" rid="ref-4">4</xref>].</p>
<p>As a significant application of VR and AR, virtual maintenance technology is becoming a research hotspot in industries such as industrial manufacturing, healthcare, and education [<xref ref-type="bibr" rid="ref-5">5</xref>]. This technology constructs virtual maintenance environments that enable real-time interaction and operation among multiple users, thereby significantly improving maintenance task efficiency and safety [<xref ref-type="bibr" rid="ref-6">6</xref>]. However, achieving high-quality virtual maintenance faces a key challenge: how to accurately perceive and precisely reproduce participants&#x2019; movements [<xref ref-type="bibr" rid="ref-7">7</xref>]. In virtual maintenance, high-precision reproduction of human poses is crucial. Not only does it provide authentic interactive experiences, but more importantly, it can be used to correct improper operational behaviors and guide participants to perform correct maintenance actions [<xref ref-type="bibr" rid="ref-8">8</xref>]. For example, in complex mechanical maintenance tasks, the system can detect and correct incorrect operating postures in real-time by comparing the poses of experts and learners, thereby avoiding potential safety hazards and improving maintenance quality [<xref ref-type="bibr" rid="ref-9">9</xref>]. Human keypoint detection is the first and most critical step in achieving high-precision pose reproduction. In the field of general human keypoint detection, numerous studies have achieved significant results. For instance, OpenPose proposed by Cao et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] and HRNet by Sun et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] have shown excellent performance in multi-person pose estimation. Although human pose estimation is currently a research hotspot, existing studies mostly focus on general scenarios, lacking specialized research for maintenance tasks [<xref ref-type="bibr" rid="ref-12">12</xref>]. 
Furthermore, the field also lacks relevant specialized datasets, which limits further research progress.</p>
<p>To address these issues, this paper proposes a multi-scale feature enhancement-based human pose estimation method for virtual maintenance, VMHPE. This method significantly improves pose estimation performance in complex maintenance scenarios by combining multi-scale feature correction with fused attention mechanisms. To support related research, we have constructed and open-sourced the virtual maintenance-specific dataset VMHKP, which covers five typical maintenance task scenarios and adopts the standard Common Objects in Context (COCO) keypoint annotation scheme, thus providing a reliable evaluation benchmark for research in this field. Through systematic experimental validation, our method demonstrates significant performance advantages in virtual maintenance scenarios, achieving notable improvements in key metrics compared to existing methods. This provides strong support for the further development of virtual maintenance technology.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Theoretical Background and Problem Analysis</title>
<sec id="s2_1">
<label>2.1</label>
<title>Human Keypoint Datasets</title>
<p>Human pose estimation is the crucial first step in virtual maintenance, where the accuracy of operators&#x2019; movements directly affects training effectiveness and operation quality. Accurate detection and correction of postures requires high-precision human keypoint detection technology supported by large-scale annotated datasets. In this field, researchers have developed several influential datasets including the representative COCO dataset [<xref ref-type="bibr" rid="ref-13">13</xref>], which covers various tasks such as object segmentation and human keypoint annotation. As research progressed, enhanced datasets emerged including COCO-WholeBody [<xref ref-type="bibr" rid="ref-14">14</xref>] with 133 keypoints, Halpe dataset [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>] with improved keypoint quantity and annotation format, DensePose-Posetrack dataset [<xref ref-type="bibr" rid="ref-17">17</xref>] focusing on multi-person video sequences, DAVIS dataset [<xref ref-type="bibr" rid="ref-18">18</xref>] containing challenging poses, and EPIC-KITCHENS dataset [<xref ref-type="bibr" rid="ref-19">19</xref>] with egocentric perspectives.</p>
<p>However, these public datasets show apparent limitations in the virtual maintenance domain. First, virtual maintenance tasks typically involve specific operation poses and tool usage inadequately represented in general datasets. Second, maintenance tasks may require operators to adopt uncommon or complex poses, which have limited representation in existing datasets. Additionally, virtual maintenance environments may include special lighting conditions, occlusions, and backgrounds not fully considered in general datasets. Most critically, the attention requirements for specific keypoints in maintenance tasks differ significantly from the annotation focus of general datasets. Based on this analysis, constructing a dataset specifically for virtual maintenance scenarios becomes essential to provide samples closely aligned with actual application scenarios, cover key poses and actions in specific tasks, and ultimately improve the accuracy and training effectiveness of virtual maintenance systems. Such a specialized dataset would enable more precise detection of operators&#x2019; actions and provide timely correction and guidance.</p>
<p>Beyond dataset support, algorithm design and optimization are equally crucial for improving human pose estimation performance in virtual maintenance. While deep learning has brought revolutionary progress to this field, existing algorithms still face unique challenges in specific domains like virtual maintenance. The following section will analyze the development history and technical characteristics of existing algorithms, as well as their potential and limitations in virtual maintenance scenarios.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Human Keypoint Detection</title>
<p>The development of deep learning technology has revolutionized human pose estimation, with DeepPose [<xref ref-type="bibr" rid="ref-20">20</xref>] by Toshev and Szegedy pioneering the use of convolutional neural networks for direct regression of human joint positions. Subsequent research has evolved along two major technical approaches: bottom-up and top-down methods.</p>
<p>Bottom-up methods have shown significant progress with OpenPose [<xref ref-type="bibr" rid="ref-10">10</xref>] by Cao et al. achieving efficient multi-person pose estimation through multi-stage Convolutional Neural Network (CNN) and Part Affinity Fields. This approach was further developed in HigherHRNet [<xref ref-type="bibr" rid="ref-21">21</xref>] by Cheng et al., which improved accuracy through multi-resolution feature fusion, and Jin et al.&#x2019;s Whole-Body Human Pose Estimation [<xref ref-type="bibr" rid="ref-14">14</xref>] that expanded to simultaneous detection of body, face, and hand keypoints. Concurrently, top-down methods have achieved remarkable results, with HRNet [<xref ref-type="bibr" rid="ref-11">11</xref>] by Sun et al. maintaining high-resolution feature maps for precise pose estimation. This approach was enhanced in HRNetV2-W48 within MMPose for comprehensive detection of body, hand, and facial keypoints, while AlphaPose [<xref ref-type="bibr" rid="ref-22">22</xref>] incorporated object tracking capabilities along with efficient pose estimation. Recent innovations have further advanced keypoint detection accuracy and robustness. ViTPose [<xref ref-type="bibr" rid="ref-23">23</xref>] by Xu et al. leveraged Vision Transformer structures, FCPose [<xref ref-type="bibr" rid="ref-24">24</xref>] by Zeng et al. enhanced efficiency using fully convolutional networks, and Sun et al.&#x2019;s SimCC model [<xref ref-type="bibr" rid="ref-25">25</xref>] improved computational efficiency through one-dimensional vector representation. For temporal stability, SmoothNet [<xref ref-type="bibr" rid="ref-26">26</xref>] by Zeng et al. improved coherence through temporal modeling. 
Specialized solutions have emerged for specific applications, such as MediaPipe Hands [<xref ref-type="bibr" rid="ref-27">27</xref>] and InterHand2.6M [<xref ref-type="bibr" rid="ref-28">28</xref>] for hand pose estimation, and the Real-Time Multi-Object (RTMO) model [<xref ref-type="bibr" rid="ref-29">29</xref>] by Lu et al. for multi-person pose estimation.</p>
<p>Despite these advancements, general-purpose algorithms face significant challenges in virtual maintenance scenarios. First, maintenance tasks often require unconventional poses rarely represented in training datasets. Second, virtual maintenance demands higher precision, as small errors can lead to operational mistakes. Third, real-time performance is essential for effective feedback in training systems. These domain-specific requirements necessitate specialized approaches that can better address the unique characteristics of maintenance operations [<xref ref-type="bibr" rid="ref-30">30</xref>&#x2013;<xref ref-type="bibr" rid="ref-32">32</xref>].</p>
<p>Following this research trajectory, the CID (Contextual Instance Decoupling) [<xref ref-type="bibr" rid="ref-33">33</xref>] architecture has advanced general-purpose pose estimation by decoupling instance features from the global context, yet it shows limitations in specialized domains such as virtual maintenance that demand higher precision. Building upon these advances, this paper proposes the VMHPE model with three key contributions: 1) a multi-scale feature enhancement framework specifically designed for maintenance contexts to capture complex poses and fine movements; 2) two novel modules, MultiscaleFeatureRectifyModule and MultiscaleFusedAttention, that enhance model performance through adaptive feature correction and multi-perspective feature fusion; and 3) the first virtual maintenance-specific human keypoint dataset (VMHKP), which serves as a standardized benchmark. The VMHPE model effectively addresses the unique challenges of virtual maintenance through targeted multi-scale processing strategies, demonstrating significant advantages in both accuracy and robustness for specialized maintenance operations.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Data and Methods</title>
<sec id="s3_1">
<label>3.1</label>
<title>Virtual Maintenance Specialized Dataset (VMHKP) Definition</title>
<sec id="s3_1_1">
<label>3.1.1</label>
<title>Dataset Construction Process</title>
<p>To address the limitations of existing datasets, we constructed the Virtual Maintenance Human Keypoint Dataset (VMHKP) through a systematic three-phase process: task design, data collection, and annotation. The dataset includes 1100 RGB images of 11 technicians performing standardized maintenance procedures. The technicians have different body types (lean, medium, and heavy builds) and wear varied work clothing (lightweight work attire, standard long-sleeve work uniforms, short-sleeve work uniforms, and work uniforms in different colors). This diversity significantly enhances the representativeness of the dataset and effectively reduces potential body-characteristic bias. Data collection employs a Microsoft Azure Kinect Developer Kit (DK) depth camera with an integrated 12-megapixel RGB camera, fixed 2.5 m from the operation area to ensure complete capture of the technicians&#x2019; full-body movement range. To ensure data quality and operational smoothness, one frame is sampled every 10 frames, and the 20 most representative images are carefully selected from each continuous maintenance video sequence, covering the preparation, execution, and completion phases of the operations. The annotation process uses the COCO-Annotator tool, following the COCO dataset&#x2019;s 17-keypoint standard for 2D coordinate annotation. The annotation team consists of 3 researchers who received professional training in keypoint localization under the guidance of medical professionals. A dual verification mechanism is employed: after initial annotation, a second annotator conducts an independent review, and for keypoints with position differences exceeding 5 pixels, consensus is reached through discussion.
During the annotation process, consistency checks are implemented by regularly sampling 10% of the images for cross-annotator evaluation, keeping the average keypoint localization error within 3 pixels. After all annotations are completed, maintenance technicians with over 5 years of experience verify the standardization of the action postures, ensuring that the annotated actions comply with actual maintenance operation standards. The final dataset passes integrity checks, providing high-quality standard data for model training and evaluation and capturing complete action sequences that embody professional characteristics such as wrist-elbow-shoulder coordination during tool operation and trunk-hip-knee support during component transport. These data provide a reliable foundation for establishing maintenance action evaluation standards. Based on an industrial maintenance field investigation, we selected five representative maintenance tasks (<xref ref-type="fig" rid="fig-1">Fig. 1</xref>). Each task was designed according to actual maintenance manuals to cover operational difficulty, tool usage, and body movement range across different scenarios, ensuring comprehensive coverage and a complete evaluation benchmark for algorithm research.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Virtual maintenance task categories</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-1.tif"/>
</fig>
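As an illustration, the dual-verification rule described above (flagging keypoints whose two independent annotations differ by more than 5 pixels) can be sketched as follows; the data layout and function names are assumptions for exposition, not part of the released dataset tooling.

```python
import math

# Threshold from the construction protocol: disagreements above
# 5 px trigger discussion between annotators.
DISAGREEMENT_PX = 5.0

def keypoint_distance(a, b):
    """Euclidean distance between two (x, y) annotations in pixels."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def flag_disagreements(ann_a, ann_b, threshold=DISAGREEMENT_PX):
    """Return indices of keypoints (COCO order, 17 per person) where
    the two annotators differ by more than `threshold` pixels."""
    return [i for i, (p, q) in enumerate(zip(ann_a, ann_b))
            if keypoint_distance(p, q) > threshold]

# Example: annotator B places the left wrist (index 9) 8 px off.
a = [(100.0, 200.0)] * 17
b = list(a)
b[9] = (108.0, 200.0)
print(flag_disagreements(a, b))  # [9]
```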
<p><xref ref-type="fig" rid="fig-2">Figs. 2</xref> and <xref ref-type="fig" rid="fig-3">3</xref> demonstrate typical operation scenarios across various maintenance tasks.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Task action illustrations. (a&#x2013;d) Tool Operation (TO), (e&#x2013;h) Part Flipping (PF), (i&#x2013;l) Component Replacement (CR)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-2.tif"/>
</fig><fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Task action illustrations (Part 2). (a&#x2013;d) Part Carrying (PC), (e&#x2013;h) Component Transfer (CT)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-3.tif"/>
</fig>
<p>This systematic construction process ensures data quality and standardization while providing reliable training and evaluation data for the human pose estimation method proposed in <xref ref-type="sec" rid="s3_2">Section 3.2</xref>.</p>
</sec>
<sec id="s3_1_2">
<label>3.1.2</label>
<title>Dataset Characteristics and Advantages</title>
<p>The VMHKP dataset demonstrates unique advantages in three dimensions: data distribution, pose characteristics, and action continuity. Data are evenly distributed across the five maintenance tasks, with 220 images per category. Each task includes rich operational variations; for example, the tool operation task covers operating poses for 8 maintenance tools, including wrenches, electric screwdrivers, and sockets. To quantitatively analyze maintenance pose characteristics, we defined three key angle parameters (<xref ref-type="fig" rid="fig-4">Fig. 4</xref>): wrist angle, elbow angle, and trunk forward lean angle. The wrist angle is defined as the angle formed by the wrist-elbow-shoulder chain, calculated separately for the left and right wrists to describe the degree of arm bending. The elbow angle adopts a similar definition. The trunk forward lean angle is defined as the angle between the trunk midline and the vertical direction, characterizing the degree of forward body lean.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Schematic diagram of key angle calculation</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-4.tif"/>
</fig>
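The angle definitions above reduce to short vector computations; the sketch below illustrates them under stated assumptions (image coordinates with the y-axis pointing down; point values are illustrative, not from the dataset).

```python
import math

def joint_angle(p1, p2, p3):
    """Angle at p2 (degrees) between segments p2->p1 and p2->p3,
    e.g. the wrist-elbow-shoulder chain used for the elbow angle."""
    v1 = (p1[0] - p2[0], p1[1] - p2[1])
    v2 = (p3[0] - p2[0], p3[1] - p2[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_a = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def trunk_lean(mid_shoulder, mid_hip):
    """Angle (degrees) between the trunk midline (hip->shoulder)
    and the vertical direction; y grows downward in image space."""
    dx = mid_shoulder[0] - mid_hip[0]
    dy = mid_hip[1] - mid_shoulder[1]  # positive when shoulders are above hips
    return math.degrees(math.atan2(abs(dx), dy))

# A right angle at the elbow and a 45-degree forward lean:
print(joint_angle((0, 0), (0, 1), (1, 1)))  # 90.0
print(trunk_lean((1, 0), (0, 1)))           # 45.0
```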
<p>Statistical analysis of these angles revealed significant differences in joint angle distributions across maintenance tasks (<xref ref-type="table" rid="table-1">Table 1</xref>). Component transfer and transport tasks exhibited larger elbow angles (<inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msup><mml:mn>88.3</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula><inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msup><mml:mn>14.2</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msup><mml:mn>102.3</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula><inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msup><mml:mn>15.6</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, respectively), reflecting large-range motion characteristics. Component flip and transport tasks showed larger trunk forward lean angles (both around <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msup><mml:mn>42.7</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>), demonstrating significant forward lean amplitudes. 
Tool operation tasks displayed relatively smaller wrist and elbow angles (approximately <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msup><mml:mn>85</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msup><mml:mn>75.2</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula><inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mn>12.3</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, respectively), indicating that fine operations primarily rely on small-amplitude wrist and elbow adjustments.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Joint angle distribution statistics</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Action type</th>
<th>Wrist angle range (&#x00B0;)</th>
<th>Elbow angle range (&#x00B0;)</th>
<th>Trunk forward lean angle (&#x00B0;)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tool Operation</td>
<td><inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mn>82.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>15.3</mml:mn></mml:math></inline-formula>/<inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mn>85.6</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mn>75.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>12.3</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mn>28.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>7.2</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component Flip</td>
<td><inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mn>88.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>16.2</mml:mn></mml:math></inline-formula>/<inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mn>86.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>15.7</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mn>82.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>13.5</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mn>42.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>9.8</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component Replacement</td>
<td><inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mn>85.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.8</mml:mn></mml:math></inline-formula>/<inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mn>87.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>15.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mn>78.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>12.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mn>35.6</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>8.4</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component Transfer</td>
<td><inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mn>92.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>17.5</mml:mn></mml:math></inline-formula>/<inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mn>94.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>16.9</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mn>88.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mn>25.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>6.5</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component Handling</td>
<td><inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mn>95.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>18.2</mml:mn></mml:math></inline-formula>/<inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mn>98.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>17.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mn>102.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>15.6</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mn>42.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>10.2</mml:mn></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Another important VMHKP characteristic is action continuity. By analyzing keypoint displacement changes between adjacent frames, we quantified the continuity characteristics of maintenance actions (<xref ref-type="table" rid="table-2">Table 2</xref>). Different tasks show distinct motion amplitude differences. Tool operation tasks have relatively small average inter-frame displacements (<inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mn>28.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>9.2</mml:mn></mml:math></inline-formula> pixels), while component transport and flip tasks show notably larger displacements (<inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mn>45.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.2</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mn>42.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>13.5</mml:mn></mml:math></inline-formula> pixels, respectively). Each task type includes complete action phases, with sequence lengths of 12&#x2013;35 frames, meeting the requirements for continuous action analysis; detailed results are given in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Continuity analysis of maintenance action sequences</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Action type</th>
<th align="center">Average inter-frame displacement (Pixels)</th>
<th align="center">Maximum inter-frame displacement (Pixels)</th>
<th align="center">Action duration frames</th>
<th align="center">Typical action sequence</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tool operation</td>
<td><inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mn>28.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>9.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mn>45.3</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mn>15</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mn>20</mml:mn></mml:math></inline-formula></td>
<td>Select-Adjust-Operate-Retract</td>
</tr>
<tr>
<td>Component flip</td>
<td><inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mn>42.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>13.5</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mn>68.7</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mn>18</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mn>25</mml:mn></mml:math></inline-formula></td>
<td>Bend-Grasp-Flip-Inspect</td>
</tr>
<tr>
<td>Component replacement</td>
<td><inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mn>35.6</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>11.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mn>52.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mn>20</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mn>30</mml:mn></mml:math></inline-formula></td>
<td>Disassemble-Replace-Install</td>
</tr>
<tr>
<td>Component transfer</td>
<td><inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mn>38.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>12.1</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mn>58.9</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mn>12</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mn>18</mml:mn></mml:math></inline-formula></td>
<td>Grasp-Transfer-Receive</td>
</tr>
<tr>
<td>Component transport</td>
<td><inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mn>45.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mn>75.6</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mn>25</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mn>35</mml:mn></mml:math></inline-formula></td>
<td>Approach-Lift-Transport-Place</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In summary, VMHKP provides a comprehensive foundation for human pose estimation research in virtual maintenance scenarios through balanced task distribution, rich pose characteristics, and complete action sequences. These characteristics enhance model adaptability to maintenance scenarios and provide an important basis for subsequent algorithm optimization.</p>
</sec>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>VMHPE Model Architecture</title>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Overall Architecture Design</title>
<p>The VMHPE method is specifically designed to enhance pose estimation performance in virtual maintenance scenarios through multi-scale feature processing and adaptive attention mechanisms. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> illustrates the comprehensive architecture of our proposed approach.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Overall architecture design</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-5.tif"/>
</fig>
<p>The framework consists of four integrated modules working in sequence: the backbone network extracts initial features from input images, the Individual Representation Learning (IRL) module generates discriminative individual features, the Multi-scale Feature Correction (MFC) module enhances feature robustness across different scales, and finally the Keypoint Estimation (KE) module produces precise keypoint heatmaps. This sequential design progressively refines features from coarse to fine, enabling accurate pose estimation even under the challenging conditions typical in maintenance operations.</p>
<p>The MFC module performs adaptive correction on the multi-scale feature maps F extracted by the backbone network, guided by the individual representation. This correction is implemented by the Multi-scale Feature Correction Module (MFCM), which enhances feature robustness to pose variations and occlusions in both the spatial and channel dimensions through adaptive scale weights and enhanced feature correction operations.</p>
<p>After obtaining individual feature correction results, VMHPE aggregates global features at different scales through the Multi-scale Feature Fusion Module (MFFM). This module combines cross-attention and channel embedding mechanisms, introducing multi-scale convolution to fuse local and global context information from multiple perspectives. The fused features, together with individual enhanced features, provide input for keypoint estimation.</p>
<p>Finally, the KE module adopts an encoder-decoder structure, comprehensively utilizing individual enhanced features and fused features to generate K keypoint heatmaps for each individual. During training, VMHPE jointly optimizes the entire framework in an end-to-end manner. The loss functions include heatmap loss, contrastive loss, and regularization terms. These are used to optimize keypoint localization accuracy, feature discrimination ability, and model generalization performance.</p>
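<p>The joint objective described above could be assembled as in the following minimal sketch; the weighting coefficients, the InfoNCE form of the contrastive term, and the L2 regularizer are illustrative assumptions, not settings stated in the text:</p>

```python
import torch
import torch.nn.functional as F

def vmhpe_loss(pred_heatmaps, gt_heatmaps, inst_feats, pos_feats, neg_feats,
               model_params, w_contrast=0.1, w_reg=1e-4, tau=0.07):
    # Heatmap loss: per-pixel MSE between predicted and ground-truth heatmaps.
    l_heat = F.mse_loss(pred_heatmaps, gt_heatmaps)

    # Contrastive (InfoNCE-style) loss on individual features: pull each
    # instance toward its positive sample, push it away from negatives.
    q = F.normalize(inst_feats, dim=-1)                  # (B, D)
    k_pos = F.normalize(pos_feats, dim=-1)               # (B, D)
    k_neg = F.normalize(neg_feats, dim=-1)               # (B, N, D)
    pos = (q * k_pos).sum(-1, keepdim=True) / tau        # (B, 1)
    neg = torch.einsum("bd,bnd->bn", q, k_neg) / tau     # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    l_con = F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))

    # L2 regularization term over model parameters.
    l_reg = sum(p.pow(2).sum() for p in model_params)

    return l_heat + w_contrast * l_con + w_reg * l_reg
```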
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Basic Attention Module Design</title>
<p>This section first introduces two core attention modules in the VMHPE model: Channel Attention and Spatial Attention (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>). As basic units for feature enhancement, these two modules achieve adaptive enhancement of input features through different feature interaction methods. In the subsequent multi-scale feature processing, these two modules will be used for correction and fusion of features at different scales.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Basic attention module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-6.tif"/>
</fig>
<p>The Channel Attention module aims to capture dependencies between feature channels and adaptively adjust the importance of each channel. This module takes global features <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mtext>global</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and instance features <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>480</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> as input. Initially, the instance features are mapped to the same channel dimension as the global features through a learnable linear transformation, as shown in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:mtext>atn</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mtext>atn</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> represents the linear transformation matrix. The transformed features are then expanded to <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and broadcast to match the spatial dimensions of the global features. Subsequently, the two features are concatenated along the channel dimension, as described in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>cat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Concat</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>global</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>To extract channel correlations, the module applies both global average pooling and max pooling operations to <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mtext>cat</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, generating two global descriptors:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>avg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>AvgPool</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>cat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>MaxPool</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>cat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>These descriptors are further concatenated and processed through a multi-layer perceptron (MLP), as shown in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Concat</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>avg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where the MLP consists of two fully connected layers with ReLU activation in between and Sigmoid activation <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> at the end. The resulting attention weights are reshaped to [2,B,C,1,1] and applied to the original features in a residual manner:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>global</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>instance</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula></p>
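<p>The channel attention pipeline of Eqs. (1)&#x2013;(5) can be sketched in PyTorch as follows; the MLP hidden size and reduction ratio are illustrative assumptions:</p>

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention following Eqs. (1)-(5); layer sizes are assumptions."""
    def __init__(self, channels, inst_dim=480, reduction=16):
        super().__init__()
        self.proj = nn.Linear(inst_dim, channels)          # Eq. (1): W_atn
        self.mlp = nn.Sequential(                          # Eq. (4): two FC layers
            nn.Linear(4 * channels, 4 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(4 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, f_global, f_instance):
        B, C, H, W = f_global.shape
        f_inst = self.proj(f_instance).view(B, C, 1, 1)    # expand to B x C x 1 x 1
        f_cat = torch.cat([f_global, f_inst.expand(-1, -1, H, W)], dim=1)  # Eq. (2)
        f_avg = f_cat.mean(dim=(2, 3))                     # Eq. (3): AvgPool
        f_max = f_cat.amax(dim=(2, 3))                     # Eq. (3): MaxPool
        w = self.mlp(torch.cat([f_avg, f_max], dim=1))     # Eq. (4)
        w = w.view(B, 2, C, 1, 1).permute(1, 0, 2, 3, 4)   # reshape to [2, B, C, 1, 1]
        # Eq. (5): residual application of the two weight slices
        return f_global * (1 + w[0]) + f_inst * w[1]
```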
<p>The Spatial Attention module focuses on spatial dimension information of feature maps and, in addition to processing global and instance features, optionally accepts instance coordinate information <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mtext>coords</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. During feature interaction, the module computes element-wise multiplication between global features and transformed instance features:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>global</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>instance,expanded</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p>When instance coordinates are provided, the module computes relative position encoding according to <xref ref-type="disp-formula" rid="eqn-7">Eqs. (7)</xref> and <xref ref-type="disp-formula" rid="eqn-8">(8)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mtext>coords</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>coords.reshape</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>pixel_coords.reshape</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mtext>coords</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mtext>coords.permute</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo><mml:mrow><mml:mtext>reshape</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>B</mml:mi><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mtext>scale_factor</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>After concatenating the instance-modulated feature with the relative position encoding, a series of convolution operations generates the final spatial attention weights:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>W</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>ReLU</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>BN</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Concat</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>feat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mtext>coords</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>The final output is computed through a similar residual mechanism, as described in <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>global</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mo>:</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mo>:</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mrow><mml:mtext>pad</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>input</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mo>:</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where the pad operation ensures consistent output dimensions with the input. Through this design, the spatial attention module effectively captures spatial dependencies in the features.</p>
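<p>The spatial attention branch of Eqs. (6)&#x2013;(10) could be realized as in the sketch below. The pad step of Eq. (10) is simplified away, the undefined input of the second residual term is taken to be the modulated feature of Eq. (6), and the scale factor is an assumed constant:</p>

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention following Eqs. (6)-(10); details are assumptions."""
    def __init__(self, channels, inst_dim=480, scale_factor=32.0):
        super().__init__()
        self.proj = nn.Linear(inst_dim, channels)
        self.scale_factor = scale_factor
        self.conv = nn.Sequential(                         # Eq. (9)
            nn.Conv2d(channels + 2, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_global, f_instance, coords):
        B, C, H, W = f_global.shape
        f_inst = self.proj(f_instance).view(B, C, 1, 1)
        f_feat = f_global * f_inst                         # Eq. (6)
        # Relative position encoding: instance coords minus pixel coords.
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([xs, ys], dim=-1).float().view(-1, 2)        # (H*W, 2)
        rel = coords.view(B, 1, 2) - pix.view(1, -1, 2)                # Eq. (7)
        rel = rel.permute(0, 2, 1).reshape(B, 2, H, W) / self.scale_factor  # Eq. (8)
        w_s = self.conv(torch.cat([f_feat, rel], dim=1))   # Eq. (9): (B, 2C, H, W)
        # Eq. (10): residual over the first C channels; the second C channels
        # modulate the instance branch (the pad step is omitted in this sketch).
        return f_global * (1 + w_s[:, :C]) + f_feat * w_s[:, C:]
```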
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Multi-Scale Feature Correction</title>
<p>The Multiscale Feature Correction Module (MFCM) is a feature enhancement component specifically designed for human pose estimation tasks in virtual maintenance scenarios (<xref ref-type="fig" rid="fig-7">Fig. 7</xref>). Through adaptive multiscale processing and dual attention mechanisms, this module significantly enhances feature representation capabilities and adaptability to complex scenarios.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Multi-scale feature correction module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-7.tif"/>
</fig>
<p>At the implementation level, MFCM first constructs a multiscale feature pyramid. The input global feature map F undergoes multiscale sampling, with the sampling coefficients (scales) typically set to [1, 2], corresponding to the original scale and 1/2 scale. The sampling is implemented using bilinear interpolation, which can be expressed as:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Interpolate</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>F</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mtext>scale</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>scales</mml:mtext></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mtext>Interpolate</mml:mtext></mml:math></inline-formula> represents the bilinear interpolation function, and <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>s</mml:mi></mml:math></inline-formula> is the scaling coefficient. This multiscale design enables the module to capture feature information at different scales simultaneously, laying the foundation for subsequent feature correction. The core of feature correction lies in the synergistic effect of channel attention and spatial attention. The channel attention module adopts a &#x201C;squeeze-and-excitation&#x201D; mechanism; its implementation proceeds as follows. First, global average pooling is performed on the feature map:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>GAP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>H</mml:mi></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>W</mml:mi></mml:munderover><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Then channel attention weights are computed through two fully connected layers with nonlinear activations:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mtext>channel</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:math></inline-formula> are parameters of two fully connected layers, <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>r</mml:mi></mml:math></inline-formula> is the reduction ratio (typically set to 16), <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is the ReLU activation function, and <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the sigmoid activation function. This design learns inter-channel dependencies through two nonlinear transformations.</p>
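<p>The squeeze-and-excitation computation of Eqs. (12)&#x2013;(13) maps directly onto a small module; the sketch below uses the stated reduction ratio r = 16:</p>

```python
import torch
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """Squeeze-and-excitation channel attention of MFCM (Eqs. 12-13)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: squeeze C -> C/r
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2: excite C/r -> C
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, f_s):
        z = f_s.mean(dim=(2, 3))                         # Eq. (12): global avg pool
        a = self.fc(z)                                   # Eq. (13)
        return a.view(a.size(0), a.size(1), 1, 1)        # broadcastable weights
```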
<p>The spatial attention module focuses on the spatial information distribution of feature maps. It processes the feature map through parallel average pooling and max pooling branches:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>avg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>AvgPool</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>MaxPool</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>These pooled features are concatenated and processed by a convolution layer to generate spatial attention weights:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mtext>spatial</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>avg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where [;] represents channel-wise concatenation, and <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mtext>Conv</mml:mtext><mml:mrow><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is a <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:math></inline-formula> convolution layer used to model spatial attention mapping. The larger kernel size enables capturing broader spatial context information. The feature correction process integrates channel attention and spatial attention outputs through adaptive weights:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msubsup><mml:mi>F</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>corr</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mtext>channel</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mtext>spatial</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> are learnable balance factors, initially set to 0.5, and <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> represents element-wise multiplication. This design allows the model to automatically adjust the relative importance of the two attention mechanisms during training. To better integrate multiscale information, MFCM introduces an adaptive scale weight mechanism. First, feature aggregation is performed on the global feature F:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>GAP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>F</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Then scale weights are generated through the softmax function:
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mi>w</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msub><mml:mi>W</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:math></inline-formula> is a learnable weight matrix. The final multiscale fusion feature is obtained through weighted summation:
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>enhanced</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>scales</mml:mtext></mml:mrow></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mtext>Upsample</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mrow><mml:mtext>corr</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>To maintain feature consistency, features at all scales are upsampled to the original resolution before fusion. This design ensures that the final enhanced features retain both fine-grained details from the original scale and contextual information from larger scales.</p>
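<p>Putting the pieces together, the correction-and-fusion pipeline of Eqs. (11) and (14)&#x2013;(19) can be sketched as follows. In a full implementation <inline-formula id="ieqn-sketch-lc"><mml:math><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-sketch-ls"><mml:math><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> would be learnable parameters; here they are fixed at their initial value 0.5, and layer sizes are assumptions:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEAttn(nn.Module):
    """Channel attention of Eqs. (12)-(13)."""
    def __init__(self, c, r=16):
        super().__init__()
        h = max(c // r, 1)
        self.fc = nn.Sequential(nn.Linear(c, h), nn.ReLU(inplace=True),
                                nn.Linear(h, c), nn.Sigmoid())
    def forward(self, f):
        return self.fc(f.mean(dim=(2, 3))).view(f.size(0), -1, 1, 1)

class SpatialAttn(nn.Module):
    """Spatial attention of Eqs. (14)-(15) with a 7x7 kernel."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, f):
        m = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(m))               # Eq. (15)

def mfcm(feat, ch_attn, sp_attn, w_g, lam_c=0.5, lam_s=0.5, scales=(1, 2)):
    """Correct each scale (Eqs. 11, 16), then fuse with adaptive weights (Eqs. 17-19)."""
    B, C, H, W = feat.shape
    w = torch.softmax(w_g(feat.mean(dim=(2, 3))), dim=1)         # Eqs. (17)-(18)
    out = torch.zeros_like(feat)
    for i, s in enumerate(scales):
        f_s = feat if s == 1 else F.interpolate(
            feat, scale_factor=1.0 / s, mode="bilinear", align_corners=False)  # Eq. (11)
        f_corr = (f_s + lam_c * (f_s * ch_attn(f_s))
                      + lam_s * (f_s * sp_attn(f_s)))            # Eq. (16)
        f_up = F.interpolate(f_corr, size=(H, W), mode="bilinear",
                             align_corners=False)                # upsample before fusion
        out = out + w[:, i].view(B, 1, 1, 1) * f_up              # Eq. (19)
    return out
```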
<p>Given the specificity of virtual maintenance scenarios, we set the key hyperparameters based on theoretical analysis and practical experience. The dimensionality reduction ratio <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi>r</mml:mi></mml:math></inline-formula> balances feature representation capability against computational efficiency; the choice of <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula> maintains sufficient feature representation space while controlling computational complexity, following standard analyses of channel attention mechanisms. The attention scaling factors <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> are initialized to 0.5 so that features at different scales contribute evenly at the start of training. The temperature parameter <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.07</mml:mn></mml:math></inline-formula> in contrastive learning follows established practice to ensure effective discrimination between positive and negative samples.</p>
<p>To improve the model&#x2019;s generalization capability and reduce overfitting risks, we designed multi-level data augmentation strategies: geometric transformation augmentation simulates different observation angles and operational postures through random rotation, scaling transformation, and horizontal flipping; photometric augmentation adapts to different lighting environments through color jittering, brightness and contrast adjustment; spatial perturbation includes random cropping and moderate perturbation of keypoint positions to enhance the model&#x2019;s adaptability to spatial variations. The design of these augmentation strategies fully considers the characteristics and challenges of virtual maintenance scenarios.</p>
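<p>A minimal sketch of the three augmentation families described above is given below. The parameter ranges (flip probability, jitter magnitudes) are illustrative assumptions, not the paper&#x2019;s exact values.</p>

```python
import numpy as np

def augment(img, kpts, rng):
    """Illustrative multi-level augmentation: geometric transformation,
    photometric jitter, and spatial perturbation of keypoints.
    img: (H, W) image array; kpts: (K, 2) array of (x, y) positions."""
    img = img.astype(float)
    # Geometric: horizontal flip with probability 0.5 (rotation/scaling omitted)
    if rng.random() < 0.5:
        img = img[:, ::-1].copy()
        kpts = kpts.copy()
        kpts[:, 0] = img.shape[1] - 1 - kpts[:, 0]
    # Photometric: brightness/contrast jitter (ranges are hypothetical)
    a = rng.uniform(0.8, 1.2)
    b = rng.uniform(-10, 10)
    img = np.clip(a * img + b, 0, 255)
    # Spatial perturbation: small Gaussian jitter on keypoint positions
    kpts = kpts + rng.normal(0.0, 1.0, size=kpts.shape)
    return img, kpts
```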
</sec>
<sec id="s3_2_4">
<label>3.2.4</label>
<title>Multi-Scale Feature Fusion</title>
<p>The Multiscale Feature Fusion Module (MFFM) extends traditional feature fusion approaches with multi-level interaction and fusion strategies (<xref ref-type="fig" rid="fig-8">Fig. 8</xref>). Unlike conventional methods, MFFM is designed for the specific challenges of virtual maintenance scenarios, particularly the requirement for fine motion recognition, which enables more effective feature integration across scales.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Multi-scale feature fusion module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-8.tif"/>
</fig>
<p>The first key component of MFFM is the cross-attention mechanism. Unlike traditional self-attention, cross-attention enables information exchange between features at different scales. The implementation begins with linear transformations of input features to generate queries, keys, and values:
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>Q</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:math></inline-formula> represents different scales, <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>W</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msub><mml:mi>W</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msub><mml:mi>W</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula 
id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>b</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>b</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>b</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> are learnable parameters. The cross-attention scores are then computed through scaled dot-product attention:
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msubsup><mml:mi>K</mml:mi><mml:mn>2</mml:mn><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mn>21</mml:mn></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msubsup><mml:mi>K</mml:mi><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> is the feature dimension. These attention scores are used to weight and combine features across scales:
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mtext>ca</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mn>12</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>V</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mtext>ca</mml:mtext></mml:mrow></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mn>21</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mtext>LayerNorm</mml:mtext></mml:math></inline-formula> denotes layer normalization, which stabilizes the feature distributions and accelerates training. The channel embedding module further enhances feature representation capability by processing spatial and channel information in parallel paths. First, the cross-attention outputs are concatenated and dimensionally reduced:
<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>concat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Concat</mml:mtext></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mtext>ca</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mtext>ca</mml:mtext></mml:mrow></mml:mrow></mml:msubsup><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>reduced</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>concat</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
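<p>The cross-attention exchange of Eqs. (20)&#x2013;(22) can be written out directly. This NumPy sketch treats each scale as a token sequence and, for brevity, shares one set of projection matrices and omits the bias terms of Eq. (20).</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attend(F1, F2, Wq, Wk, Wv):
    """Eqs. (20)-(22): each scale queries the other scale's keys/values.
    F1: (N1, d), F2: (N2, d) token sequences; Wq/Wk/Wv: (d, d)."""
    d = F1.shape[1]
    Q1, K1, V1 = F1 @ Wq, F1 @ Wk, F1 @ Wv   # Eq. (20), biases omitted
    Q2, K2, V2 = F2 @ Wq, F2 @ Wk, F2 @ Wv
    A12 = softmax(Q1 @ K2.T / np.sqrt(d))    # Eq. (21)
    A21 = softmax(Q2 @ K1.T / np.sqrt(d))
    F1_ca = layer_norm(F1 + A12 @ V2)        # Eq. (22): residual + LayerNorm
    F2_ca = layer_norm(F2 + A21 @ V1)
    return F1_ca, F2_ca
```

<p>Note that the residual connection preserves each scale&#x2019;s own information while the attention term injects context from the other scale.</p>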
<p>The reduced features are then processed through spatial and channel branches:
<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>spatial</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>DWConv</mml:mtext></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>reduced</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>channel</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>reduced</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>MLP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>GAP</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>reduced</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mtext>DWConv</mml:mtext><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> represents depthwise separable convolution with kernel size <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-94"><mml:math 
id="mml-ieqn-94"><mml:mtext>MLP</mml:mtext></mml:math></inline-formula> denotes a multilayer perceptron. The outputs from both branches are combined and processed through a <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolution:
<disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>ce</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>spatial</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>channel</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
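<p>The channel path of Eq. (25) is essentially a squeeze-and-excite gate. The sketch below shows only that path; the two-layer MLP with ReLU is an assumed bottleneck shape standing in for the module&#x2019;s MLP.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_branch(F, W1, W2):
    """Channel path of Eq. (25): F_channel = F (.) sigma(MLP(GAP(F))).
    F: (C, H, W); W1: (C, C//r), W2: (C//r, C) form an assumed
    two-layer bottleneck MLP with ReLU."""
    z = F.mean(axis=(1, 2))                         # GAP -> (C,)
    gate = sigmoid(np.maximum(z @ W1, 0.0) @ W2)    # sigma(MLP(z)) -> (C,)
    return F * gate[:, None, None]                  # broadcast (.) over H, W
```

<p>Because the gate lies in (0, 1), each channel of a non-negative feature map can only be attenuated, never amplified.</p>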
<p>The multiscale convolution module captures context information at multiple scales through parallel convolutions with different kernel sizes:
<disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>ms</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>ce</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:math></inline-formula> represents the set of convolution kernel sizes, and <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> are learnable weights for each scale. The final fusion feature combines multiscale convolution output with channel embedding features:
<disp-formula id="eqn-28"><label>(28)</label><mml:math id="mml-eqn-28" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>fused</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>ms</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mrow><mml:mtext>ce</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
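<p>Eq. (27) is a learnable-weighted sum of convolutions at kernel sizes 3, 5, and 7. The sketch below uses 1-D averaging filters as a lightweight stand-in for the module&#x2019;s 2-D convolutions, purely to make the weighted-sum structure concrete.</p>

```python
import numpy as np

def conv_same(x, k):
    """'Same'-padded 1-D convolution, a stand-in for the k x k 2-D convs."""
    pad = len(k) // 2
    return np.convolve(np.pad(x, pad, mode="edge"), k, mode="valid")

def multiscale(x, w):
    """Eq. (27): F_ms = sum_k w_k * Conv_k(F_ce) over K = {3, 5, 7};
    each Conv_k here is an averaging filter for clarity."""
    kernels = [np.ones(k) / k for k in (3, 5, 7)]
    return sum(wk * conv_same(x, kern) for wk, kern in zip(w, kernels))
```

<p>With weights that sum to one, a constant signal passes through unchanged, which is a quick sanity check on the padding and normalization.</p>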
<p>The training process employs an end-to-end approach with a comprehensive loss function. The keypoint detection loss uses Mean Square Error (MSE) to measure the difference between predicted and ground truth heatmaps:
<disp-formula id="eqn-29"><label>(29)</label><mml:math id="mml-eqn-29" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>kpt</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>H</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:math></disp-formula>where <italic>N</italic> is the number of samples, <italic>K</italic> is the number of keypoints, <italic>H</italic> and <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mrow><mml:mover><mml:mi>H</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> are predicted and ground truth heatmaps, respectively. To enhance feature discriminability, a contrastive loss is incorporated:
<disp-formula id="eqn-30"><label>(30)</label><mml:math id="mml-eqn-30" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>contra</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mo>+</mml:mo></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msubsup><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mo>+</mml:mo></mml:msubsup></mml:math></inline-formula> are feature representations of the same individual from different viewpoints, and <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is the temperature parameter. The final loss function combines keypoint detection loss, contrastive loss, and L2 regularization:
<disp-formula id="eqn-31"><label>(31)</label><mml:math id="mml-eqn-31" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>kpt</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>contra</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> are weight coefficients that balance the contribution of each term. The L2 regularization term <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> helps prevent overfitting by constraining the magnitude of model parameters.</p>
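<p>The composite objective of Eqs. (29)&#x2013;(31) can be written out term by term. This is a direct transcription, not the training code: the contrastive term is averaged over the batch for this sketch, and the denominator follows Eq. (30) in summing over <monospace>j != i</monospace> only.</p>

```python
import numpy as np

def total_loss(H_pred, H_gt, feats, pos, params, tau=0.07, alpha=0.1, beta=1e-4):
    """Eqs. (29)-(31). H_pred/H_gt: (N, K, h, w) heatmaps;
    feats/pos: (N, d) embeddings of each sample and its positive view;
    params: list of weight arrays entering the L2 term."""
    N = H_pred.shape[0]
    # Eq. (29): mean squared error between predicted and ground-truth heatmaps
    L_kpt = ((H_pred - H_gt) ** 2).sum() / N
    # Eq. (30): contrastive term, denominator over j != i as written
    sims = feats @ feats.T / tau
    pos_sim = (feats * pos).sum(axis=1) / tau
    L_contra = 0.0
    for i in range(N):
        denom = np.exp(np.delete(sims[i], i)).sum()
        L_contra -= np.log(np.exp(pos_sim[i]) / denom)
    L_contra /= N
    # Eq. (31): weighted combination plus L2 regularization
    L_reg = sum((p ** 2).sum() for p in params)
    return L_kpt + alpha * L_contra + beta * L_reg
```

<p>Setting <monospace>alpha</monospace> to zero and passing no parameters isolates the keypoint term, which vanishes when prediction equals ground truth.</p>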
<p>This multi-level, multi-angle feature processing approach enables MFCM and MFFM to effectively handle various challenges in virtual maintenance scenarios, providing robust feature support for subsequent keypoint estimation. The comprehensive design of attention mechanisms, feature fusion strategies, and training objectives ensures effective learning of pose-relevant features while maintaining computational efficiency through the use of depthwise separable convolutions and residual connections.</p>
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Keypoint Estimation</title>
<p>Using individual enhanced features <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup></mml:math></inline-formula> and the fused multi-scale global features <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mtext>fused</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, VMHPE ultimately generates K keypoint heatmaps for each individual through the Keypoint Estimation (KE) module. The KE module adopts a structure similar to HRNet, gradually extracting features through an encoder and then progressively restoring spatial resolution through a decoder to finally output keypoint heatmaps that match the original image dimensions.</p>
<p>Specifically, the encoder part of the KE module consists of several convolutional blocks, with each block containing two <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolutional layers and a downsampling layer. The encoder reduces the spatial dimensions of the input features by a factor of 8 while increasing the number of channels to 512. In the decoder part, the KE module employs a U-Net-like structure, concatenating shallow features from the encoder with decoder features through skip connections at each level, then restoring spatial resolution through upsampling and convolution operations. The decoder&#x2019;s output passes through a <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutional layer to obtain the final K-channel keypoint heatmaps <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msup><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:<disp-formula id="eqn-32"><label>(32)</label><mml:math id="mml-eqn-32" display="block"><mml:msup><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Decoder</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Encoder</mml:mtext></mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:msup><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mtext>fused</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <italic>H</italic> and <italic>W</italic> represent the height and width of the input image, respectively. During training, VMHPE optimizes the entire framework in an end-to-end manner, with a loss function comprising heatmap loss <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>hm</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, contrastive loss <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>cl</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>, and regularization term <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula>. 
The heatmap loss <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>hm</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> measures the difference between predicted and ground truth heatmaps, using a variant of Focal Loss; the contrastive loss <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>cl</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> optimizes feature discriminability during individual representation learning; the regularization term <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> includes parameter norm penalties and sparsity constraints to control model complexity and prevent overfitting.</p>
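<p>Once the KE module has produced the K-channel heatmaps <monospace>H^i</monospace>, keypoint coordinates can be recovered by a standard argmax decoding. The text does not specify the decoding step, so the following is a conventional sketch under the stated heatmap layout (H &#x00D7; W &#x00D7; K, values in [0, 1]).</p>

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Recover (x, y, score) per keypoint from (H, W, K) heatmaps
    by taking the per-channel argmax -- a common decoding convention,
    not one prescribed by the text."""
    H, W, K = heatmaps.shape
    flat = heatmaps.reshape(-1, K)            # (H*W, K)
    idx = flat.argmax(axis=0)                 # best flat index per channel
    ys, xs = np.unravel_index(idx, (H, W))
    scores = flat[idx, np.arange(K)]
    return np.stack([xs, ys, scores], axis=1)  # (K, 3)
```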
<p>The contrastive loss <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>cl</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> is defined as:<disp-formula id="eqn-33"><label>(33)</label><mml:math id="mml-eqn-33" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>cl</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mo>+</mml:mo></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mo>+</mml:mo></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msubsup><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msup><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:math></inline-formula> represents the features of the <italic>i</italic>-th individual, <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msubsup><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mo>+</mml:mo></mml:msubsup></mml:math></inline-formula> represents the positive sample features obtained through data augmentation, <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msubsup><mml:mi>f</mml:mi><mml:mi>j</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msubsup></mml:math></inline-formula> represents features from other individuals, and <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is the temperature coefficient. In our experiments, <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is set to 0.07, a value widely adopted in contrastive learning. 
The regularization term <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> includes parameter norm penalties and sparsity constraints to control model complexity and prevent overfitting:<disp-formula id="eqn-34"><label>(34)</label><mml:math id="mml-eqn-34" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>w</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>A</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> represents the learnable parameters of the model, <italic>A</italic> represents attention matrices, and <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>w</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-125"><mml:math 
id="mml-ieqn-125"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:math></inline-formula> are balancing coefficients. The <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> norm constraint encourages sparsity in attention matrices, making the model focus more on salient features.</p>
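<p>Eq. (34) combines an L2 penalty on the weights with an L1 sparsity penalty on attention matrices, and is short enough to state directly. The coefficient values in this sketch are illustrative; the text does not report the actual settings.</p>

```python
import numpy as np

def reg_loss(params, attn_mats, lam_w=1e-4, lam_s=1e-5):
    """Eq. (34): L_reg = lam_w * sum ||theta||_2^2 + lam_s * sum ||A||_1.
    params: list of weight arrays; attn_mats: list of attention matrices.
    Coefficient defaults are hypothetical."""
    l2 = sum((p ** 2).sum() for p in params)
    l1 = sum(np.abs(A).sum() for A in attn_mats)
    return lam_w * l2 + lam_s * l1
```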
<p>The final loss function is a weighted sum of the three components:<disp-formula id="eqn-35"><label>(35)</label><mml:math id="mml-eqn-35" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>hm</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>cl</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:math></inline-formula> is the weight coefficient for contrastive loss. Through cross-validation, we found that <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> achieves the best performance. This indicates that while contrastive loss is important, it should not dominate the entire training process, as our ultimate goal is to generate accurate keypoint heatmaps rather than optimize feature space distribution. This multi-objective training strategy enables VMHPE to simultaneously optimize individual representation, feature correction, feature fusion, and keypoint estimation components, significantly improving model performance in pose estimation for virtual maintenance scenarios.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Comparison</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Setup</title>
<p>This section first verifies the accuracy of the VMHPE algorithm in human keypoint prediction and then establishes a maintenance action assessment system based on the high-precision prediction results. The experiments were implemented in the PyTorch framework on a workstation equipped with an NVIDIA RTX 3090 GPU. The model used ImageNet-pretrained HRNet-W32 as the backbone network. To ensure evaluation reliability, input images were uniformly resized to 512 pixels on the short side while maintaining the original aspect ratio. Average Precision (AP) and Average Recall (AR) were used as the basic evaluation metrics.</p>
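<p>The short-side resizing described above can be sketched as a small helper; the function name is ours, and the real pipeline would also resample the image itself, which is omitted here:</p>

```python
# Compute the output size for resizing an image so that its short side becomes
# 512 pixels while the original aspect ratio is preserved, as in the setup above.

def short_side_resize(width: int, height: int, target: int = 512) -> tuple:
    """Return (new_width, new_height) with the short side scaled to `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)
```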
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Comparison with Existing Methods</title>
<p>We compared VMHPE with mainstream human pose estimation methods. <xref ref-type="table" rid="table-3">Table 3</xref> shows that our method significantly outperforms existing approaches. Compared to RTMO-S, we achieved a 21.0-percentage-point improvement in AP. Against the YOLOXPose models, our method improved AP by 11.7&#x2013;12.9 percentage points. Using the same HRNet-W32 backbone as Disentangled Keypoint Regression (DEKR) and CID, VMHPE improved AP by 2.3 percentage points and AR by 1.5&#x2013;1.9 percentage points.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Performance comparison of different methods on dataset</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Model configuration</th>
<th>Backbone</th>
<th>AP <bold>(%)</bold></th>
<th>AR <bold>(%)</bold></th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>RTMO-S</td>
<td>CSPDarknet</td>
<td>73.4</td>
<td>78.0</td>
<td>&#x2212;18.7% AP, &#x2212;16.3% AR</td>
</tr>
<tr>
<td>YOLOXPose-S</td>
<td>CSPDarknet</td>
<td>82.7</td>
<td>86.2</td>
<td>&#x2212;9.4% AP, &#x2212;8.1% AR</td>
</tr>
<tr>
<td>YOLOXPose-L</td>
<td>CSPDarknet</td>
<td>81.5</td>
<td>84.6</td>
<td>&#x2212;10.6% AP, &#x2212;9.7% AR</td>
</tr>
<tr>
<td>DEKR</td>
<td>HRNet-W32</td>
<td>92.1</td>
<td>93.9</td>
<td>0% AP, &#x2212;0.4% AR</td>
</tr>
<tr>
<td>CID</td>
<td>HRNet-W32</td>
<td>92.1</td>
<td>94.3</td>
<td>Baseline</td>
</tr>
<tr>
<td><bold>VMHPE (ours)</bold></td>
<td><bold>HRNet-W32</bold></td>
<td><bold>94.4</bold></td>
<td><bold>95.8</bold></td>
<td><bold>&#x002B;2.3% AP, &#x002B;1.5% AR</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn1" fn-type="other"><p>Note: Bold entry indicates the proposed method in this study.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>Considering the dual requirements of accuracy and real-time performance in virtual maintenance scenarios, we further evaluated the practicality of different methods. Based on accuracy and inference time metrics, existing methods are divided into four levels (as shown in <xref ref-type="table" rid="table-4">Table 4</xref>).</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Classification standards for human pose estimation methods in virtual maintenance scenarios</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Performance level</th>
<th align="center">AP range <bold>(%)</bold></th>
<th align="center">Inference time (ms)</th>
<th align="center">Application scenario characteristics</th>
</tr>
</thead>
<tbody>
<tr>
<td>Professional</td>
<td><inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mo>&#x003E;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>90</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mo>&#x003C;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>250</mml:mn></mml:math></inline-formula></td>
<td>Suitable for high-precision maintenance training and assessment</td>
</tr>
<tr>
<td>Practical</td>
<td><inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mn>80</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mn>90</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mo>&#x003C;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>100</mml:mn></mml:math></inline-formula></td>
<td>Meets general maintenance training needs</td>
</tr>
<tr>
<td>Basic guidance</td>
<td><inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mn>70</mml:mn></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mn>80</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mo>&#x003C;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>150</mml:mn></mml:math></inline-formula></td>
<td>Suitable for simple maintenance action guidance</td>
</tr>
<tr>
<td>Needs optimization</td>
<td><inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mo>&#x003C;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>70</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow><mml:mn>150</mml:mn></mml:math></inline-formula></td>
<td>Requires further improvement</td>
</tr>
</tbody>
</table>
</table-wrap>
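<p>The criteria of <xref ref-type="table" rid="table-4">Table 4</xref> can be written out as a small classifier; the thresholds are copied from the table, while the function name is ours:</p>

```python
# Map an accuracy/latency pair to the four performance levels of Table 4.

def performance_level(ap: float, ms: float) -> str:
    """ap in percent, ms = per-frame inference time in milliseconds."""
    if ap > 90 and ms < 250:
        return "Professional"
    if 80 <= ap <= 90 and ms < 100:
        return "Practical"
    if 70 <= ap < 80 and ms < 150:
        return "Basic guidance"
    return "Needs optimization"
```

<p>For example, an AP of 94.4% at 213.4 ms falls in the professional level, while an AP of 82.7% at 67.5 ms is practical level.</p>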
<p>Experimental results demonstrate that VMHPE achieves professional-level performance (AP &#x003D; 94.4%, inference time 213.4 ms), meeting both accuracy and real-time requirements for virtual maintenance. As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, VMHPE successfully breaks through the performance bottleneck that appears when AP and AR exceed 90%, demonstrating highly balanced characteristics with an AP/AR ratio of 0.985. In contrast, YOLOXPose models offer faster inference (46.3&#x2013;67.5 ms) but only reach practical-level accuracy (AP <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mo>&#x2248;</mml:mo></mml:math></inline-formula> 82%); RTMO-S performs at a basic level (AP &#x003D; 73.4%) suitable only for simple guidance; and DEKR, despite high accuracy (AP &#x003D; 92.1%), is limited by its longer inference time (336.1 ms). By maintaining inference efficiency comparable to the baseline CID (213.4 vs. 213.1 ms) while significantly improving accuracy (&#x002B;2.3% AP, &#x002B;1.5% AR), VMHPE uniquely balances precision and speed. This makes it particularly valuable for professional maintenance training systems, where its multi-scale feature processing strategy simultaneously ensures accurate action assessment, high recall rates, and responsive real-time interaction.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Performance distribution of human pose estimation methods</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-9.tif"/>
</fig>
<p>To gain an in-depth understanding of the improvements of VMHPE in maintenance scenarios, we conducted a detailed analysis of the prediction accuracy of individual keypoints. As shown in <xref ref-type="table" rid="table-5">Table 5</xref>, our method demonstrates differentiated improvements across different types of keypoints.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Comparative analysis of the percentage of correct keypoints (PCK) metrics for different keypoints</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Type</th>
<th>Base <bold>(%)</bold></th>
<th>Ours <bold>(%)</bold></th>
<th>Improvement <bold>(%)</bold></th>
<th>Analysis description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Wrist</td>
<td>85.6</td>
<td>94.8</td>
<td>&#x002B;9.2</td>
<td>Precision of manual operations is key; shows the largest improvement</td>
</tr>
<tr>
<td>Elbow</td>
<td>88.3</td>
<td>95.1</td>
<td>&#x002B;6.8</td>
<td>Critical for control of tool operation posture in heavy lifting tasks</td>
</tr>
<tr>
<td>Shoulder</td>
<td>92.1</td>
<td>96.7</td>
<td>&#x002B;4.6</td>
<td>Supports body movement; precision clearly improved</td>
</tr>
<tr>
<td>Hip</td>
<td>90.4</td>
<td>95.8</td>
<td>&#x002B;5.4</td>
<td>Critical support points for transportation tasks</td>
</tr>
<tr>
<td>Knee</td>
<td>87.9</td>
<td>94.3</td>
<td>&#x002B;6.4</td>
<td>Important points showing operational improvement</td>
</tr>
<tr>
<td>Ankle</td>
<td>89.2</td>
<td>93.8</td>
<td>&#x002B;4.6</td>
<td>Foundation for postural balance and stability improvement</td>
</tr>
<tr>
<td>Head</td>
<td>95.7</td>
<td>97.8</td>
<td>&#x002B;2.1</td>
<td>Already high precision with limited improvement space</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From the data, it can be seen that VMHPE achieved the most significant improvements on wrist and elbow keypoints (9.2 and 6.8 percentage points, respectively). This is highly consistent with the characteristics of maintenance tasks: tool operation and component installation depend primarily on precise control of the hands and elbows, which is exactly where our method is strongest. In comparison, the improvement for the head and other relatively stable keypoints is smaller (only 2.1 percentage points). Further analysis of different maintenance task scenarios reveals that in precision tool operation, the prediction stability of wrist keypoints (standard deviation of repeated keypoint localization) decreased from the baseline of 15.3 pixels to 8.7 pixels; in heavy component transportation, the prediction accuracy of hip-knee-ankle support chain angles improved by 7.8%. These targeted improvements validate the advantages of the multi-scale feature extraction mechanism in capturing fine-grained maintenance action details.</p>
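<p>For reference, a generic version of the PCK metric reported in <xref ref-type="table" rid="table-5">Table 5</xref>; threshold conventions vary across benchmarks, so the normalization length and the alpha factor here are illustrative assumptions rather than the paper&#x2019;s exact protocol:</p>

```python
import math

# Percentage of Correct Keypoints: a prediction counts as correct when its
# distance to the ground truth is within alpha * ref_len, where ref_len is a
# reference length (e.g., head or torso size, depending on the benchmark).

def pck(pred, gt, ref_len, alpha=0.5):
    """pred, gt: lists of (x, y) keypoints; returns PCK in percent."""
    correct = sum(math.dist(p, g) <= alpha * ref_len for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)
```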
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Ablation Experiments</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Component Analysis</title>
<p>To verify VMHPE&#x2019;s core components, we performed progressive addition of modules to the CID baseline. As shown in <xref ref-type="table" rid="table-6">Table 6</xref>, adding the multi-scale feature correction module (MFC) improves AP by 0.8% while maintaining AR. Further introducing the multi-scale fusion attention mechanism (MFA) improves AP and AR by 0.9% and 1.1%, respectively. The complete model (MFC&#x002B;MFA) achieves a total improvement of 2.3% in AP and 1.3% in AR compared to baseline.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Performance comparison of different component combinations</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Model configuration</th>
<th>AP <bold>(%)</bold></th>
<th>AR <bold>(%)</bold></th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>Without MFC and MFA</td>
<td>92.1</td>
<td>94.3</td>
<td>Reference benchmark</td>
</tr>
<tr>
<td>With MFC only</td>
<td>92.9</td>
<td>94.3</td>
<td>&#x002B;0.8% AP</td>
</tr>
<tr>
<td>With MFA only</td>
<td>93.8</td>
<td>95.4</td>
<td>&#x002B;0.9% AP, &#x002B;1.1% AR</td>
</tr>
<tr>
<td><bold>Ours (with MFC and MFA)</bold></td>
<td><bold>94.4</bold></td>
<td><bold>95.6</bold></td>
<td><bold>&#x002B;2.3% AP, &#x002B;1.3% AR</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-6fn1" fn-type="other"><p>Note: Bold entry represents the complete proposed model with all components.</p></fn>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="fig" rid="fig-10">Fig. 10</xref> shows the performance change trend across components. Model improvement exhibits nonlinear characteristics, with MFA playing a crucial role in feature integration. AR increases significantly after adding fusion attention, from 94.3% to 95.4%, indicating enhanced detection capability across different scales. AP shows continuous improvement, reflecting the cumulative effect of components in enhancing localization accuracy.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Performance comparison of different component combinations in VMHPE model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-10.tif"/>
</fig>
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Embedding Dimension Analysis</title>
<p>As shown in <xref ref-type="fig" rid="fig-11">Fig. 11</xref>, feature embedding dimensions significantly impact model performance. The model demonstrates good results with 8-dimensional features (AP &#x003D; 93.9%, AR &#x003D; 95.3%), reaches optimal performance at 16 dimensions (AP &#x003D; 94.4%, AR &#x003D; 95.8%), but shows performance degradation at 32 dimensions (AP &#x003D; 92.9%, AR &#x003D; 94.5%). This asymmetric performance curve indicates that excessive dimensions have a stronger negative impact than insufficient dimensions, likely due to overfitting in high-dimensional feature spaces. The 16-dimension configuration achieves an optimal balance between performance and computational efficiency.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>3D performance surface analysis of feature embedding dimensions</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-11.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-12">Fig. 12</xref>, there are significant differences in keypoint localization accuracy among different pose estimation methods in virtual maintenance scenarios. The VMHPE model significantly improves keypoint prediction accuracy through multi-scale feature processing mechanisms, especially in key maintenance areas such as hand operations and torso support. In comparison, RTMO-S and YOLOXPose series lack precision in capturing fine movements, while DEKR performs well but still has errors in complex postures. This advantage in accuracy lays a reliable foundation for subsequent maintenance action evaluation based on keypoints.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Comparison of pose estimation performance across different models <bold>(a&#x2013;e)</bold> in virtual maintenance scenarios</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-12.tif"/>
</fig>
</sec>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Maintenance Action Assessment</title>
<sec id="s4_4_1">
<label>4.4.1</label>
<title>Assessment Standard Establishment</title>
<p>Based on the high-precision keypoint prediction results, we calculated the similarity between operators&#x2019; actions and professional standards by comparing the topological structure similarity between keypoint sequences. Professional personnel&#x2019;s standard actions exhibited clear pattern characteristics: stable wrist-elbow-shoulder angles during tool operation and specific trunk-hip-knee support angles during transport, forming reliable benchmarks for evaluation. <xref ref-type="table" rid="table-7">Table 7</xref> presents a quantitative comparison of key angle chains between professional maintenance personnel and ordinary operators, highlighting the differences in posture precision across action types.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Key angle comparison analysis between professional maintenance personnel and ordinary operators</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Action type</th>
<th align="center">Key angle chain</th>
<th align="center">Professional standard range</th>
<th align="center">Ordinary operator average range</th>
<th align="center">Similarity score</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tool operation</td>
<td>Wrist-Elbow-Shoulder</td>
<td><inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mn>75.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>12.3</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mn>82.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>15.3</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:mn>88.5</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component transport</td>
<td>Trunk-Hip-Knee</td>
<td><inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mn>42.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>9.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mn>55.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mn>82.3</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component transfer</td>
<td>Two-hand Coordination Symmetry</td>
<td><inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:mn>88.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mn>95.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>18.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mn>85.7</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component replacement</td>
<td>Squatting Support Chain</td>
<td><inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:mn>78.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>12.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mn>85.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>14.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:mn>84.2</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
</tr>
<tr>
<td>Component flip</td>
<td>Trunk Forward Lean Angle</td>
<td><inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mn>42.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>10.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mn>52.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>13.5</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mn>83.6</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
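<p>The key angle chains in <xref ref-type="table" rid="table-7">Table 7</xref> reduce to the angle at the middle joint of a three-keypoint chain. A minimal 2D sketch, with names of our choosing:</p>

```python
import math

# Angle (in degrees) at joint b for the chain a-b-c, e.g., the
# wrist-elbow-shoulder chain with b = elbow. Points are (x, y) coordinates.

def chain_angle(a, b, c):
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cosang = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
```

<p>Comparing such angles frame by frame against the professional ranges in the table is one way to derive the similarity scores discussed above.</p>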
<p>Through experimental validation, we established a four-level assessment standard as shown in <xref ref-type="table" rid="table-8">Table 8</xref>: <inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mrow><mml:mo>&#x2265;</mml:mo></mml:mrow><mml:mn>90</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> (professional level), 80%&#x2013;89% (proficient level), 70%&#x2013;79% (qualified level), and &#x0003C;70% (needs improvement), providing an objective, data-driven approach for standardizing maintenance training. <xref ref-type="table" rid="table-8">Table 8</xref> provides detailed descriptions of maintenance action evaluation standards based on skeletal topology similarity, including similarity ranges, evaluation standard descriptions, and key feature performance indicators for each level.</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Maintenance action evaluation standards based on skeletal topology similarity</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Assessment level</th>
<th align="center">Similarity range</th>
<th align="center">Evaluation standard description</th>
<th align="center">Key feature performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Professional level</td>
<td><inline-formula id="ieqn-157"><mml:math id="mml-ieqn-157"><mml:mrow><mml:mo>&#x2265;</mml:mo></mml:mrow><mml:mn>90</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>Actions highly conform to professional standards, good stability</td>
<td>Key angle chain fluctuation range <inline-formula id="ieqn-158"><mml:math id="mml-ieqn-158"><mml:mrow><mml:mo>&#x2264;</mml:mo></mml:mrow><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, continuous movements</td>
</tr>
<tr>
<td>Proficient level</td>
<td><inline-formula id="ieqn-159"><mml:math id="mml-ieqn-159"><mml:mn>80</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-160"><mml:math id="mml-ieqn-160"><mml:mn>89</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>Basically meets standard action requirements, details need optimization</td>
<td>Key angle chain fluctuation range <inline-formula id="ieqn-161"><mml:math id="mml-ieqn-161"><mml:mrow><mml:mo>&#x2264;</mml:mo></mml:mrow><mml:msup><mml:mn>15</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, natural transitions</td>
</tr>
<tr>
<td>Qualified level</td>
<td><inline-formula id="ieqn-162"><mml:math id="mml-ieqn-162"><mml:mn>70</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula>&#x2013;<inline-formula id="ieqn-163"><mml:math id="mml-ieqn-163"><mml:mn>79</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>Actions generally correct, standardization needs improvement</td>
<td>Key angle chain fluctuation range <inline-formula id="ieqn-164"><mml:math id="mml-ieqn-164"><mml:mrow><mml:mo>&#x2264;</mml:mo></mml:mrow><mml:msup><mml:mn>20</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, some instability exists</td>
</tr>
<tr>
<td>Needs improvement</td>
<td><inline-formula id="ieqn-165"><mml:math id="mml-ieqn-165"><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mn>70</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>Actions show obvious deviations, requires focused guidance</td>
<td>Key angle chain fluctuation range <inline-formula id="ieqn-166"><mml:math id="mml-ieqn-166"><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow><mml:msup><mml:mn>20</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula>, discontinuous movements</td>
</tr>
</tbody>
</table>
</table-wrap>
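<p>The four-level standard of <xref ref-type="table" rid="table-8">Table 8</xref> maps directly to a threshold lookup; the function name is ours:</p>

```python
# Grade a topology-similarity score (in percent) into the four assessment
# levels: >=90 professional, 80-89 proficient, 70-79 qualified, <70 needs
# improvement.

def assessment_level(similarity_pct: float) -> str:
    if similarity_pct >= 90:
        return "Professional level"
    if similarity_pct >= 80:
        return "Proficient level"
    if similarity_pct >= 70:
        return "Qualified level"
    return "Needs improvement"
```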
<p>To ensure the scientific validity and practicality of the similarity assessment standards, we conducted rigorous statistical validation through expert evaluation experiments and field application verification. First, we invited six experienced maintenance trainers to provide professional ratings (1&#x2013;10 scale) for 20 operators of different skill levels and compared these ratings with the system-calculated similarity scores. As shown in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>, expert ratings and system similarity demonstrate a high positive correlation (Pearson correlation coefficient r &#x003D; 0.89, <italic>p</italic> &#x0003C; 0.001). In particular, around the 90% similarity threshold, expert rating consistency reached its highest level (Cohen&#x2019;s Kappa &#x003D; 0.86), providing strong support for our professional-level threshold setting.</p>
<fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>Correlation analysis between expert scores and system similarity</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66540-fig-13.tif"/>
</fig>
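<p>The expert-agreement check above rests on the Pearson correlation between expert ratings and system similarity. A dependency-free sketch from the definition (in practice, scipy.stats.pearsonr returns the coefficient together with a <italic>p</italic>-value); the data used in the paper is not reproduced here:</p>

```python
import math

# Pearson correlation coefficient from its definition:
#   r = cov(x, y) / (std(x) * std(y))

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```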
<p>Second, we conducted detailed recording and analysis of the actual work performance of these 20 operators. As shown in <xref ref-type="table" rid="table-9">Table 9</xref>, operators in different similarity intervals exhibited significant differences in work efficiency, operational accuracy, and skill levels.</p>
<table-wrap id="table-9">
<label>Table 9</label>
<caption>
<title>Correlation analysis between similarity ratings and actual work performance</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th align="center">Similarity range</th>
<th align="center">Sample size</th>
<th align="center">Maintenance efficiency (min/task)</th>
<th align="center">Operation error rate <bold>(%)</bold></th>
<th align="center">Skill score (100-pt)</th>
<th align="center">Supervisor consistency</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-167"><mml:math id="mml-ieqn-167"><mml:mrow><mml:mo>&#x2265;</mml:mo></mml:mrow><mml:mn>90</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>5</td>
<td><inline-formula id="ieqn-168"><mml:math id="mml-ieqn-168"><mml:mn>12.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>2.1</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-169"><mml:math id="mml-ieqn-169"><mml:mn>2.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>0.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-170"><mml:math id="mml-ieqn-170"><mml:mn>94.5</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>3.2</mml:mn></mml:math></inline-formula></td>
<td>0.91</td>
</tr>
<tr>
<td>80%&#x2013;89%</td>
<td>7</td>
<td><inline-formula id="ieqn-171"><mml:math id="mml-ieqn-171"><mml:mn>17.6</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>3.2</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-172"><mml:math id="mml-ieqn-172"><mml:mn>5.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>1.4</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-173"><mml:math id="mml-ieqn-173"><mml:mn>85.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>4.5</mml:mn></mml:math></inline-formula></td>
<td>0.84</td>
</tr>
<tr>
<td>70%&#x2013;79%</td>
<td>5</td>
<td><inline-formula id="ieqn-174"><mml:math id="mml-ieqn-174"><mml:mn>25.3</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>4.5</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-175"><mml:math id="mml-ieqn-175"><mml:mn>12.4</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>2.8</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-176"><mml:math id="mml-ieqn-176"><mml:mn>76.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>5.3</mml:mn></mml:math></inline-formula></td>
<td>0.77</td>
</tr>
<tr>
<td><inline-formula id="ieqn-177"><mml:math id="mml-ieqn-177"><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mn>70</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula></td>
<td>3</td>
<td><inline-formula id="ieqn-178"><mml:math id="mml-ieqn-178"><mml:mn>38.7</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>6.3</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-179"><mml:math id="mml-ieqn-179"><mml:mn>23.2</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>4.6</mml:mn></mml:math></inline-formula></td>
<td><inline-formula id="ieqn-180"><mml:math id="mml-ieqn-180"><mml:mn>62.8</mml:mn><mml:mo>&#x00B1;</mml:mo><mml:mn>7.1</mml:mn></mml:math></inline-formula></td>
<td>0.82</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The data shows that operators with similarity <inline-formula id="ieqn-181"><mml:math id="mml-ieqn-181"><mml:mo>&#x2265;</mml:mo><mml:mn>90</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> demonstrated significantly higher maintenance efficiency than other groups (<italic>p</italic> &#x0003C; 0.01), the lowest operational error rate (2.3% <inline-formula id="ieqn-182"><mml:math id="mml-ieqn-182"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.8%), and the highest consistency with supervisor evaluations (Kappa &#x003D; 0.91). Through one-way analysis of variance, we verified that these differences have statistical significance (F &#x003D; 16.8, <italic>p</italic> &#x0003C; 0.001), strongly supporting the setting of 90% as the professional level threshold.</p>
<p>It is noteworthy that we discovered an important phenomenon in our data analysis: the relationship between skill level and similarity exhibits distinct performance jumps at critical threshold points. Particularly at the 90% similarity point, the operational error rate significantly decreased from 5.7% to 2.3%, and similar &#x02018;step effects&#x2019; were observed at 80% and 70% thresholds. This nonlinear relationship further confirms the rationality and practical value of our similarity threshold settings.</p>
</sec>
<sec id="s4_4_2">
<label>4.4.2</label>
<title>Application Value Analysis</title>
<p>VMHPE&#x2019;s high-precision prediction performance provides important support for virtual maintenance training systems. The system can collect trainees&#x2019; operation videos in real-time and extract skeletal keypoints through the algorithm. It then compares them with professional personnel&#x2019;s standard actions to provide objective evaluation results. This evaluation method based on precise keypoint prediction achieves standardization and regularization of maintenance training, providing technical guarantee for improving training quality.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion and Analysis</title>
<sec id="s5_1">
<label>5.1</label>
<title>Core Mechanism Analysis</title>
<p>Based on <xref ref-type="sec" rid="s4">Section 4</xref>&#x2019;s experimental results, we analyze how VMHPE&#x2019;s core mechanisms address virtual maintenance challenges. The multi-scale feature correction mechanism improves model AP from 92.1% to 92.9% through adaptive processing of features at different scales. This improvement stems from the feature correction unit&#x2019;s adaptive capability, enhancing local detail features during fine operations while strengthening global structure information during large-scale movements, effectively addressing pose diversity and occlusion challenges. The multi-scale fusion attention mechanism further enhances feature expression and integration capabilities through spatial and channel dimension attention computation, improving AP and AR to 93.8% and 95.4%, respectively. Spatial attention effectively solves pose interference in multi-person scenarios, while channel attention enhances understanding of complex poses. This performance improvement has significant implications for maintenance action assessment, as high-precision keypoint prediction ensures reliability when comparing with standard actions. In terms of real-time performance, VMHPE achieves a good balance between accuracy and speed, with minimal computational overhead (inference time increased by only 0.3 ms) while significantly improving performance (AP increased by 2.3%). This efficient architecture design makes the model particularly suitable for real-time applications like virtual maintenance.</p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Parameter Configuration Analysis</title>
<p>Regarding feature dimension selection, experiments find that 16-dimensional feature configuration (AP &#x003D; 94.4%, AR &#x003D; 95.8%) achieves optimal performance. This result indicates that under our proposed feature correction and fusion mechanisms, medium-scale feature space is sufficient to capture key information in virtual maintenance scenarios. Lower dimensions might lead to information loss, while higher dimensions introduce redundant information that affects feature discrimination capability. The 16-dimensional configuration achieves a good balance between performance and computational overhead, providing an important reference for system deployment. The complete VMHPE achieves significant performance improvements with AP and AR reaching 94.4% and 95.6% respectively through the organic combination of multi-scale feature correction and fusion attention mechanisms. This improvement validates our technical solution&#x2019;s rationality, particularly in complex maintenance scenarios, where these mechanisms provide strong support for accurate pose estimation.</p>
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Dataset Validation Analysis</title>
<p>The maintenance action assessment system established in this research has two key characteristics. First, the assessment benchmark derives from standard action data recorded by professional maintenance personnel, which gives the evaluation standard its authority. Second, high-precision skeletal keypoint prediction ensures the reliability of the action similarity calculation.</p>
<p>Analysis of the action characteristics of professional personnel versus ordinary operators reveals that the distribution patterns of key angle chains effectively reflect operation skill levels. During fine operations, professional personnel show significantly smaller angle fluctuations in the hand key chains than ordinary operators do, providing a basis for establishing quantitative evaluation standards.</p>
<p>Different maintenance tasks impose different precision requirements on actions. Tool operation tasks emphasize hand movement precision, while component transport focuses on overall posture coordination. Our assessment system adapts to these differences by setting task-related weight coefficients, enabling flexible adaptation of the evaluation standard.</p>
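<p>The angle-chain comparison and task-related weighting described above can be sketched as follows; the exponential scoring, the 30-degree tolerance scale, and the function names are illustrative assumptions, not the system&#x2019;s actual formulation.</p>

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by segments b->a and b->c."""
    a, b, c = map(np.asarray, (a, b, c))
    u, v = a - b, c - b
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

def weighted_similarity(ref_angles, test_angles, weights, scale=30.0):
    """Task-weighted similarity between a reference and a test angle chain.

    Each angle's score decays exponentially with its absolute error, and
    `weights` encodes task emphasis (e.g., larger weights on hand joints
    for tool operation tasks). Both choices are illustrative.
    """
    diffs = np.abs(np.asarray(ref_angles, float) - np.asarray(test_angles, float))
    scores = np.exp(-diffs / scale)      # 1.0 at zero error, decaying with deviation
    w = np.asarray(weights, float)
    w = w / w.sum()                      # normalize task weights
    return float(np.sum(w * scores))
```

Swapping the weight vector per task is what lets a single scoring function emphasize hand precision for tool operations and whole-body coordination for component transport.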
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Integration with Pose Optimization Methods</title>
<p>The VMHPE method achieves accurate single-frame pose estimation in virtual maintenance scenarios, but building a complete training system also requires temporal consistency. Continuous operation sequences in virtual maintenance training demand natural smoothness in the temporal dimension, yet existing pose estimators often suffer from jitter. SmoothNet models the natural smoothness of body motion by learning long-term temporal relationships among joints [<xref ref-type="bibr" rid="ref-33">33</xref>]; integrating it with VMHPE can significantly improve temporal smoothness, which is essential for ensuring continuity and standardization in maintenance operation evaluation. From an application development perspective, extending virtual maintenance systems toward three-dimensional evaluation is an inevitable trend. Combining the VMHKP dataset constructed in this study with multi-scale feature fusion strategies, and integrating 3D pose optimization methods such as the Filter with Learned Kinematics (FLK) [<xref ref-type="bibr" rid="ref-34">34</xref>], can provide spatial constraint capabilities for the system. This extension not only leverages VMHPE&#x2019;s accuracy advantages in 2D pose estimation but also uses biomechanical constraints to ensure that poses remain plausible in three-dimensional space, making it particularly suitable for safety evaluation in complex maintenance environments. On the basis of the four-level motion standardization evaluation system proposed in this paper, integrating pose optimization methods can further enhance the reliability and practicality of the evaluation: VMHPE&#x2019;s multi-scale feature correction mechanism provides accuracy assurance for basic detection, temporal smoothing algorithms ensure the coherence of action sequences, and spatial constraint methods guarantee the physiological plausibility of poses. This multi-level integrated architecture not only improves technical indicators but, more importantly, establishes a complete technical chain from accurate detection to standardized evaluation for virtual maintenance training systems, making professional-level (<inline-formula id="ieqn-183"><mml:math id="mml-ieqn-183"><mml:mo>&#x2265;</mml:mo></mml:math></inline-formula>90%) action standardization evaluation more reliable and practical.</p>
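<p>As a rough stand-in for the temporal smoothing stage, the sketch below applies a fixed centered moving average to a keypoint sequence. SmoothNet instead learns long-range temporal filter weights, so this only illustrates where such a module would sit in the pipeline; the window size is an assumption.</p>

```python
import numpy as np

def smooth_keypoints(seq, window=5):
    """Reduce frame-to-frame jitter in a 2D keypoint sequence.

    seq: (T, J, 2) array of per-frame keypoints for J joints.
    A fixed centered moving average stands in for a learned temporal
    filter; windows are clipped at the sequence boundaries.
    """
    T = seq.shape[0]
    half = window // 2
    out = np.empty_like(seq, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = seq[lo:hi].mean(axis=0)   # average over the temporal window
    return out
```

Applied after per-frame VMHPE inference, such a filter leaves a static pose untouched while damping high-frequency jitter, which is the property the evaluation stage relies on.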
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion and Future Work</title>
<p>This paper proposes VMHPE, a human pose estimation method for virtual maintenance scenarios that effectively addresses the challenges of pose diversity and occlusion through multi-scale feature correction and fusion attention mechanisms. The VMHKP dataset we constructed provides an important benchmark for this field, and experimental results confirm the method&#x2019;s superior performance in virtual maintenance contexts. Nevertheless, room for optimization remains, particularly in inference time for complex scenarios. As a 2D pose estimation approach, VMHPE is limited in handling complex rotations and occlusions, while 3D pose information is crucial for accurate assessment in virtual maintenance. Future work will focus on dataset extension and technical innovation, including adding more maintenance scenario types, introducing 3D annotations, exploring 2D-to-3D pose estimation methods, and researching multi-modal feature fusion strategies. Through continuous innovation, we aim to achieve high-quality, standardized, and intelligent virtual maintenance training.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to thank the School of Electromechanical Engineering at Guangdong University of Technology and the Virtual Reality and Visualization Laboratory for providing the research platform and facilities, Pharmapack Technologies Corporation for their support in the joint development project, the technical personnel who participated in the dataset collection and maintenance, and the reviewers for their valuable comments and suggestions.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This research was funded by the Joint Development Project with Pharmapack Technologies Corporation: Open Multi-Person Collaborative Virtual Assembly/Disassembly Training and Virtual Engineering Visualization Platform, Grant Number 23HK0101.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Research conception and design: Shuo Zhang, Hanwu He, and Yueming Wu. Data collection: Shuo Zhang and Yueming Wu. Results analysis and interpretation: Shuo Zhang, Hanwu He, and Yueming Wu. Manuscript draft preparation: Shuo Zhang and Yueming Wu. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The Virtual Maintenance Human Keypoint Dataset (VMHKP) constructed in this study is publicly available at <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.15525037">https://doi.org/10.5281/zenodo.15525037</ext-link>. Additional experimental data and code will be made available upon reasonable request to the corresponding author.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A VR-based complex equipment maintenance training system</article-title>. In: <conf-name>2019 Chinese Automation Congress (CAC); 2019 Nov 22&#x2013;24</conf-name>; <publisher-loc>Hangzhou, China</publisher-loc>. p. <fpage>1741</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cac48633.2019.8996496</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dubey</surname> <given-names>S</given-names></string-name>, <string-name><surname>Dixit</surname> <given-names>M</given-names></string-name></person-group>. <article-title>A comprehensive survey on human pose estimation approaches</article-title>. <source>Multimed Syst</source>. <year>2023</year>;<volume>29</volume>(<issue>1</issue>):<fpage>167</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00530-022-00980-0</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Du</surname> <given-names>G</given-names></string-name></person-group>. <article-title>An enhanced real-time human pose estimation method based on modified YOLOv8 framework</article-title>. <source>Sci Rep</source>. <year>2024</year>;<volume>14</volume>(<issue>1</issue>):<fpage>8012</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-024-58146-z</pub-id>; <pub-id pub-id-type="pmid">38580704</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Du</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>M</given-names></string-name></person-group>. <article-title>A distributed virtual reality system based on real-time dynamic calculation and multi-person collaborative operation applied to the development of subsea production systems</article-title>. <source>Int J Maritime Eng</source>. <year>2021</year>;<volume>163</volume>(<issue>A3</issue>). doi:<pub-id pub-id-type="doi">10.5750/ijme.v163ia3.798</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Numfu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Riel</surname> <given-names>A</given-names></string-name>, <string-name><surname>Noel</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Virtual reality based digital chain for maintenance training</article-title>. <source>Procedia CIRP</source>. <year>2019</year>;<volume>84</volume>(<issue>7</issue>):<fpage>1069</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.procir.2019.04.268</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Geng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Using virtual reality to support the product&#x2019;s maintainability design: immersive maintainability verification and evaluation system</article-title>. <source>Comput Ind</source>. <year>2018</year>;<volume>101</volume>:<fpage>41</fpage>&#x2013;<lpage>50</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compind.2018.06.007</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Fusing wearable IMUs with multi-view images for human pose estimation: a geometric approach</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13&#x2013;19</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>2200</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Masood</surname> <given-names>T</given-names></string-name>, <string-name><surname>Egger</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Augmented reality in support of Industry 4.0&#x2014;Implementation challenges and success factors</article-title>. <source>Robot Comput-Integr Manuf</source>. <year>2019</year>;<volume>58</volume>(<issue>2</issue>):<fpage>181</fpage>&#x2013;<lpage>95</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.rcim.2019.02.003</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bhattacharya</surname> <given-names>B</given-names></string-name>, <string-name><surname>Winer</surname> <given-names>EH</given-names></string-name></person-group>. <article-title>Augmented reality via expert demonstration authoring (AREDA)</article-title>. <source>Comput Ind</source>. <year>2019</year>;<volume>105</volume>:<fpage>61</fpage>&#x2013;<lpage>79</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compind.2018.04.021</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Cao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hidalgo</surname> <given-names>G</given-names></string-name>, <string-name><surname>Simon</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>SE</given-names></string-name>, <string-name><surname>Sheikh</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>OpenPose: realtime multi-person 2D pose estimation using part affinity fields</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2021</year>;<volume>43</volume>:<fpage>172</fpage>&#x2013;<lpage>86</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>B</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deep high-resolution representation learning for human pose estimation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>5686</fpage>&#x2013;<lpage>96</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jiang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Kolotouros</surname> <given-names>N</given-names></string-name>, <string-name><surname>Pavlakos</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Daniilidis</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Coherent reconstruction of multiple humans from a single image</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13&#x2013;19</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>5578</fpage>&#x2013;<lpage>87</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>TY</given-names></string-name>, <string-name><surname>Maire</surname> <given-names>M</given-names></string-name>, <string-name><surname>Belongie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hays</surname> <given-names>J</given-names></string-name>, <string-name><surname>Perona</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ramanan</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <chapter-title>Microsoft COCO: common objects in context</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Fleet</surname> <given-names>D</given-names></string-name>, <string-name><surname>Pajdla</surname> <given-names>T</given-names></string-name>, <string-name><surname>Schiele</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tuytelaars</surname> <given-names>T</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2014</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2014</year>. p. <fpage>740</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Jin</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <chapter-title>Whole-body human pose estimation in the wild</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Vedaldi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bischof</surname> <given-names>H</given-names></string-name>, <string-name><surname>Brox</surname> <given-names>T</given-names></string-name>, <string-name><surname>Frahm</surname> <given-names>JM</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2020</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>196</fpage>&#x2013;<lpage>214</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58545-7_12</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fang</surname> <given-names>HS</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tai</surname> <given-names>YW</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>RMPE: regional multi-person pose estimation</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22&#x2013;29</conf-name>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>2353</fpage>&#x2013;<lpage>62</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>HS</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>CrowdPose: efficient crowded scenes pose estimation and a new benchmark</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>10855</fpage>&#x2013;<lpage>64</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>G&#x00FC;ler</surname> <given-names>RA</given-names></string-name>, <string-name><surname>Neverova</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kokkinos</surname> <given-names>I</given-names></string-name></person-group>. <article-title>DensePose: dense human pose estimation in the wild</article-title>. <comment>arXiv:1802.00434. 2018</comment>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Perazzi</surname> <given-names>F</given-names></string-name>, <string-name><surname>Pont-Tuset</surname> <given-names>J</given-names></string-name>, <string-name><surname>McWilliams</surname> <given-names>B</given-names></string-name>, <string-name><surname>Van Gool</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gross</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sorkine-Hornung</surname> <given-names>A</given-names></string-name></person-group>. <article-title>A Benchmark dataset and evaluation methodology for video object segmentation</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27&#x2013;30</conf-name>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>724</fpage>&#x2013;<lpage>32</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Damen</surname> <given-names>D</given-names></string-name>, <string-name><surname>Doughty</surname> <given-names>H</given-names></string-name>, <string-name><surname>Farinella</surname> <given-names>GM</given-names></string-name>, <string-name><surname>Fidler</surname> <given-names>S</given-names></string-name>, <string-name><surname>Furnari</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kazakos</surname> <given-names>E</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Scaling egocentric vision: the EPIC-KITCHENS dataset</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8&#x2013;14</conf-name>; <publisher-loc>Munich, Germany</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>17</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Toshev</surname> <given-names>A</given-names></string-name>, <string-name><surname>Szegedy</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Human pose estimation via deep neural networks</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2014 Jun 23&#x2013;28</conf-name>; <publisher-loc>Columbus, OH, USA</publisher-loc>. p. <fpage>1653</fpage>&#x2013;<lpage>60</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cheng</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>TS</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Scale-aware representation learning for bottom-up human pose estimation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13&#x2013;19</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>5385</fpage>&#x2013;<lpage>94</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fang</surname> <given-names>HS</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Xiu</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2023</year>;<volume>45</volume>(<issue>6</issue>):<fpage>7157</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2022.3222784</pub-id>; <pub-id pub-id-type="pmid">37145952</pub-id></mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name></person-group>. <chapter-title>Simple vision transformer baselines for human pose estimation</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Koyejo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mohamed</surname> <given-names>S</given-names></string-name>, <string-name><surname>Agarwal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Belgrave</surname> <given-names>D</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>K</given-names></string-name>, <string-name><surname>Oh</surname> <given-names>A</given-names></string-name></person-group>, editors. <source>Advances in neural information processing systems</source>. Vol. <volume>35</volume>. <publisher-loc>Red Hook, NY, USA</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>; <year>2022</year>. p. <fpage>38571</fpage>&#x2013;<lpage>84</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Mao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>C</given-names></string-name></person-group>. <article-title>FCPose: fully convolutional multi-person pose estimation with dynamic instance-aware convolutions</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20&#x2013;25</conf-name>; <publisher-loc>Nashville, TN, USA</publisher-loc>. p. <fpage>9034</fpage>&#x2013;<lpage>43</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>SimCC: a simple coordinate classification perspective for human pose estimation</article-title>. In: <conf-name>European Conference on Computer Vision</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2022</year>. p. <fpage>89</fpage>&#x2013;<lpage>106</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Zeng</surname> <given-names>A</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Ju</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Q</given-names></string-name></person-group>. <chapter-title>SmoothNet: a plug-and-play network for refining human poses in videos</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Avidan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Brostow</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ciss&#x00E9;</surname> <given-names>M</given-names></string-name>, <string-name><surname>Farinella</surname> <given-names>GM</given-names></string-name>, <string-name><surname>Hassner</surname> <given-names>T</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2022</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer Nature</publisher-name>; <year>2022</year>. p. <fpage>625</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-20065-6_36</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Bazarevsky</surname> <given-names>V</given-names></string-name>, <string-name><surname>Vakunov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tkachenka</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sung</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>MediaPipe hands: on-device real-time hand tracking</article-title>. <comment>arXiv:2006.10214. 2020</comment>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Moon</surname> <given-names>G</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>SI</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shiratori</surname> <given-names>T</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>KM</given-names></string-name></person-group>. <chapter-title>InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Vedaldi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bischof</surname> <given-names>H</given-names></string-name>, <string-name><surname>Brox</surname> <given-names>T</given-names></string-name>, <string-name><surname>Frahm</surname> <given-names>JM</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2020</source>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>; <year>2020</year>. p. <fpage>548</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58565-5_33</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name></person-group>. <article-title>RTMO: towards high-performance one-stage real-time multi-person pose estimation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024 Jun 16&#x2013;22</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>1491</fpage>&#x2013;<lpage>500</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A virtual-reality spatial matching algorithm and its application on equipment maintenance support: system design and user study</article-title>. <source>Signal Process Image Commun</source>. <year>2024</year>;<volume>129</volume>:<fpage>117188</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.image.2024.117188</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rane</surname> <given-names>M</given-names></string-name>, <string-name><surname>Date</surname> <given-names>A</given-names></string-name>, <string-name><surname>Deshmukh</surname> <given-names>V</given-names></string-name>, <string-name><surname>Deshpande</surname> <given-names>P</given-names></string-name>, <string-name><surname>Dharmadhikari</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Virtual gym tracker: AI pose estimation</article-title>. In: <conf-name>2024 Second International Conference on Advances in Information Technology (ICAIT); 2024 Jul 27&#x2013;27</conf-name>; <publisher-loc>Chikkamagaluru, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Urgo</surname> <given-names>M</given-names></string-name>, <string-name><surname>Berardinucci</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name></person-group>. <chapter-title>AI-based pose estimation of human operators in manufacturing environments</chapter-title>. In: <source>CIRP novel topics in production engineering</source>. Vol. <volume>1</volume>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2024</year>. p. <fpage>3</fpage>&#x2013;<lpage>38</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-54034-9_1</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Contextual instance decoupling for robust multi-person pose estimation</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 18&#x2013;24</conf-name>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>11060</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Martini</surname> <given-names>E</given-names></string-name>, <string-name><surname>Boldo</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bombieri</surname> <given-names>N</given-names></string-name></person-group>. <article-title>FLK: a filter with learned kinematics for real-time 3D human pose estimation</article-title>. <source>Signal Process</source>. <year>2024</year>;<volume>224</volume>(<issue>1</issue>):<fpage>109598</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.sigpro.2024.109598</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>