<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMES</journal-id>
<journal-id journal-id-type="nlm-ta">CMES</journal-id>
<journal-id journal-id-type="publisher-id">CMES</journal-id>
<journal-title-group>
<journal-title>Computer Modeling in Engineering &#x0026; Sciences</journal-title>
</journal-title-group>
<issn pub-type="epub">1526-1506</issn>
<issn pub-type="ppub">1526-1492</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">77956</article-id>
<article-id pub-id-type="doi">10.32604/cmes.2026.077956</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Dual-Attention Multi-Path Deep Learning Framework for Automated Wind Turbine Blade Fault Detection Using UAV Imagery</article-title>
<alt-title alt-title-type="left-running-head">Dual-Attention Multi-Path Deep Learning Framework for Automated Wind Turbine Blade Fault Detection Using UAV Imagery</alt-title>
<alt-title alt-title-type="right-running-head">Dual-Attention Multi-Path Deep Learning Framework for Automated Wind Turbine Blade Fault Detection Using UAV Imagery</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Alanazi</surname><given-names>Mubarak</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref rid="cor1" ref-type="corresp">&#x002A;</xref><email>alanazi3015@gmail.com</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Rashid</surname><given-names>Junaid</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Electrical Engineering Department, Jubail Industrial College, Royal Commission for Jubail &#x0026; Yanbu</institution>, <addr-line>Jubail Industrial City</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Artificial Intelligence and Data Science, Sejong University</institution>, <addr-line>Seoul</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Mubarak Alanazi. Email: <email>alanazi3015@gmail.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>26</day><month>2</month><year>2026</year>
</pub-date>
<volume>146</volume>
<issue>2</issue>
<elocation-id>17</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>12</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMES_77956.pdf"></self-uri>
<abstract>
<p>Wind turbine blade defect detection faces persistent challenges in separating small, low-contrast surface faults from complex backgrounds while maintaining reliability under variable illumination and viewpoints. Conventional image-processing pipelines struggle with scalability and robustness, and recent deep learning methods remain sensitive to class imbalance and acquisition variability. This paper introduces TurbineBladeDetNet, a convolutional architecture combining dual-attention mechanisms with multi-path feature extraction for detecting five distinct blade fault types. Our approach employs both channel-wise and spatial attention modules alongside an Albumentations-driven augmentation strategy to handle dataset imbalance and variability in capture conditions. The model achieves 97.14% accuracy, 98.65% precision, and 98.68% recall, yielding a 98.66% F1-score at a per-image inference time of 0.0110 s. Class-specific analysis shows uniformly high sensitivity and specificity: lightning damage reaches 99.80% sensitivity, precision, and F1-score, and crack achieves perfect precision and specificity with a 98.94% F1-score. Comparative evaluation against recent wind-turbine inspection approaches shows higher accuracy and F1-score for the proposed model. The resulting balance of sensitivity and specificity limits both missed defects and false alarms, supporting reliable deployment in routine unmanned aerial vehicle (UAV) inspection.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Wind energy</kwd>
<kwd>aerial imagery</kwd>
<kwd>surface condition monitoring</kwd>
<kwd>wind turbine blades</kwd>
<kwd>surface defect detection</kwd>
<kwd>attention mechanism</kwd>
<kwd>computer vision</kwd>
<kwd>deep learning</kwd>
<kwd>artificial intelligence</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Wind energy has become a central pillar of decarbonized power systems, and the reliability of utility-scale wind farms increasingly depends on timely detection of blade surface defects such as cracks, leading-edge erosion, coating loss, lightning-receptor damage, and contamination [<xref ref-type="bibr" rid="ref-1">1</xref>]. Manual rope access and ground-based binocular inspection are costly and slow, while onboard sensing can miss small surface anomalies; as a result, UAV-based visual inspection with automated analysis has emerged as a practical pathway to scalable condition monitoring of wind turbine blades (WTBs) [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>].</p>
<p>Early WTB inspection studies relied on handcrafted features and classical image processing, which proved sensitive to illumination, viewpoint, and background clutter [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>]. Subsequent deep learning approaches markedly improved accuracy but continue to face several recurring challenges. Yang et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] reported difficulties in handling variable capture conditions and blade geometry, particularly under unstable drone poses and on texture-poor blade surfaces that complicate image stitching and defect localization. Gohar et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] demonstrated that small, low-contrast defects in ultra-high-resolution drone imagery are easily confused with benign artifacts because of large variations in object scale and size. Spajic et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] highlighted class imbalance across eight defect categories and the need for robust transfer learning to compensate for limited annotated data. In parallel, one-stage detectors and attention-augmented backbones have been explored to meet real-time constraints and enhance fine-grained discrimination. Qiu et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] developed a YOLO-based small-object detection approach using multiscale feature pyramids to improve detection of tiny defects on WTB surfaces, achieving 91.3% average accuracy across the crack, oil pollution, and sand inclusion categories. Zhang et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] proposed SOD-YOLO, which adds a micro-scale detection layer and a Convolutional Block Attention Module (CBAM) to YOLOv5 and improves small-target detection to 95.1% mean average precision (mAP) while maintaining computational efficiency through channel pruning. Fu et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] introduced LE-YOLO, a lightweight YOLOv7-based architecture that integrates Ghost-Shuffle Convolution and a Simple Attention Mechanism to achieve real-time detection at 105.1 frames per second (FPS) while addressing minute defects in low-resolution imagery. Zhang et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] further enhanced multi-scale feature extraction by incorporating Coordinate Attention and a Reparameterized Generalized Feature Pyramid Network into YOLOv8, achieving 92% mAP at 120.5 FPS. Despite these advances, results remain sensitive to slicing heuristics, stitching quality, and the availability of balanced training data. Public UAV-captured datasets have become benchmarks for open evaluation [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>], but heterogeneous taxonomies and limited samples of rare defects complicate fair comparison.</p>
<p>This work addresses these limitations by introducing a dual-attention, multi-path convolutional architecture tailored for WTB surface inspection, combined with a bounding-box-aware augmentation pipeline to improve robustness under varied illumination, viewing angle, and background. The study adopts a single public dataset with a five-class re-annotation for consistent benchmarking and reports all results under a unified protocol [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. To situate the method within current practice, the proposed model is compared against widely used state-of-the-art vision backbones [<xref ref-type="bibr" rid="ref-13">13</xref>] and against three recent WTB inspection approaches representing transfer learning for aerial detection, slice-aided inference for ultra-high-resolution imagery, and stitching-driven pipelines [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>]. Because public implementations of these comparators are unavailable, each method was reproduced from the procedural details in its original paper, and all models were evaluated on the identical dataset with identical metrics to ensure a fair assessment.</p>
<p>Although dual-attention mechanisms such as CBAM and CNN architectures such as Inception have proven effective in general computer vision tasks [<xref ref-type="bibr" rid="ref-14">14</xref>], their direct application to wind turbine blade inspection faces three fundamental challenges: elongated defect geometries (cracks, lightning strikes) spanning multiple scales, low-contrast surface anomalies that are easily mistaken for benign texture variations, and severe class imbalance. The proposed approach introduces three task-specific adaptations: (1) cascaded four-stage dual-attention deployment at hierarchical feature resolutions (256, 512, 1024, 2048 channels), enabling simultaneous capture of fine surface textures and global blade structure; (2) bounding-box-aware augmentation that preserves defect spatial relationships under realistic UAV inspection variations; and (3) progressive channel-spatial attention fusion at each extraction stage for enhanced discrimination of thin, low-contrast defects. These design choices target blade-specific challenges rather than applying attention generically. The main contributions of this work are as follows:
<list list-type="bullet">
<list-item>
<p>This work introduces TurbineBladeDetNet, extending conventional CBAM through three blade-specific innovations: (i) cascaded four-stage attention at hierarchical resolutions (256&#x2013;2048 channels) capturing multi-scale defect features, (ii) bounding-box-aware augmentation preserving spatial context under UAV conditions, and (iii) progressive channel-spatial fusion enhancing discrimination of elongated, low-contrast defects. The architecture achieves 97.14% accuracy, 98.65% precision, 98.68% recall, and 98.66% F1-score on five defect categories.</p></list-item>
<list-item>
<p>To address severe class imbalance and environmental variability in the training data, we implement a systematic augmentation pipeline using Albumentations. It combines photometric adjustments, namely Contrast Limited Adaptive Histogram Equalization (CLAHE), contrast changes, and Hue, Saturation, and Value (HSV) shifts, with geometric transformations, occlusion simulation, and noise injection, expanding each training class to approximately 183 samples while preserving bounding-box integrity.</p></list-item>
<list-item>
<p>Comprehensive benchmarking is conducted against both recent WTB inspection methods and established convolutional neural network (CNN) architectures. Since original implementations were unavailable, prior methods were faithfully reproduced from their published specifications, ensuring all models were evaluated on identical data splits under consistent protocols.</p></list-item>
<list-item>
<p>Through systematic ablation studies, the study demonstrates the individual contributions of channel-wise and spatial attention mechanisms. Results show that channel attention alone improves baseline performance to 94.08% accuracy, spatial attention reaches 93.56%, while their cascaded integration achieves 97.14%, validating the architectural design choices.</p></list-item>
<list-item>
<p>The architecture processes individual images in 0.0110 s, enabling real-time inference during aerial surveys. This computational efficiency makes the approach suitable for operational deployment in wind farm inspection workflows.</p></list-item>
<list-item>
<p>Beyond aggregate performance, the study reports class-specific sensitivity, specificity, precision, and F1-scores across all defect categories. This granular analysis demonstrates consistent performance on both frequently occurring defects (erosion, paint-off) and rare but critical faults (cracks, lightning damage) under the standardized DTU evaluation protocol.</p></list-item>
</list></p>
<p>The remainder of this paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> surveys WTB inspection methods, attention mechanisms, and dataset challenges. <xref ref-type="sec" rid="s3">Section 3</xref> describes the dataset preparation, augmentation strategy, and dual-attention architecture. <xref ref-type="sec" rid="s4">Section 4</xref> benchmarks the approach against baseline and state-of-the-art models with per-class analysis. <xref ref-type="sec" rid="s5">Section 5</xref> discusses results and implications for UAV deployment. <xref ref-type="sec" rid="s6">Section 6</xref> summarizes contributions and future directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>Early studies into WTB inspection primarily employed classical image processing and handcrafted feature extraction techniques to detect surface cracks, erosion, and structural deformation [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-15">15</xref>]. Ruiz et al. applied texture-based statistical descriptors to identify fault regions, but their method was limited by its sensitivity to illumination and background variations in field images [<xref ref-type="bibr" rid="ref-1">1</xref>]. Similar morphological and edge-based approaches achieved acceptable performance under controlled lighting conditions yet failed to generalize to outdoor UAV scenarios with dynamic reflections and complex textures [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>]. Conventional machine learning models such as support vector machines (SVMs) and random forests (RFs) were later introduced to improve classification performance; however, their reliance on manually engineered features restricted scalability and adaptability to unseen defect patterns [<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<p>The shift to CNNs markedly improved WTB inspection accuracy [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>]. Denhof et al. demonstrated CNN-based optical surface inspection of rotor blades and reported higher precision than handcrafted baselines [<xref ref-type="bibr" rid="ref-2">2</xref>]. Early deep learning efforts on turbine imagery also showed the feasibility of learning geometric and textural cues directly from UAV-captured images [<xref ref-type="bibr" rid="ref-18">18</xref>]. Transfer learning further enhanced performance by leveraging pretrained backbones to extract semantic defect features and by fusing multiple learners for cross-domain recognition [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>].</p>
<p>Beyond WTB surface inspection, CNN architectures have proven effective in a range of energy-system fault detection tasks, including photovoltaic array diagnosis [<xref ref-type="bibr" rid="ref-21">21</xref>], power grid fault classification [<xref ref-type="bibr" rid="ref-22">22</xref>], and wind farm electrical fault detection [<xref ref-type="bibr" rid="ref-23">23</xref>], underscoring the broad applicability of CNNs to renewable energy infrastructure monitoring.</p>
<p>Pixel-level localization emerged as essential for detailed defect characterization. Zhang et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] introduced Mask-MRNet to address this need, applying instance segmentation for precise fault detection and spatial localization on blade surfaces. Their subsequent work [<xref ref-type="bibr" rid="ref-25">25</xref>] modified Mask R-CNN with image enhancement preprocessing and introduced evaluation metrics for multi-class fault scenarios, reporting improved boundary accuracy. To address computational constraints in drone-based inspection, Diaz and Tittus [<xref ref-type="bibr" rid="ref-26">26</xref>] implemented a Cascade Mask R-DSCNN architecture using depthwise separable convolutions, demonstrating reduced inference time while maintaining detection performance under motion blur conditions. These approaches transitioned fault detection from image-level classification to pixel-wise segmentation, enabling spatial fault mapping for structural assessment.</p>
<p>In parallel, one-stage detectors became the dominant choice for real-time UAV applications. Qiu et al. adapted YOLO to small-object detection on turbine surfaces and demonstrated accurate localization of micro-defects [<xref ref-type="bibr" rid="ref-7">7</xref>]. SOD-YOLO refined YOLOv5 for small targets through improved pyramid fusion, while LE-YOLO emphasized lightweight design to enable embedded inspection on drones without sacrificing accuracy [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>]. More recently, attention-augmented models integrated multi-scale feature enhancement and channel&#x2013;spatial attention to better capture subtle cracks, erosion and paint-off under environmental variability [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. These works reflect a trend toward compact yet accurate detectors tailored for UAV inspection.</p>
<p>The literature also documents complementary advances in data quality, capture, and preprocessing. Motion-blur restoration was shown to preserve defect integrity in flights with higher airspeeds and oblique views [<xref ref-type="bibr" rid="ref-27">27</xref>]. Public DTU-derived resources, including a fully annotated drone footage dataset with baselines for Faster R-CNN and YOLO, catalyzed reproducible evaluation and provided moving-drone scenarios for testing operational feasibility [<xref ref-type="bibr" rid="ref-16">16</xref>]. Nevertheless, these studies consistently point to practical difficulties that affect reported performance, notably non-uniform class definitions and evolving annotation practices, which complicate direct mAP comparisons [<xref ref-type="bibr" rid="ref-10">10</xref>].</p>
<p>Despite this progress, most studies rely on DTU-family imagery with imbalanced distributions among &#x201C;damage&#x201D;, &#x201C;dirt&#x201D;, and &#x201C;intact&#x201D;, and they often diverge in class taxonomies and annotation protocols, which increases variability across reported results [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>]. Data augmentation is frequently limited to basic geometric or photometric transforms and is rarely validated systematically across changing inspection conditions [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>]. Architecturally, many detectors remain single-path, which restricts their ability to reconcile global context with fine-scale surface cues, and attention is often shallow or applied at a single stage, limiting discriminative power for thin, low-contrast defects [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. This work therefore employs a multi-path CNN with dual (channel and spatial) attention and a rigorously tuned Albumentations-based augmentation pipeline [<xref ref-type="bibr" rid="ref-29">29</xref>] to increase invariance to illumination changes, viewpoint differences, and background clutter, thereby improving robustness on UAV-based imagery.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Materials and Methods</title>
<sec id="s3_1">
<label>3.1</label>
<title>Framework Overview</title>
<p>The proposed approach to automated wind turbine blade fault detection is illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, which details the InceptionV3-based architecture with four dual-attention blocks operating at multiple feature scales. The framework operates in three stages: systematic dataset balancing via Albumentations [<xref ref-type="bibr" rid="ref-29">29</xref>], feature extraction using pre-trained InceptionV3, and classification through a dual-attention mechanism combining channel-wise and spatial modules. The framework targets five distinct defect categories while maintaining computational efficiency suitable for operational wind farm deployment.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>TurbineBladeDetNet architecture. The framework employs an InceptionV3 backbone with four dual-attention blocks (channel and spatial attention at 256, 512, 1024, and 2048 channels) for multi-scale feature refinement, followed by global pooling and dense layers for five-class defect classification.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-1.tif"/>
</fig>
<p>Dataset imbalance poses a significant challenge in this domain. The augmentation strategy addresses this through CLAHE for contrast enhancement, geometric transformations (rotation, scaling, flipping), occlusion simulation using coarse dropout, and controlled noise injection. These operations expand underrepresented classes while preserving annotation integrity. The dual-attention mechanism subsequently refines extracted features, emphasizing discriminative patterns necessary for distinguishing subtle surface defects across blade categories.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Data Collection and Preprocessing</title>
<p>This study employs a publicly available UAV image corpus of wind turbine blades [<xref ref-type="bibr" rid="ref-11">11</xref>] with community re-annotations into five detection categories [<xref ref-type="bibr" rid="ref-5">5</xref>] (<xref ref-type="fig" rid="fig-2">Fig. 2</xref>). The class taxonomy used throughout this work is: missing teeth, erosion, paint-off, lightning damage, and crack. The original distribution is imbalanced, with 107 images for missing teeth, 127 for erosion, 35 for paint-off, 27 for lightning damage, and 35 for crack, totaling 331 images. To prevent data leakage, the dataset is split before any augmentation is applied: stratified sampling partitions it into training (265 images), validation (33 images), and test (33 images) sets, an approximately 80:10:10 split that preserves the original class proportions in every partition.</p>
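<p>The partition sizes above (265/33/33) are reported, but the per-class allocation is not. The plain-Python sketch below is a minimal illustration. It assumes largest-remainder rounding of each class's proportional quota and produces a stratified split of exactly these sizes from the reported class counts. The function names, class keys, and seed value are hypothetical and do not come from the authors' implementation. scikit-learn's <monospace>train_test_split</monospace> with <monospace>stratify=labels</monospace> and integer <monospace>test_size</monospace> values is a common library alternative, although its rounding may assign slightly different per-class counts.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random
from collections import defaultdict


def largest_remainder_alloc(counts, n_select):
    """Split n_select slots across classes in proportion to counts (largest-remainder rounding)."""
    total = sum(counts.values())
    quotas = {c: n * n_select / total for c, n in counts.items()}
    alloc = {c: math.floor(q) for c, q in quotas.items()}
    leftover = n_select - sum(alloc.values())
    # Hand the remaining slots to the classes with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def stratified_split(labels, n_val, n_test, seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)  # fixed seed -> reproducible partition
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    counts = {c: len(idxs) for c, idxs in by_class.items()}
    val_alloc = largest_remainder_alloc(counts, n_val)
    test_alloc = largest_remainder_alloc(counts, n_test)
    train, val, test = [], [], []
    for c, idxs in by_class.items():
        idxs = idxs[:]  # shuffle a copy, not the caller's data
        rng.shuffle(idxs)
        n_v, n_t = val_alloc[c], test_alloc[c]
        val.extend(idxs[:n_v])
        test.extend(idxs[n_v:n_v + n_t])
        train.extend(idxs[n_v + n_t:])
    return train, val, test


# Per-class image counts reported for the five-class re-annotation.
COUNTS = {"missing_teeth": 107, "erosion": 127, "paint_off": 35,
          "lightning_damage": 27, "crack": 35}
labels = [c for c, n in COUNTS.items() for _ in range(n)]
train, val, test = stratified_split(labels, n_val=33, n_test=33)
print(len(train), len(val), len(test))  # 265 33 33
```
]]></preformat>
<p>Under these assumptions, the validation and test sets each receive 11 missing-teeth, 13 erosion, 3 paint-off, 3 lightning-damage, and 3 crack images. That leaves 85, 101, 29, 21, and 29 images of those classes for training.</p>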
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Representative instances of the five blade defect categories: (<bold>A</bold>) missing teeth along the leading edge, (<bold>B</bold>) surface erosion, (<bold>C</bold>) paint coating degradation, (<bold>D</bold>) lightning receptor damage, and (<bold>E</bold>) structural cracking.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-2.tif"/>
</fig>
<p>The original distribution is severely imbalanced, with class sizes ranging from 27 samples (lightning damage) to 127 samples (erosion), an imbalance ratio of 4.70:1 (127/27), as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Such imbalance complicates training because underrepresented classes contribute little to the loss and risk being ignored during optimization, which biases predictions toward majority classes. To counter this, the bounding-box-aware augmentation pipeline expands every training class to 183 samples, equalizing class representation during training, while the validation and test sets are kept strictly separate and unaugmented.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Dataset class distribution before and after augmentation. The original dataset is severely imbalanced (27&#x2013;127 samples per class); augmentation balances all classes to 183 samples.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-3.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-4">Fig. 4</xref> demonstrates the systematic augmentation pipeline applied to training samples. Each augmentation technique addresses a specific challenge encountered during UAV-based blade inspection. CLAHE enhances low-contrast regions that arise under varying illumination. Geometric transformations (rotation, scaling) simulate diverse camera viewing angles and distances. HSV color-space adjustments replicate sensor variations and atmospheric effects across inspection flights. Gaussian noise injection models camera sensor noise under suboptimal conditions. Occlusion simulation represents partial blade visibility caused by inspection positioning or environmental obstacles. The combined augmentations generate diverse training samples while preserving bounding-box integrity and the spatial relationships of defects. All transformations are applied stochastically with a probability of 0.5 during training, introducing natural variation while limiting over-augmentation artifacts.</p>
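<p>For geometric transformations, preserving bounding-box integrity means each box must be recomputed from its transformed corners, because a rotated rectangle is no longer axis-aligned. The plain-Python sketch below illustrates this bookkeeping for a rotation combined with an isotropic scale about the image center, the kind of transform shown in Fig. 4c. It assumes pixel boxes in (x_min, y_min, x_max, y_max) order and a clockwise-on-screen sign convention for positive angles. The function name and example coordinates are hypothetical. In Albumentations, this bookkeeping is performed automatically when a pipeline is composed with <monospace>A.Compose(transforms, bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]))</monospace>.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def rotate_scale_bbox(bbox, angle_deg, scale, width, height):
    """Map a (x_min, y_min, x_max, y_max) pixel box through a rotation by angle_deg
    and an isotropic scale about the image center.

    Returns the enclosing axis-aligned box, clipped to the image bounds.
    Assumed convention: with the y axis pointing down, a positive angle
    rotates clockwise on screen (image libraries differ on this sign).
    """
    cx, cy = width / 2.0, height / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    x_min, y_min, x_max, y_max = bbox
    xs, ys = [], []
    # Transform all four corners; the new box is the min/max envelope of the results.
    for x, y in ((x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)):
        dx, dy = x - cx, y - cy
        xs.append(cx + scale * (dx * cos_t - dy * sin_t))
        ys.append(cy + scale * (dx * sin_t + dy * cos_t))
    # Clip so that scaling cannot push the box outside the image.
    return (max(0.0, min(xs)), max(0.0, min(ys)),
            min(float(width), max(xs)), min(float(height), max(ys)))


# Fig. 4(c)-style parameters: 15 degree rotation and 1.1x scale
# (the box and image size are illustrative).
print(rotate_scale_bbox((100, 150, 220, 190), angle_deg=15, scale=1.1, width=640, height=480))
```
]]></preformat>
<p>Taking the envelope of the rotated corners slightly enlarges the box. This keeps every defect pixel inside the annotation at the cost of some extra background, and clipping keeps the box within the image after scaling.</p>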
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Augmentation pipeline on a representative blade sample. (<bold>a</bold>) Original image. (<bold>b</bold>) CLAHE contrast enhancement. (<bold>c</bold>) Geometric transformation (<inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msup><mml:mn>15</mml:mn><mml:mrow><mml:mo>&#x2218;</mml:mo></mml:mrow></mml:msup></mml:math></inline-formula> rotation, 1.1<inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> scale). (<bold>d</bold>) HSV color adjustments. (<bold>e</bold>) Gaussian noise. (<bold>f</bold>) Occlusion simulation. (<bold>g</bold>) Combined augmentations. All transformations preserve bounding-box integrity.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-4.tif"/>
</fig>
<p>Subsequently, only the training set is augmented, using a bounding-box-aware augmentation pipeline implemented with Albumentations [<xref ref-type="bibr" rid="ref-29">29</xref>]. The pipeline incorporates three categories of transformations to replicate real-world UAV capture conditions. Photometric adjustments include CLAHE, brightness and contrast modifications, and HSV color-space shifts. Geometric operations apply affine and perspective transformations to account for varying camera angles. Degradation operations simulate occlusion through coarse dropout, while Gaussian noise and motion blur replicate sensor artifacts and flight-induced image degradation. Because every class in the training partition falls below the target of 183 samples, each class is expanded to that target, yielding a balanced augmented training corpus of 915 images (5 &#x00D7; 183).</p>
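<p>The balancing arithmetic can be made explicit with a small planning step. The sketch below is not the authors' code. It computes how many augmented copies each training class needs to reach 183 samples, draws a source image for each copy from the same class, and switches each transform on independently with probability 0.5, as stated for the pipeline in Fig. 4. The per-class training counts are assumed (a proportional allocation of the 265 training images), because the paper reports only the total. The transform names are illustrative labels rather than Albumentations class names, and each planned record would then be passed to the actual bounding-box-aware pipeline.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random

# Illustrative labels for the transform families named in the text
# (not Albumentations class names).
TRANSFORMS = ["clahe", "brightness_contrast", "hsv_shift", "affine",
              "perspective", "coarse_dropout", "gauss_noise", "motion_blur"]


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


def build_schedule(train_counts, target=183, p=0.5, seed=42):
    """Plan (class, source_index, transforms) records that lift every class to `target`.

    Source images are drawn with replacement from the class's own training images,
    and each transform is switched on independently with probability p.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible plan
    schedule = []
    for cls, n in train_counts.items():
        for _ in range(max(0, target - n)):
            src = rng.randrange(n)  # index of an original training image of this class
            ops = [t for t in TRANSFORMS if rng.random() < p]
            schedule.append((cls, src, ops))
    return schedule


# Assumed per-class training counts: a proportional allocation of the 265 training images.
train_counts = {"missing_teeth": 85, "erosion": 101, "paint_off": 29,
                "lightning_damage": 21, "crack": 29}
schedule = build_schedule(train_counts)
balanced = {c: n + sum(1 for rec in schedule if rec[0] == c)
            for c, n in train_counts.items()}
print(round(imbalance_ratio(train_counts), 2), round(imbalance_ratio(balanced), 2))  # 4.81 1.0
print(len(schedule), sum(balanced.values()))  # 650 915
```
]]></preformat>
<p>With these assumed counts, 650 synthetic images raise the training set from 265 to 915 images. They also reduce the training-set imbalance ratio from about 4.81:1 to 1:1. Because each transform is sampled independently, a small fraction of records may receive no transform at all; a practical pipeline might resample such records to avoid exact duplicates.</p>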
<p>The validation and test sets remain unaugmented and contain only original UAV-captured images (33 samples each), so evaluation is performed exclusively on unseen blade images. This split-first protocol prevents data leakage: every augmented image is derived from a training image, so no validation or test image, or any variant of one, ever appears in the training set, providing a rigorous assessment of generalization. All preprocessing preserves the native image resolution and the released detection format, and fixed random seeds are used to ensure exact reproducibility.</p>
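<p>The split-first protocol can also be enforced programmatically. The sketch below raises an error if the original splits overlap or if any augmented image was derived from a validation or test image. The image identifiers and the function name are hypothetical. It assumes each augmented image records the identifier of the original image it was generated from.</p>
<preformat preformat-type="code"><![CDATA[
```python
def check_no_leakage(train_ids, val_ids, test_ids, augmented_sources):
    """Verify the split-first protocol.

    augmented_sources maps each synthetic image ID to the ID of the original
    image it was generated from (an assumed bookkeeping convention).
    """
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    # The three original partitions must be pairwise disjoint.
    if not (train.isdisjoint(val) and train.isdisjoint(test) and val.isdisjoint(test)):
        raise ValueError("original images appear in more than one split")
    # Every augmented image must descend from a training image.
    leaked = sorted(aug for aug, src in augmented_sources.items() if src not in train)
    if leaked:
        raise ValueError(f"augmented images derived from non-training images: {leaked}")
    return True


train_ids = ["img_001", "img_002"]
val_ids = ["img_101"]
test_ids = ["img_201"]
print(check_no_leakage(train_ids, val_ids, test_ids, {"aug_0001": "img_001"}))  # True
try:
    check_no_leakage(train_ids, val_ids, test_ids, {"aug_0002": "img_201"})
except ValueError as err:
    print(err)  # augmented images derived from non-training images: ['aug_0002']
```
]]></preformat>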
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Network Architecture</title>
<p>The dual-attention architecture combines a pre-trained InceptionV3 backbone with channel-wise and spatial attention modules [<xref ref-type="bibr" rid="ref-30">30</xref>] for blade defect classification. Algorithm 1 formalizes the approach. The design comprises three components: a feature extraction backbone, dual attention mechanisms that recalibrate and localize discriminative patterns, and a regularized classification head. InceptionV3 extracts multi-scale features through parallel convolutional paths (1 <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, 5 <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5 filters), capturing blade defect characteristics at multiple abstraction levels. Channel attention then emphasizes informative feature maps, spatial attention identifies defect locations, and the classification head produces final category predictions with dropout regularization to prevent overfitting.</p>
<fig id="fig-12">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-12.tif"/>
</fig>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Feature Extraction Backbone</title>
<p>We employ InceptionV3 pre-trained on ImageNet for feature extraction. The architecture uses parallel convolutional paths with filters of varying sizes (1 <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, 5 <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5) to capture multi-scale representations. Factorized convolutions reduce computational cost while preserving representational capacity, and batch normalization stabilizes training. The original fully connected classification head is removed, and only the convolutional blocks are retained for feature extraction. The extracted feature maps <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> encode hierarchical blade defect characteristics, where <italic>C</italic> denotes the channel count and <italic>H</italic>, <italic>W</italic> denote the spatial dimensions. These multi-scale features provide the input to the dual-attention mechanism described in the following sections.</p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Channel Attention</title>
<p>Channel attention reweights each feature map according to its relevance to defect classification, an approach that has proven effective in various vision applications [<xref ref-type="bibr" rid="ref-31">31</xref>]. The channel attention map is computed as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>c</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> denotes the sigmoid function and <italic>MLP</italic> is a multi-layer perceptron with bottleneck structure:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">x</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">W</mml:mtext></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext mathvariant="bold">W</mml:mtext></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">x</mml:mtext></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Parameters <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">W</mml:mtext></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">W</mml:mtext></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are learned during training, with <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula> as the reduction ratio and <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> as ReLU activation. Global average pooling compresses spatial information per channel into descriptor <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">z</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>C</mml:mi></mml:msup></mml:math></inline-formula>:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>F</mml:mi><mml:mi>c</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Max pooling extracts maximum activations per channel, yielding descriptor <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">z</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mi>C</mml:mi></mml:msup></mml:math></inline-formula>:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mi>F</mml:mi><mml:mi>c</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>The channel attention map <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> rescales each input feature map through element-wise multiplication, broadcast across all <italic>H</italic> &#x00D7; <italic>W</italic> spatial positions:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x2297;</mml:mo><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p>This emphasizes channels containing discriminative patterns: localized temperature shifts indicating delamination, irregular intensity distributions signaling structural damage, and surface texture changes from erosion.</p>
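<p>Eqs. (1) to (5) can be traced in a few lines of NumPy. The sketch below is a framework-agnostic illustration rather than the authors' PyTorch module; the function name, the bias-free layers (matching Eq. (2)), and the 8 &#x00D7; 8 spatial size (the final InceptionV3 map size for a 299 &#x00D7; 299 input) are assumptions, and the weights are random instead of learned.</p>
```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Channel attention following Eqs. (1)-(5).

    F:  feature maps, shape (C, H, W)
    W1: bottleneck weights, shape (C // r, C)
    W2: expansion weights, shape (C, C // r)
    returns the refined features F' = M_c * F, shape (C, H, W)
    """
    z_avg = F.mean(axis=(1, 2))  # Eq. (3): global average pooling, shape (C,)
    z_max = F.max(axis=(1, 2))   # Eq. (4): global max pooling, shape (C,)

    def mlp(z: np.ndarray) -> np.ndarray:
        return W2 @ np.maximum(W1 @ z, 0.0)  # Eq. (2): W2 ReLU(W1 z), shared by both descriptors

    M_c = sigmoid(mlp(z_avg) + mlp(z_max))  # Eq. (1): one weight in (0, 1) per channel
    return M_c[:, None, None] * F           # Eq. (5): broadcast over H and W


C, r = 2048, 16
rng = np.random.default_rng(0)
F = rng.standard_normal((C, 8, 8))
W1 = rng.standard_normal((C // r, C)) / np.sqrt(C)
W2 = rng.standard_normal((C, C // r)) / np.sqrt(C // r)
F_prime = channel_attention(F, W1, W2)
print(F_prime.shape)  # (2048, 8, 8)
```
<p>Because the same MLP processes both pooled descriptors, the parameter cost is independent of which pooling is used; every channel is scaled by a single factor, so the spatial pattern inside each map is left unchanged.</p>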
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Spatial Attention</title>
<p>Spatial attention identifies defect locations within the channel-adjusted features. Channel information is compressed via average and max pooling operations along the channel axis, creating two 2D maps. Concatenating and convolving these maps produces spatial attention <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Here <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msup><mml:mi>f</mml:mi><mml:mrow><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is a <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mn>7</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>7</mml:mn></mml:math></inline-formula> convolution that maps the two concatenated pooling maps to a single channel while preserving the <italic>H</italic> &#x00D7; <italic>W</italic> resolution, so each attention value aggregates a 7 &#x00D7; 7 neighborhood of spatial context; [;] denotes concatenation along the channel axis. The channel-wise pooling operations are defined as:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>C</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Spatial attention scales the channel-adjusted features via element-wise multiplication:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">M</mml:mtext></mml:mrow></mml:mrow><mml:mi>s</mml:mi></mml:msub><mml:mo>&#x2297;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">F</mml:mtext></mml:mrow></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup></mml:math></disp-formula></p>
<p>Cascading the two attention types lets the network first determine which channels carry useful information and then locate where that information appears spatially. This sequential refinement is intended to help the network respond to subtle variations in blade surface defects.</p>
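<p>Eqs. (6) to (9) admit a similarly short NumPy sketch. As before, this is an illustrative assumption rather than the authors' code: <monospace>spatial_attention</monospace> is a hypothetical name, the 7 &#x00D7; 7 kernel is random and bias-free, zero padding of 3 keeps the <italic>H</italic> &#x00D7; <italic>W</italic> size, and 64 channels are used instead of 2048 only to keep the example fast.</p>
```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(F_prime: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Spatial attention following Eqs. (6)-(9).

    F_prime: channel-refined features, shape (C, H, W)
    kernel:  convolution weights, shape (2, 7, 7), one 7x7 slice per pooled map
    returns F'' = M_s * F_prime, shape (C, H, W)
    """
    pooled = np.stack([F_prime.mean(axis=0),   # Eq. (7): average over channels, (H, W)
                       F_prime.max(axis=0)])   # Eq. (8): max over channels, (H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))  # zero padding keeps H x W
    H, W = F_prime.shape[1:]
    logits = np.zeros((H, W))
    for i in range(k):
        for j in range(k):
            # single-output-channel 7x7 convolution over the two concatenated maps
            logits += np.einsum("c,chw->hw", kernel[:, i, j], padded[:, i:i + H, j:j + W])
    M_s = sigmoid(logits)              # Eq. (6): one weight in (0, 1) per position
    return M_s[None, :, :] * F_prime   # Eq. (9): broadcast over channels


rng = np.random.default_rng(0)
F_prime = rng.standard_normal((64, 8, 8))
kernel = 0.05 * rng.standard_normal((2, 7, 7))
F_double_prime = spatial_attention(F_prime, kernel)
print(F_double_prime.shape)  # (64, 8, 8)
```
<p>Every channel at a given position is multiplied by the same factor, so this stage complements channel attention: one decides which maps matter, the other where they matter.</p>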
</sec>
<sec id="s3_3_4">
<label>3.3.4</label>
<title>Classification Head and Loss Function</title>
<p>The attention-refined features pass through a classification head that combines discrimination with regularization. Global average pooling reduces each feature map to a single value, yielding a fixed-length <italic>C</italic>-dimensional vector; this removes spatial dependence and reduces the number of parameters in subsequent layers. A 512-unit dense layer with ReLU activation processes this vector, followed by dropout (rate 0.5) to reduce overfitting. A final softmax layer outputs probabilities over the five defect categories. Training uses categorical cross-entropy with label smoothing:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>[</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mfrac><mml:mi>&#x03B1;</mml:mi><mml:mi>K</mml:mi></mml:mfrac><mml:mo>]</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <italic>N</italic> is the batch size, <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mn>5</mml:mn></mml:math></inline-formula> is the number of classes, <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the one-hot ground-truth label, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the predicted probability, and <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> is the smoothing parameter. Label smoothing replaces each one-hot target with a mixture of that target and a uniform distribution over the classes, which discourages overconfident predictions and can improve robustness to defect patterns not seen during training.</p>
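<p>A minimal NumPy sketch of the head and of Eq. (10) follows. Layer sizes mirror the text (512 hidden units, dropout rate 0.5, five classes); the function names, the inverted-dropout formulation, and the random weights are illustrative assumptions. The loss is summed over the batch, as Eq. (10) is written.</p>
```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def classification_head(F2, W_h, b_h, W_o, b_o, train=False, drop=0.5, rng=None):
    """GAP -> Dense(512, ReLU) -> Dropout -> Dense(K) -> softmax.

    F2: attention-refined features, shape (N, C, H, W)
    W_h: (C, 512), b_h: (512,), W_o: (512, K), b_o: (K,)
    returns class probabilities, shape (N, K)
    """
    v = F2.mean(axis=(2, 3))              # global average pooling -> (N, C)
    h = np.maximum(v @ W_h + b_h, 0.0)    # 512-unit dense layer with ReLU
    if train:
        keep = rng.random(h.shape) >= drop  # inverted dropout: drop each unit with probability `drop`
        h = h * keep / (1.0 - drop)         # rescale kept units so the expected activation is unchanged
    return softmax(h @ W_o + b_o)          # (N, K)


def label_smoothed_ce(p: np.ndarray, y: np.ndarray, alpha: float = 0.1) -> float:
    """Eq. (10): -sum_i sum_k [(1 - alpha) * y_ik + alpha / K] * log p_ik, summed over the batch."""
    K = p.shape[1]
    target = (1.0 - alpha) * y + alpha / K  # one-hot target mixed with a uniform distribution
    return float(-(target * np.log(p)).sum())


rng = np.random.default_rng(0)
N, C, K = 4, 2048, 5
F2 = rng.standard_normal((N, C, 8, 8))
W_h, b_h = 0.01 * rng.standard_normal((C, 512)), np.zeros(512)
W_o, b_o = 0.01 * rng.standard_normal((512, K)), np.zeros(K)
p = classification_head(F2, W_h, b_h, W_o, b_o)  # inference mode: dropout inactive
y = np.eye(K)[[0, 1, 2, 4]]                      # one-hot labels for four images
print(p.shape, label_smoothed_ce(p, y))
```
<p>In PyTorch, <monospace>torch.nn.CrossEntropyLoss(label_smoothing=0.1)</monospace> applies the same target mixture to raw logits, but by default it averages rather than sums over the batch, which rescales the loss without changing its minimizer.</p>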
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Experimental Setup</title>
<p>The dual-attention architecture is implemented in PyTorch using pre-trained InceptionV3 as the feature extraction backbone. The balanced dataset from <xref ref-type="sec" rid="s3_2">Section 3.2</xref> is split via stratified sampling (80:10:10) into training (265 images), validation (33 images), and test (33 images) sets, maintaining class proportions across partitions.</p>
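<p>A stratified 80:10:10 split can be sketched in plain Python as follows. The function name and the per-class rounding rule are assumptions; because rounding is applied within each class, partition sizes can differ by a few images from a split computed on the pooled total.</p>
```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that preserve per-class proportions.

    Each class is shuffled independently; the first round(0.8 n) indices go to
    training, the next round(0.1 n) to validation, and the remainder to test.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# toy label list: 20 erosion images and 10 crack images
labels = ["erosion"] * 20 + ["crack"] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 24 3 3
```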
<p>Training employs the Adam optimizer with a learning rate of <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and a weight decay of <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> for L2 regularization. Experiments use an NVIDIA Tesla V100 GPU (32 GB memory) with a batch size of 8; inference time is measured with single-image batches on the same hardware. The loss function is categorical cross-entropy with label smoothing (<inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>), which reduces prediction overconfidence and improves calibration. Memory consumption is reduced through gradient checkpointing and automatic mixed precision (AMP), which maintains numerical stability via dynamic loss scaling.</p>
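<p>Per-image inference time with single-image batches could be measured with a generic harness such as the sketch below. <monospace>mean_latency</monospace> is a hypothetical helper and the warm-up count is an assumption. When timing a PyTorch model on a GPU, <monospace>torch.cuda.synchronize()</monospace> should be called before reading the clock, because CUDA kernels launch asynchronously and would otherwise be under-timed.</p>
```python
import statistics
import time


def mean_latency(predict, inputs, warmup=10):
    """Average wall-clock seconds per single-input call to `predict`.

    predict: any callable taking one input (e.g. a model wrapped for batch size 1)
    inputs:  iterable of single inputs
    """
    inputs = list(inputs)
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls (caches, lazy initialization) are not timed
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# stand-in workload; a real measurement would wrap the trained model instead
latency = mean_latency(lambda x: sum(v * v for v in x), [list(range(10_000))] * 20)
print(f"{latency:.6f} s/image, {1.0 / latency:.1f} FPS")
```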
<p>A ReduceLROnPlateau scheduler monitors the validation loss and halves the learning rate (factor 0.5) after 5 epochs without improvement. Early stopping with a patience of 15 epochs limits overfitting, and the checkpoint with the lowest validation loss is retained for final evaluation.</p>
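<p>The scheduling and stopping rules can be sketched as a small plain-Python controller. This is an illustrative assumption, not the PyTorch class: <monospace>torch.optim.lr_scheduler.ReduceLROnPlateau</monospace> (mode "min", factor 0.5, patience 5) covers the learning-rate part, but its improvement threshold and patience counting may differ slightly from this sketch. <monospace>PlateauController</monospace> is a hypothetical name.</p>
```python
class PlateauController:
    """Plateau learning-rate halving plus early stopping on validation loss.

    The learning rate is multiplied by `factor` once `lr_patience` consecutive
    epochs pass without a new best validation loss; training stops once
    `stop_patience` consecutive epochs pass without improvement. The best epoch
    is recorded so that its checkpoint can be restored for final evaluation.
    """

    def __init__(self, lr=1e-4, factor=0.5, lr_patience=5, stop_patience=15):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.since_best = 0    # epochs since the last improvement (drives early stopping)
        self.since_lr_cut = 0  # non-improving epochs since the last improvement or LR cut

    def step(self, epoch, val_loss):
        """Update state after one epoch; return True if training should stop."""
        if val_loss >= self.best_loss:
            self.since_best += 1
            self.since_lr_cut += 1
            if self.since_lr_cut == self.lr_patience:
                self.lr *= self.factor  # halve the learning rate
                self.since_lr_cut = 0
        else:
            self.best_loss = val_loss
            self.best_epoch = epoch     # in practice: save the model checkpoint here
            self.since_best = 0
            self.since_lr_cut = 0
        return self.since_best >= self.stop_patience


ctrl = PlateauController()
losses = [1.0, 0.8, 0.7] + [0.75] * 20  # improvement stops after epoch 2
for epoch, loss in enumerate(losses):
    if ctrl.step(epoch, loss):
        break
print(epoch, ctrl.best_epoch, ctrl.lr)
```
<p>With this loss curve the learning rate is halved at epochs 7, 12, and 17, and training stops at epoch 17 with epoch 2 as the retained checkpoint.</p>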
<p>For benchmarking, the study compares against established CNN architectures [<xref ref-type="bibr" rid="ref-13">13</xref>]: VGG16 and VGG19 [<xref ref-type="bibr" rid="ref-32">32</xref>], MobileNetV2 [<xref ref-type="bibr" rid="ref-33">33</xref>], ResNet50V2 [<xref ref-type="bibr" rid="ref-34">34</xref>], InceptionV3 [<xref ref-type="bibr" rid="ref-35">35</xref>], InceptionResNetV2 [<xref ref-type="bibr" rid="ref-36">36</xref>], Xception [<xref ref-type="bibr" rid="ref-37">37</xref>], DenseNet121 and DenseNet201 [<xref ref-type="bibr" rid="ref-38">38</xref>], ResNet152V2, and EfficientNetV2B3 [<xref ref-type="bibr" rid="ref-39">39</xref>]. Each baseline uses the same data splits and receives its own hyperparameter optimization, and applying identical preprocessing and evaluation metrics to all architectures minimizes confounding factors.</p>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Evaluation Protocol</title>
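<p>Performance is quantified using standard classification metrics computed from true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). For the five-class problem, these counts are obtained per class in a one-vs-rest manner, treating the class of interest as positive and the remaining four classes as negative. Accuracy measures overall correctness: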
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mtext>Accuracy</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>TN</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>TN</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Precision indicates the fraction of correct positive predictions:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Recall captures how completely each defect type is detected:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>F1-score balances precision and recall through their harmonic mean:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mrow><mml:mtext>F1-score</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Specificity quantifies correct negative identification:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mtext>Specificity</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TN</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TN</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Both class-averaged and per-class metrics are reported. Per-class results show how effectively rare defects such as cracks and lightning damage are detected compared with common faults such as erosion. Inference time is the average processing time per image in seconds and indicates suitability for real-time UAV inspection workflows.</p>
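<p>Given a confusion matrix, the one-vs-rest counts and Eqs. (12) to (15) can be computed as in the NumPy sketch below. The function name and the convention that rows are true classes and columns are predictions are assumptions; class-averaged values are the means of the per-class arrays, and a class that is never predicted would yield an undefined (NaN) precision. The returned accuracy is the overall fraction of correctly classified images, the usual multi-class accuracy; applying Eq. (11) one-vs-rest would instead give a per-class value.</p>
```python
import numpy as np


def per_class_metrics(cm) -> dict:
    """One-vs-rest metrics from a K x K confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class k but belonging to another class
    fn = cm.sum(axis=1) - tp   # belonging to class k but predicted as another class
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)                            # Eq. (12)
    recall = tp / (tp + fn)                               # Eq. (13)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (14)
    specificity = tn / (tn + fp)                          # Eq. (15)
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity,
            "accuracy": tp.sum() / cm.sum()}              # overall multi-class accuracy


m = per_class_metrics([[5, 1, 0],
                       [0, 4, 1],
                       [0, 0, 6]])
macro = {name: float(np.mean(v)) for name, v in m.items() if name != "accuracy"}
print(m["recall"], macro["f1"], m["accuracy"])
```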
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Results</title>
<p>TurbineBladeDetNet is evaluated on the DTU dataset [<xref ref-type="bibr" rid="ref-11">11</xref>] using balanced five-class splits from <xref ref-type="sec" rid="s3_2">Section 3.2</xref>. Evaluation employs F1-score, precision, recall, accuracy, and inference time. Comparisons include established CNN architectures [<xref ref-type="bibr" rid="ref-13">13</xref>] and recent WTB inspection approaches [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
<sec id="s4_1">
<label>4.1</label>
<title>Comparison with State-of-the-Art Baseline Models</title>
<p>TurbineBladeDetNet achieves a 98.66% F1-score with 98.65% precision, 98.68% recall, and 97.14% accuracy at 0.0110 s per image (91 FPS), as shown in <xref ref-type="table" rid="table-1">Table 1</xref>. It exceeds every baseline on all four classification metrics while also having the lowest inference time.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Performance comparison between TurbineBladeDetNet and baseline architectures for blade defect classification.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>F1-Score (%)</th>
<th>Precision (%)</th>
<th>Recall (%)</th>
<th>Accuracy (%)</th>
<th>Inference Time (s/image)</th>
</tr>
</thead>
<tbody>
<tr>
<td>VGG19</td>
<td>83.03</td>
<td>82.26</td>
<td>84.37</td>
<td>84.01</td>
<td>0.0175</td>
</tr>
<tr>
<td>MobileNetV2</td>
<td>86.25</td>
<td>86.84</td>
<td>85.13</td>
<td>86.00</td>
<td>0.0241</td>
</tr>
<tr>
<td>Swin-Transformer-Tiny</td>
<td>86.48</td>
<td>85.92</td>
<td>87.15</td>
<td>87.05</td>
<td>0.0892</td>
</tr>
<tr>
<td>VGG16</td>
<td>87.24</td>
<td>88.25</td>
<td>88.54</td>
<td>89.05</td>
<td>0.0164</td>
</tr>
<tr>
<td>EfficientNetB7</td>
<td>87.21</td>
<td>89.28</td>
<td>88.79</td>
<td>89.01</td>
<td>0.1315</td>
</tr>
<tr>
<td>EfficientNetV2B3</td>
<td>88.52</td>
<td>88.22</td>
<td>90.65</td>
<td>91.41</td>
<td>0.0581</td>
</tr>
<tr>
<td>InceptionV3</td>
<td>89.85</td>
<td>90.52</td>
<td>91.15</td>
<td>91.50</td>
<td>0.0361</td>
</tr>
<tr>
<td>DenseNet121</td>
<td>90.00</td>
<td>91.01</td>
<td>90.13</td>
<td>92.10</td>
<td>0.0558</td>
</tr>
<tr>
<td>InceptionResNetV2</td>
<td>91.32</td>
<td>91.18</td>
<td>92.40</td>
<td>91.02</td>
<td>0.0752</td>
</tr>
<tr>
<td>TurbineBladeDetNet</td>
<td>98.66</td>
<td>98.65</td>
<td>98.68</td>
<td>97.14</td>
<td>0.0110</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Among the baselines, DenseNet121 [<xref ref-type="bibr" rid="ref-38">38</xref>] attains the highest accuracy (92.10%), with a 90.00% F1-score. Its dense connectivity pattern enables feature reuse across layers, which helps capture blade surface textures. However, the architecture lacks explicit attention mechanisms to separate defect-specific patterns from background variations, which limits discrimination between visually similar categories such as erosion and paint-off. InceptionV3 [<xref ref-type="bibr" rid="ref-35">35</xref>] reaches an 89.85% F1-score through multi-scale feature extraction with parallel 1 <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, and 5 <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5 convolutions. This multi-path design captures defects at different scales but weights all extracted features uniformly, with no mechanism to emphasize patterns specific to blade faults. InceptionResNetV2 [<xref ref-type="bibr" rid="ref-36">36</xref>] combines Inception modules with residual connections and achieves the highest baseline F1-score (91.32%) and recall (92.40%). Residual connections facilitate gradient flow during training and support deeper networks, but the architecture still lacks task-specific attention to prioritize blade-relevant features. EfficientNet variants [<xref ref-type="bibr" rid="ref-39">39</xref>] apply compound scaling across depth, width, and resolution. EfficientNetV2B3 reaches an 88.52% F1-score, whereas EfficientNetB7 achieves only 87.21% despite an inference time of 0.1315 s, roughly 12<inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> that of TurbineBladeDetNet. 
This suggests that scaling model capacity alone yields diminishing returns without architectural mechanisms tailored to blade defect characteristics. VGG architectures [<xref ref-type="bibr" rid="ref-32">32</xref>] (VGG16: 87.24%, VGG19: 83.03% F1-score) and MobileNetV2 [<xref ref-type="bibr" rid="ref-33">33</xref>] (86.25% F1-score) show limited effectiveness. VGG&#x2019;s uniform 3 <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3 convolutions lack the multi-scale receptive fields needed for defects of varied size, and MobileNetV2&#x2019;s depthwise separable convolutions reduce parameters at the cost of the representational capacity needed for subtle defect discrimination.</p>
<p>The comparison also includes a recent transformer architecture to assess its applicability to blade defect detection. Swin-Transformer-Tiny [<xref ref-type="bibr" rid="ref-40">40</xref>] achieves 87.05% accuracy, below most CNN baselines. This is consistent with the common observation that vision transformers need large-scale training data (typically exceeding 10,000 images) to train effectively, whereas the current dataset contains 915 training samples after augmentation. Its accuracy, 10.09 percentage points below TurbineBladeDetNet, together with its higher inference time (0.0892 vs. 0.0110 s), indicates that CNN architectures remain better suited to blade inspection scenarios with limited training data and real-time requirements. Among efficiency-oriented architectures, MobileNetV2 (86.00%), EfficientNetB7 (89.01%), and EfficientNetV2B3 (91.41%) reach 5.73 to 11.14 percentage points lower accuracy than TurbineBladeDetNet while also running slower (0.0241 to 0.1315 s vs. 0.0110 s per image), indicating that the proposed dual-attention multi-path architecture offers better defect discrimination without an inference-speed penalty.</p>
<p>The proposed architecture outperforms the most accurate baseline (DenseNet121) by 8.66 points in F1-score and 5.04 points in accuracy. We attribute these gains to the dual-attention mechanism, which first identifies informative feature channels (channel attention) and then localizes the spatial regions containing defects (spatial attention). This cascaded design is expected to be particularly beneficial for challenging cases such as hairline cracks in complex textures, gradual paint degradation, and low-contrast erosion boundaries.</p>
<p>The multi-path InceptionV3 backbone provides essential multi-scale features through parallel convolutions (1 <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, 5 <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5 filters), capturing thin cracks and large erosion regions within the same framework. Channel attention then recalibrates importance across 2048 channels with reduction ratio r &#x003D; 16, emphasizing channels encoding discriminative patterns like localized temperature variations (delamination indicators) and irregular distributions (structural damage signatures). Spatial attention subsequently refines localization through 7 <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 7 convolution over concatenated average and max-pooled representations.</p>
<p>The Albumentations-based augmentation addresses severe class imbalance in the original distribution (107 missing teeth, 127 erosion, 35 paint-off, 27 lightning damage, 35 crack). Expanding every class to 183 samples (915 in total) through photometric adjustments (CLAHE, HSV shifts), geometric transforms, occlusion simulation, and noise injection yields a balanced training distribution that generalizes across all defect types. Class-specific F1-scores range from 97.43% (paint-off) to 99.80% (lightning damage), indicating consistent performance on both common and rare faults.</p>
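<p>The balancing arithmetic can be checked directly: raising each class to 183 images requires 584 synthetic images on top of the 331 originals, for 915 in total. The sketch also pairs source images with augmentation families round-robin; the family labels, the round-robin policy, and <monospace>augmentation_plan</monospace> are hypothetical placeholders, since the exact Albumentations transforms and parameters are not specified here.</p>
```python
import itertools

original = {"missing_teeth": 107, "erosion": 127, "paint_off": 35,
            "lightning_damage": 27, "crack": 35}
TARGET_PER_CLASS = 183

# augmented images each class needs to reach the common target
needed = {name: TARGET_PER_CLASS - n for name, n in original.items()}
print(needed)
print(sum(original.values()), sum(needed.values()),
      sum(original.values()) + sum(needed.values()))  # 331 584 915

# hypothetical augmentation-family labels (not the paper's exact Albumentations settings)
FAMILIES = ["clahe", "hsv_shift", "geometric", "occlusion", "noise"]


def augmentation_plan(n_original, target, families=FAMILIES):
    """Pair source-image indices with augmentation families, round-robin, until `target` is reached."""
    sources = itertools.cycle(range(n_original))
    fams = itertools.cycle(families)
    return [(next(sources), next(fams)) for _ in range(max(target - n_original, 0))]


plan = augmentation_plan(original["lightning_damage"], TARGET_PER_CLASS)
print(len(plan), plan[:3])
```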
<p>The attention modules operate on already-extracted feature maps and add little overhead, because the r &#x003D; 16 reduction keeps the channel-attention layers small. TurbineBladeDetNet processes an image in 0.0110 s (about 91 FPS). By comparison, EfficientNetB7 reaches only 7.6 FPS while also being less accurate, and DenseNet121 and InceptionResNetV2 reach 18 FPS and 13 FPS, respectively. This suggests that, for blade inspection tasks, architectural specialization through attention offers a better accuracy-efficiency tradeoff than generic capacity scaling.</p>
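<p>The overhead claim can be made concrete by counting the parameters that the two attention modules add. The sketch assumes a CBAM-style layout with biases. Channel attention is taken to be a shared two-layer MLP (2048 to 2048/16 &#x003D; 128 to 2048), and spatial attention a single-output 7 &#x00D7; 7 convolution over the two pooled maps. The text does not specify these details, so the counts are illustrative.</p>
<preformat preformat-type="code">
```python
def channel_attention_params(channels, reduction, bias=True):
    """Parameters of a shared MLP C to C // r to C (assumed CBAM-style)."""
    hidden = channels // reduction
    weights = 2 * channels * hidden
    return weights + (hidden + channels if bias else 0)


def spatial_attention_params(kernel_size, in_maps=2, bias=True):
    """One k x k conv mapping the [avg; max] maps to a single attention map."""
    return in_maps * kernel_size * kernel_size + (1 if bias else 0)


ca = channel_attention_params(2048, 16)
sa = spatial_attention_params(7)
print(ca, sa, ca + sa)  # 526464 99 526563
```
</preformat>
<p>Under these assumptions the modules add about 0.53 million parameters, nearly all of them in the channel MLP. The spatial kernel contributes only 99.</p>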
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Per-Class Performance</title>
<p><xref ref-type="table" rid="table-2">Table 2</xref> shows strong detection across all five defect categories. TurbineBladeDetNet achieves sensitivity ranging from 97.65% to 99.80%, specificity from 95.67% to 100%, and F1-scores from 97.43% to 99.80%.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Per-class metrics for TurbineBladeDetNet across five defect categories.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Defect Category</th>
<th>F1-Score (%)</th>
<th>Precision (%)</th>
<th>Recall (%)</th>
<th>Specificity (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Lightning Damage</td>
<td>99.80</td>
<td>99.80</td>
<td>99.80</td>
<td>99.47</td>
</tr>
<tr>
<td>Missing Teeth</td>
<td>99.05</td>
<td>99.01</td>
<td>99.10</td>
<td>99.21</td>
</tr>
<tr>
<td>Crack</td>
<td>98.94</td>
<td>100.00</td>
<td>97.90</td>
<td>100.00</td>
</tr>
<tr>
<td>Erosion</td>
<td>98.09</td>
<td>97.24</td>
<td>98.96</td>
<td>98.91</td>
</tr>
<tr>
<td>Paint-Off</td>
<td>97.43</td>
<td>97.21</td>
<td>97.65</td>
<td>95.67</td>
</tr>
</tbody>
</table>
</table-wrap>
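<p>F1 is the harmonic mean of precision and recall, so the F1 column of Table 2 can be cross-checked from the other two columns. Averaging over the five classes also reproduces the overall precision (98.65%), recall (98.68%), and F1-score (98.66%) reported for the model, which is consistent with these being macro averages. The Python check below uses only values from Table 2.</p>
<preformat preformat-type="code">
```python
# Per-class (precision, recall) in % from Table 2.
table2 = {
    "Lightning Damage": (99.80, 99.80),
    "Missing Teeth": (99.01, 99.10),
    "Crack": (100.00, 97.90),
    "Erosion": (97.24, 98.96),
    "Paint-Off": (97.21, 97.65),
}


def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: round(f1(p, r), 2) for name, (p, r) in table2.items()}
n = len(table2)
macro_precision = sum(p for p, _ in table2.values()) / n
macro_recall = sum(r for _, r in table2.values()) / n
macro_f1 = sum(f1(p, r) for p, r in table2.values()) / n

print(per_class_f1)
print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2))  # 98.65 98.68 98.66
```
</preformat>
<p>All five recomputed F1 values agree with the table to two decimal places.</p>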
<p>Per-class accuracy analysis across all evaluated architectures shows that TurbineBladeDetNet leads in every defect category. Lightning damage detection (<xref ref-type="fig" rid="fig-5">Fig. 5</xref>) reaches 99.80% accuracy, a 5.2-percentage-point margin over DenseNet121 (94.6%). Most architectures perform well on this class (92% to 99%), which reflects its distinctive morphology. Sharp contrast gradients and geometric discontinuities around receptor caps provide robust classification cues even for attention-free networks. Missing teeth classification (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>) follows a similar pattern, with TurbineBladeDetNet reaching 99.10% vs. DenseNet121&#x2019;s 93.8%. The geometric nature of leading-edge discontinuities yields consistent detectability across architectural paradigms. The 5.3-point improvement nonetheless suggests that attention-guided feature refinement sharpens the discrimination of partial vs. complete tooth loss.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Per-class accuracy for lightning damage detection. TurbineBladeDetNet achieves 99.80% compared to DenseNet121&#x2019;s 94.6% (5.2 point improvement). High baseline performance reflects distinctive high-contrast features at receptor caps.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-5.tif"/>
</fig><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Per-class accuracy for missing teeth detection. TurbineBladeDetNet achieves 99.10% vs. DenseNet121&#x2019;s 93.8% (5.3 point improvement). Geometric discontinuities provide robust classification cues across architectures.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-6.tif"/>
</fig>
<p>Crack detection (<xref ref-type="fig" rid="fig-7">Fig. 7</xref>) exposes greater architectural sensitivity. TurbineBladeDetNet achieves 97.90% compared to DenseNet121&#x2019;s 90.2%, a 7.7-percentage-point differential and the second-largest gap among the evaluated categories. This wider margin reflects the fundamental challenge of isolating thin, low-contrast linear structures under variable illumination and viewing angles. Baseline architectures without explicit spatial attention struggle to suppress background texture while enhancing hairline features, whereas the proposed dual-attention design is built to amplify fine-grained spatial patterns. Erosion classification (<xref ref-type="fig" rid="fig-8">Fig. 8</xref>) yields 98.96% for TurbineBladeDetNet against 91.8% for DenseNet121 (a 7.2-point improvement). This indicates effective discrimination of progressive surface degradation from benign weathering artifacts, a distinction complicated by the gradual intensity transitions and diffuse boundaries typical of leading-edge erosion.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Per-class accuracy for crack detection. TurbineBladeDetNet achieves 97.90% compared to DenseNet121&#x2019;s 90.2% (7.7 point improvement). Expanded margin reflects dual-attention effectiveness for thin, low-contrast linear structures.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-7.tif"/>
</fig><fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Per-class accuracy for erosion classification. TurbineBladeDetNet achieves 98.96% vs. DenseNet121&#x2019;s 91.8% (7.2 point improvement). Performance gap demonstrates effective discrimination of gradual surface degradation from weathering artifacts.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-8.tif"/>
</fig>
<p>Paint-off detection (<xref ref-type="fig" rid="fig-9">Fig. 9</xref>) shows the most pronounced separation: 97.65% for TurbineBladeDetNet vs. 89.4% for DenseNet121, an 8.3-percentage-point advantage. This is the largest differential among all defect categories. It is consistent with a key role for channel and spatial attention in disambiguating diffuse coating loss from visually similar erosion boundaries and localized discoloration. Paint-off presents ambiguously (gradual intensity variation without sharp geometric cues), which disproportionately penalizes architectures that rely solely on hierarchical feature abstraction. Improvements range from 5.2 to 8.3 points across all classes, with margins broadly increasing with class-specific ambiguity (lightning: 5.2, paint-off: 8.3). This suggests that the architectural changes address fundamental representational challenges rather than overfitting to particular defect signatures. Baseline models show a cross-class spread of roughly &#x00B1;3% (standard deviation within each architecture). TurbineBladeDetNet&#x2019;s per-class accuracies cluster tightly (97.65% to 99.80%) while preserving the same class difficulty ordering, indicating robust generalization without sacrificing fine-grained discrimination.</p>
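<p>The per-class margins and the cross-class spread of each model can be recomputed from the accuracies quoted in Figs. 5 to 9. The sketch below uses Python&#x2019;s <monospace>statistics</monospace> module. Using the population standard deviation over the five classes is an assumption, since the text does not state which variant it uses. The sketch also checks that both models rank the classes in the same order.</p>
<preformat preformat-type="code">
```python
import statistics

# Per-class accuracy (%) quoted in Figs. 5 to 9.
proposed = {"lightning": 99.80, "missing_teeth": 99.10, "crack": 97.90,
            "erosion": 98.96, "paint_off": 97.65}
densenet121 = {"lightning": 94.6, "missing_teeth": 93.8, "crack": 90.2,
               "erosion": 91.8, "paint_off": 89.4}

# Margin of TurbineBladeDetNet over DenseNet121 for each class, in points.
margins = {k: round(proposed[k] - densenet121[k], 2) for k in proposed}

# Cross-class spread of each model (population standard deviation, an assumption).
spread_proposed = statistics.pstdev(list(proposed.values()))
spread_densenet = statistics.pstdev(list(densenet121.values()))

# Class ordering from easiest to hardest for each model.
rank_proposed = sorted(proposed, key=proposed.get, reverse=True)
rank_densenet = sorted(densenet121, key=densenet121.get, reverse=True)

print(margins)
print(round(spread_proposed, 2), round(spread_densenet, 2))
print(rank_proposed == rank_densenet)  # True
```
</preformat>
<p>Unrounded, the erosion and paint-off margins are 7.16 and 8.25 points, which the text reports to one decimal place. For DenseNet121, the strongest baseline, the spread is about 2.0 points, against about 0.8 for TurbineBladeDetNet.</p>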
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Per-class accuracy for paint-off detection. TurbineBladeDetNet achieves 97.65% vs. DenseNet121&#x2019;s 89.4% (8.3 point improvement). The largest differential is consistent with the role of attention in disambiguating diffuse coating loss from erosion boundaries.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-9.tif"/>
</fig>
<p>Lightning damage shows the best balance among the four metrics, reaching 99.80% for sensitivity, precision, and F1-score with 99.47% specificity. This behavior is consistent with the distinctive local cues around receptor caps and burn marks, which the model can isolate reliably. Crack also performs strongly, with perfect precision and specificity (100%) and a 98.94% F1-score. This shows that the model can recognize thin, elongated structures without producing false positives. Both classes benefit from sharp, high-contrast morphology and from the attention mechanism&#x2019;s ability to enhance fine structural features.</p>
<p>Erosion and paint-off show slightly lower figures, with F1-scores of 98.09% and 97.43%, respectively. The small gap can be attributed to their visual proximity and low-contrast boundaries, especially under oblique lighting. The paint-off specificity of 95.67% indicates that some surfaces with diffuse discoloration or weathering are occasionally flagged as paint loss. In practice, this is preferable to missed detections, but it highlights a natural ambiguity at the boundary between gradual erosion and abrupt coating removal.</p>
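<p>In a multi-class setting, specificity for a class is computed one-vs-rest: every sample of another class that is predicted as that class counts as a false positive. The sketch below shows this bookkeeping on a hypothetical three-class confusion matrix. The counts are invented for illustration and are not results from this study. Erosion samples predicted as paint-off lower paint-off&#x2019;s precision and specificity without affecting its recall.</p>
<preformat preformat-type="code">
```python
def one_vs_rest_metrics(cm, k):
    """Precision, recall (sensitivity) and specificity of class k.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k][j] for j in range(n) if j != k)  # class k predicted as something else
    fp = sum(cm[i][k] for i in range(n) if i != k)  # other classes predicted as k
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity


# Hypothetical counts for illustration only (rows: true class, columns: predicted).
classes = ["erosion", "paint_off", "other"]
cm = [
    [45, 3, 2],
    [1, 47, 2],
    [0, 4, 46],
]
p, r, s = one_vs_rest_metrics(cm, classes.index("paint_off"))
print(f"precision={p:.3f} recall={r:.3f} specificity={s:.3f}")
```
</preformat>
<p>In this toy matrix, the three erosion samples and four &#x201C;other&#x201D; samples predicted as paint-off are the false positives. As a result, paint-off precision (0.870) falls well below its recall (0.940).</p>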
<p>Missing teeth attains a 99.05% F1-score with high sensitivity and specificity, reflecting clear geometric discontinuities that are well captured by the model. Taken together, the class-wise outcomes suggest that dual attention improves representation of thin and small-scale structures, while the multi-path design helps reconcile global context with local texture cues. The augmentation strategy appears to have stabilized performance on rare or visually heterogeneous categories without sacrificing precision on the more common classes.</p>
<p>From an operational standpoint, high specificity reduces unnecessary maintenance dispatches, and high sensitivity limits the risk of overlooking actionable defects. The reported inference time of 0.0110 s per image supports near-real-time use in UAV inspection workflows, where throughput and responsiveness are important. On the DTU dataset, the model performs consistently across all defect categories, which is a prerequisite for operational deployment in automated blade inspection systems.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison with Recent WTB Inspection Methods</title>
<p><xref ref-type="table" rid="table-3">Table 3</xref> compares TurbineBladeDetNet against three recent WTB inspection approaches [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>]. Public implementations were unavailable, so each method was reproduced from its published specifications. All models were then evaluated on the same DTU dataset [<xref ref-type="bibr" rid="ref-11">11</xref>] splits using a consistent evaluation protocol.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Performance comparison with recent wind turbine blade defect detection methods.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>F1-Score (%)</th>
<th>Precision (%)</th>
<th>Recall (%)</th>
<th>Accuracy (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Spajic et al. [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>87.54</td>
<td>86.01</td>
<td>89.19</td>
<td>88.90</td>
</tr>
<tr>
<td>Yang et al. [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>90.13</td>
<td>90.03</td>
<td>90.24</td>
<td>92.26</td>
</tr>
<tr>
<td>Gohar et al. [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>89.64</td>
<td>89.21</td>
<td>90.09</td>
<td>90.14</td>
</tr>
<tr>
<td>TurbineBladeDetNet</td>
<td>98.66</td>
<td>98.65</td>
<td>98.68</td>
<td>97.14</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Spajic et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] achieve 88.90% accuracy, 86.01% precision, 89.19% recall, and an F1-score of 87.54%. This result confirms the practicality of their transfer-learning design for aerial inspection. However, their precision trails their recall by about three points, indicating more false positives than the proposed model. Gohar et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] obtain 90.14% accuracy, 89.21% precision, 90.09% recall, and an F1-score of 89.64%. Their slicing strategy improves sensitivity to small defects, although overall performance remains below that of the proposed model under matched conditions. Yang et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] reach 92.26% accuracy, 90.03% precision, 90.24% recall, and an F1-score of 90.13%. This shows that image stitching and defect consolidation are beneficial but still fall short of the discriminative power of the proposed architecture.</p>
<p>TurbineBladeDetNet records 97.14% accuracy with an F1-score of 98.66%, precision of 98.65%, and recall of 98.68%. It outperforms the strongest comparator (Yang et al. [<xref ref-type="bibr" rid="ref-4">4</xref>]) by about five percentage points in accuracy and more than eight points in F1-score. We attribute these gains to two factors. The first is the multi-path feature design with dual attention, which strengthens separation between visually proximate classes such as erosion and paint-off. The second is a balanced training setup that addresses the class imbalance noted in prior work [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
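<p>The margins quoted above can be recomputed from Table 3. This short Python check selects the strongest comparator by accuracy and reports the accuracy and F1-score gaps to each method.</p>
<preformat preformat-type="code">
```python
# (accuracy, F1-score) in % from Table 3.
comparators = {
    "Spajic et al. [6]": (88.90, 87.54),
    "Yang et al. [4]": (92.26, 90.13),
    "Gohar et al. [5]": (90.14, 89.64),
}
proposed_acc, proposed_f1 = 97.14, 98.66

# Gap of TurbineBladeDetNet over each comparator, in percentage points.
gaps = {
    name: (round(proposed_acc - acc, 2), round(proposed_f1 - f1, 2))
    for name, (acc, f1) in comparators.items()
}
strongest = max(comparators, key=lambda name: comparators[name][0])

print(strongest, gaps[strongest])  # Yang et al. [4] (4.88, 8.53)
print(gaps)
```
</preformat>
<p>The gaps to the other two methods are 7.00 and 8.24 points in accuracy and 9.02 and 11.12 points in F1-score. These values are consistent with the 5&#x2013;8 and 8&#x2013;11 point ranges quoted in the Discussion.</p>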
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Attention Mechanism Visualizations</title>
<p>To examine the interpretability of the dual-attention mechanism, <xref ref-type="fig" rid="fig-10">Fig. 10</xref> presents qualitative visualizations of learned attention patterns for two representative defect types: Missing Teeth and Paint-off. The attention heatmaps show which regions the architecture emphasizes during inference. The model responds strongly along structural boundaries, surface discontinuities, and anomalous areas where defects manifest. Missing Teeth and Paint-off damage differ substantially in morphology and visual appearance. Despite this, both samples show consistent attention patterns that prioritize blade edges and defect-indicative regions. This cross-defect consistency suggests that the dual-attention mechanism learns generalizable, interpretable representations rather than defect-specific artifacts. The model focuses selectively on structurally and semantically relevant regions and suppresses the background effectively. This behavior supports the model&#x2019;s reliability for operational wind turbine blade inspection, as well as the design choices behind the dual-attention architecture.</p>
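<p>The text does not specify how the heatmaps in Fig. 10 were rendered. One common approach, sketched here purely as an assumption, proceeds in three steps. First, take the (H, W) map produced by a spatial attention module and upsample it to the input resolution. Then min-max normalize it. Finally, alpha-blend it with the image. The function names, the nearest-neighbor upsampling, and the blending weight are hypothetical choices.</p>
<preformat preformat-type="code">
```python
import numpy as np


def upsample_nearest(attn, out_h, out_w):
    """Nearest-neighbor upsampling of an (h, w) map to (out_h, out_w)."""
    h, w = attn.shape
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return attn[rows[:, None], cols[None, :]]


def overlay(gray_image, attn, alpha=0.5):
    """Blend a grayscale image in [0, 1] with a min-max normalized attention map."""
    heat = upsample_nearest(attn, *gray_image.shape)
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / span if span else np.zeros_like(heat)
    return (1.0 - alpha) * gray_image + alpha * heat, heat


rng = np.random.default_rng(0)
image = rng.random((299, 299))  # stand-in for a grayscale UAV frame
attn = rng.random((8, 8))       # stand-in for a spatial attention map
blended, heat = overlay(image, attn)
print(blended.shape, heat.min(), heat.max())  # (299, 299) 0.0 1.0
```
</preformat>
<p>Nearest-neighbor upsampling keeps the blocky 8 &#x00D7; 8 structure visible. Bilinear interpolation is a common alternative when a smoother overlay is preferred.</p>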
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Attention heatmap visualization for two representative defect types. (<bold>a</bold>) Missing Teeth defect demonstrating attention focus on structural discontinuities. (<bold>b</bold>) Paint-off defect showing attention on surface anomalies.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-10.tif"/>
</fig>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Ablation Study</title>
<p>To identify individual component contributions and validate the architectural design choices, ablation experiments were conducted to isolate three key elements: attention mechanisms, data augmentation, and multi-path feature extraction. <xref ref-type="table" rid="table-4">Table 4</xref> presents the results across six configurations.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comprehensive ablation analysis and hybrid method comparison.</title>
</caption>
<table>
<colgroup>
<col align="center" width="35mm"/>
<col align="center" width="19mm"/>
<col align="center" width="23mm"/>
<col align="center" width="18mm"/>
<col align="center" width="15mm"/>
<col align="center" width="15mm"/>
<col align="center" width="15mm"/> </colgroup>
<thead>
<tr>
<th>Model Configuration</th>
<th>Backbone</th>
<th>Attention</th>
<th>Augmentation</th>
<th>Accuracy (%)</th>
<th>Precision (%)</th>
<th>F1-Score (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>InceptionV3</td>
<td>None</td>
<td>No</td>
<td>89.48</td>
<td>88.72</td>
<td>87.35</td>
</tr>
<tr>
<td>Baseline &#x002B; Augmentation</td>
<td>InceptionV3</td>
<td>None</td>
<td>Yes</td>
<td>91.50</td>
<td>90.52</td>
<td>89.85</td>
</tr>
<tr>
<td>&#x002B; Channel Attention</td>
<td>InceptionV3</td>
<td>Channel</td>
<td>Yes</td>
<td>94.08</td>
<td>93.24</td>
<td>93.68</td>
</tr>
<tr>
<td>&#x002B; Spatial Attention</td>
<td>InceptionV3</td>
<td>Spatial</td>
<td>Yes</td>
<td>93.56</td>
<td>92.48</td>
<td>92.76</td>
</tr>
<tr>
<td>Hybrid baseline</td>
<td>ResNet50</td>
<td>Channel &#x002B; Spatial</td>
<td>Yes</td>
<td>95.32</td>
<td>96.18</td>
<td>96.42</td>
</tr>
<tr>
<td>TurbineBladeDetNet (Full)</td>
<td>InceptionV3</td>
<td>Channel &#x002B; Spatial</td>
<td>Yes</td>
<td>97.14</td>
<td>98.65</td>
<td>98.66</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The baseline InceptionV3 without attention or augmentation achieves 89.48% accuracy with an 87.35% F1-score. Applying the systematic augmentation pipeline (CLAHE, geometric transformations, occlusion simulation, noise injection) raises accuracy to 91.50%. This gain of 2.02 percentage points comes from addressing class imbalance and environmental variability. It supports the bounding-box-aware augmentation strategy as important for handling diverse UAV inspection conditions and for expanding underrepresented defect classes to approximately 183 samples per category.</p>
<p>Building on the augmented baseline, channel attention alone improves accuracy to 94.08%, a 2.58-percentage-point gain. This indicates effective feature recalibration that emphasizes defect-relevant feature channels while suppressing background noise. Channel-wise recalibration appears particularly effective for distinguishing visually similar categories such as erosion and paint-off. Spatial attention alone reaches 93.56%, a 2.06-point gain. This is consistent with its role in localizing defect regions while suppressing background activation. It also helps detect small-scale anomalies such as thin cracks and erosion boundaries, where precise spatial localization is critical.</p>
<p>The complete dual-attention architecture, combining both mechanisms in cascade, achieves 97.14% accuracy with a 98.66% F1-score. This is a 5.64-percentage-point gain over the augmented baseline, which exceeds the sum of the two individual gains (2.58 &#x002B; 2.06 &#x003D; 4.64 points). The super-additive improvement is consistent with the cascade design. Channel attention first identifies informative features, and spatial attention then localizes where these features appear. In other words, channel attention determines &#x201C;what features&#x201D; to emphasize, while spatial attention determines &#x201C;where&#x201D; to focus. This two-stage refinement appears particularly effective for low-contrast defects in complex blade textures, such as hairline cracks and gradual coating degradation.</p>
<p>To compare against hybrid deep learning approaches and evaluate the contribution of the multi-path architecture, an additional experiment employed ResNet50 with the same dual-attention mechanisms (channel &#x002B; spatial) and augmentation pipeline. This configuration is intended to represent recently published hybrid fault detection methods that combine residual learning with attention mechanisms. Despite using the same attention configuration and data augmentation as TurbineBladeDetNet, ResNet50 achieves 95.32% accuracy compared to TurbineBladeDetNet&#x2019;s 97.14%, a gap of 1.82 percentage points. This comparison serves two purposes. First, it compares the proposed model against alternative hybrid architectures that integrate CNN backbones with attention mechanisms. Second, it isolates the specific contribution of InceptionV3&#x2019;s parallel multi-scale feature extraction (1 <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, 5 <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5 convolutional paths) over ResNet50&#x2019;s sequential single-path processing. The results indicate that the performance gains do not stem merely from integrating attention. They come from combining the multi-path architecture with cascaded dual attention, which is particularly important for detecting elongated, scale-variant blade defects.</p>
<p>The ablation results reveal complementary contributions across components. Augmentation addresses data variability (&#x002B;2.02 points), channel attention enhances discriminative capacity (&#x002B;2.58 points), and spatial attention improves localization (&#x002B;2.06 points). Their cascade yields a super-additive gain (&#x002B;5.64 points), and the multi-path design captures multi-scale features (&#x002B;1.82 points over single-path). The complete architecture achieves 97.14% accuracy with precision of 98.65% and recall of 98.68%. This is a 7.66-percentage-point improvement over the plain (non-augmented) baseline and 5.64 points over the augmented baseline. Both precision and recall exceed 98%. This balance matters for operational deployment, where missed defects and false alarms both carry economic consequences. These results indicate that the performance gains come from integrating blade-tailored components rather than from any single enhancement.</p>
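<p>The gains listed above can be recomputed from Table 4. The check below makes explicit the two reference points used in this section: the plain baseline (for the 7.66-point total) and the augmented baseline (for the 5.64-point cascade gain). It also measures how far the cascade exceeds the sum of the two single-attention gains.</p>
<preformat preformat-type="code">
```python
# Accuracy (%) of each configuration in Table 4.
accuracy = {
    "baseline": 89.48,
    "baseline_aug": 91.50,
    "channel_only": 94.08,
    "spatial_only": 93.56,
    "resnet50_dual": 95.32,
    "full": 97.14,
}


def gain(config, reference):
    """Accuracy gain of one configuration over another, in percentage points."""
    return round(accuracy[config] - accuracy[reference], 2)


deltas = {
    "augmentation": gain("baseline_aug", "baseline"),
    "channel_attention": gain("channel_only", "baseline_aug"),
    "spatial_attention": gain("spatial_only", "baseline_aug"),
    "dual_cascade": gain("full", "baseline_aug"),
    "multi_path": gain("full", "resnet50_dual"),
    "full_vs_plain_baseline": gain("full", "baseline"),
}

# Positive when the cascade beats the sum of the two single-attention gains.
interaction = round(
    deltas["dual_cascade"] - deltas["channel_attention"] - deltas["spatial_attention"], 2
)
print(deltas)
print(interaction)  # 1.0
```
</preformat>
<p>The interaction term of about 1 point is what this section describes as a synergistic effect. The full cascade gains roughly one point more than the sum of the channel-only and spatial-only gains.</p>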
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>Our results show that TurbineBladeDetNet reaches 97.14% accuracy, with an F1-score of 98.66%, precision of 98.65%, and recall of 98.68%. Inference takes 0.0110 s per image. Training shows stable convergence and a small generalization gap (<xref ref-type="fig" rid="fig-11">Fig. 11</xref>). This indicates that the proposed architecture and augmentation strategy effectively mitigate overfitting, contributing to the strong generalization observed across defect categories. Relative to the strongest baseline (DenseNet121 at 92.10% accuracy, 90.00% F1-score), these outcomes represent gains of approximately five percentage points in accuracy and more than eight points in F1-score. Improvements over recent wind-turbine inspection methods [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>] are similar. We attribute the performance advantage primarily to the sequential application of channel-wise and spatial attention modules on top of a multi-path InceptionV3 backbone. Channel attention recalibrates feature importance across 2048 channels with a reduction ratio of 16. This allows the network to suppress background texture and emphasize defect-specific patterns such as thin cracks, localized erosion boundaries, and lightning-receptor burn marks. Spatial attention subsequently refines the feature map by highlighting discriminative spatial regions through a 7 <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 7 convolutional kernel applied to concatenated average- and max-pooled representations. This two-stage refinement process addresses the challenge of small, low-contrast defects embedded in visually complex blade surfaces.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Training convergence of TurbineBladeDetNet over 50 epochs. The model converges steadily with a minimal generalization gap, and the best validation performance is achieved at epoch 40, indicating stable optimization and effective regularization.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMES_77956-fig-11.tif"/>
</fig>
<p>The multi-path architecture of InceptionV3 enables parallel extraction of features at multiple scales through 1 <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 1, 3 <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 3, and 5 <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 5 convolutions. This is particularly beneficial for distinguishing visually similar classes such as erosion and paint-off. Without this multi-scale capacity, single-path backbones such as VGG16 and VGG19 achieve only 84%&#x2013;89% accuracy on the same task. This suggests that scale diversity is critical for capturing the range of defect morphologies present in wind turbine blade imagery. Multi-path feature extraction combined with dual attention thus provides both global context and fine-grained localization. This helps explain the substantial improvement over architectures that employ only one of these mechanisms.</p>
<p>The augmentation pipeline using Albumentations [<xref ref-type="bibr" rid="ref-29">29</xref>] plays a central role in the model&#x2019;s robustness. The pipeline applies CLAHE, random brightness and contrast adjustments, HSV shifts, affine and perspective transformations, coarse dropout, and noise injection. These operations expose the model to the illumination changes, viewing angles, and partial occlusions commonly encountered during UAV flights, making it more robust to them. Bounding-box-aware augmentation preserves annotation integrity while expanding each training class to approximately 183 images. This eliminates the severe imbalance in the original distribution (107 missing teeth, 127 erosion, 35 paint-off, 27 lightning damage, 35 crack). The class-specific metrics point to the value of this balancing strategy. Lightning damage, the most underrepresented class with only 27 images, reached an F1-score of 99.80% after augmentation, and paint-off, with 35 images, reached 97.43%.</p>
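<p>The parallel multi-scale extraction discussed above can be illustrated with a simplified Inception-style block in NumPy. Three branches with 1 &#x00D7; 1, 3 &#x00D7; 3, and 5 &#x00D7; 5 kernels see the same input at different receptive fields, and their outputs are concatenated along the channel axis. This is a toy sketch with hypothetical channel counts and random weights, not the InceptionV3 module itself. InceptionV3 additionally uses 1 &#x00D7; 1 reduction and pooling branches, and it factorizes larger kernels into stacked 3 &#x00D7; 3 and asymmetric 1 &#x00D7; n / n &#x00D7; 1 convolutions.</p>
<preformat preformat-type="code">
```python
import numpy as np


def conv2d_same(x, w):
    """'Same'-padded cross-correlation.

    x: (C_in, H, W) input; w: (C_out, C_in, k, k) kernels with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contract the input-channel axis for this kernel offset.
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + h, j:j + wd], axes=([1], [0]))
    return out


def multi_branch_block(x, branch_weights):
    """Run parallel branches (e.g. 1x1, 3x3, 5x5) with ReLU and concatenate on channels."""
    outputs = [np.maximum(conv2d_same(x, w), 0.0) for w in branch_weights]
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 17, 17))  # hypothetical 16-channel, 17 x 17 input
branch_weights = [rng.standard_normal((8, 16, k, k)) * 0.1 for k in (1, 3, 5)]
y = multi_branch_block(x, branch_weights)
print(y.shape)  # (24, 17, 17)
```
</preformat>
<p>Every branch uses same padding, so the spatial size is preserved and only the channel count grows (here 3 &#x00D7; 8 &#x003D; 24). This is what allows branches with different kernel sizes to be stacked.</p>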
<p>Class-specific F1-scores range from 97.43% to 99.80%, and this variation provides insight into the relative difficulty of each defect category. Lightning damage exhibits the highest performance (99.80% F1-score, 99.47% specificity). It presents sharp, high-contrast features around lightning receptor caps and burn marks that are geometrically distinct from other defect types. Crack achieves perfect precision and specificity (100%) with a 98.94% F1-score, meaning that every crack prediction on the evaluation set was correct. The slightly lower sensitivity (97.90%) suggests that a small fraction of thin, low-contrast cracks remain undetected. This applies particularly to cracks oriented parallel to blade edges, where they blend with natural texture variations. This behavior is consistent with the known challenge of detecting hairline cracks in composite materials under oblique illumination [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>]. Paint-off presents the lowest F1-score (97.43%) and specificity (95.67%), reflecting inherent ambiguity between gradual coating degradation and localized paint loss. Visual inspection of misclassified cases reveals that diffuse discoloration at erosion boundaries is occasionally labeled as paint-off, resulting in false positives. This confusion is not entirely undesirable from an operational standpoint, as both erosion and paint-off indicate surface degradation requiring maintenance attention. Nevertheless, the 95.67% specificity implies that about 4.3% of non-paint-off regions are incorrectly flagged, which could lead to unnecessary close inspections during field deployment.</p>
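<p>Specificity converts directly into an expected false-alarm count: with specificity s (in %), about (100 &#x2212; s)% of negatives are flagged. The sketch below applies this to the paint-off specificity from Table 2 and to crack, whose specificity is 100%. The volume of 1,000 non-target samples is a hypothetical figure chosen only to make the rate tangible.</p>
<preformat preformat-type="code">
```python
def expected_false_flags(specificity_pct, negatives):
    """Expected number of negatives flagged as positive at a given specificity (%)."""
    return (100.0 - specificity_pct) / 100.0 * negatives


NEGATIVES = 1000  # hypothetical number of non-target samples in a survey

paint_off_rate = round(100.0 - 95.67, 2)            # false-positive rate in %
paint_off_flags = expected_false_flags(95.67, NEGATIVES)
crack_flags = expected_false_flags(100.0, NEGATIVES)

print(paint_off_rate, round(paint_off_flags, 1), crack_flags)  # 4.33 43.3 0.0
```
</preformat>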
<p>TurbineBladeDetNet outperforms three recent wind-turbine inspection methods by 5&#x2013;8 percentage points in accuracy and 8&#x2013;11 points in F1-score. Spajic et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] achieve 88.90% accuracy using transfer learning on aerial imagery, but their model lacks task-specific attention mechanisms, resulting in lower precision (86.01%). Gohar et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] employ image slicing to handle high-resolution inputs and reach 90.14% accuracy. However, their approach requires manual stitching heuristics and exhibits lower recall (90.09%), suggesting missed detections in challenging regions. Yang et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] use image stitching for defect consolidation and attain 92.26% accuracy, the strongest among the comparators, but they still fall short of TurbineBladeDetNet. We attribute these performance gaps to three factors: the absence of dual attention explicitly tailored for blade defect discrimination, limited data augmentation confined to basic geometric transforms, and reliance on class weighting rather than targeted augmentation to address severe class imbalance. Public implementations of these methods were unavailable, so each was re-implemented from its published procedural details. Every effort was made to reproduce each approach faithfully. Even so, differences in preprocessing, hyperparameter tuning, and framework-specific implementation may contribute to the observed gaps. Nevertheless, all models were evaluated on identical data splits using consistent metrics, so the comparisons reflect relative performance under controlled conditions.</p>
<p>The inference time of 0.0110 s per image makes TurbineBladeDetNet suitable for near-real-time UAV inspection workflows. In comparison, EfficientNetB7 requires 0.1315 s per image (7.6 FPS), which would limit throughput during aerial surveys. DenseNet121 takes 0.0558 s (18 FPS), so it is slower than the proposed model despite being less accurate. This computational efficiency likely stems from two sources. InceptionV3&#x2019;s factorized convolutions and efficient multi-path operations reduce parameter count relative to VGG-style architectures while maintaining representational capacity. In addition, the attention modules add little overhead because they operate on already-extracted feature maps. For operational UAV inspection, processing speed directly affects survey coverage and flight duration. At about 91 FPS, a single inspection drone with sufficient on-board compute could analyze imagery from multiple blades during flight. This would enable on-board triage and selective high-resolution capture of detected defects, reducing post-flight data transfer and manual review time. The sensitivity (98.68%) and specificity (98.65%), both macro-averaged over the five classes, form a balanced profile. This profile supports reliable deployment in routine inspection workflows: few true defects are missed, and false alarms that would waste maintenance crew capacity are limited.</p>
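<p>The throughput figures follow directly from the reported latencies (FPS &#x003D; 1/latency). The sketch below also estimates the sequential processing time for a hypothetical 2,000-image survey. It ignores image loading, batching, and hardware differences between a workstation and on-board compute, so the numbers are only indicative.</p>
<preformat preformat-type="code">
```python
# Per-image inference latency in seconds, as reported in the text.
latency_s = {
    "TurbineBladeDetNet": 0.0110,
    "DenseNet121": 0.0558,
    "EfficientNetB7": 0.1315,
}

# Throughput for sequential, one-image-at-a-time inference.
fps = {name: 1.0 / t for name, t in latency_s.items()}


def processing_time_s(n_images, latency):
    """Total time to process n_images sequentially, ignoring I/O and batching."""
    return n_images * latency


N_IMAGES = 2000  # hypothetical number of images captured in one survey
for name, t in latency_s.items():
    print(f"{name}: {fps[name]:.1f} FPS, {processing_time_s(N_IMAGES, t):.0f} s")
```
</preformat>
<p>For the hypothetical survey, this gives about 22 s for TurbineBladeDetNet, about 112 s for DenseNet121, and about 263 s for EfficientNetB7.</p>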
<p>Despite strong performance, several limitations merit acknowledgment. First, our evaluation employs the DTU benchmark dataset (331 images), consistent with recent WTB inspection studies. Our systematic augmentation strategy (915 training images) introduces photometric transformations (CLAHE, HSV shifts), geometric variations, and degradation simulation to address intra-dataset variability. However, a comprehensive assessment of generalization requires evaluation on additional datasets. These should represent diverse geographical locations, turbine manufacturers with different blade materials and coatings, and varying operational environments. Multi-site validation is therefore an important next step before operational deployment.</p>
<p>For deployment in new operational contexts, site-specific validation with representative samples from the target wind farm is recommended, particularly when environmental conditions or blade characteristics differ substantially from the DTU dataset.</p>
<p>Second, the five-class taxonomy represents a community-driven consolidation but may not capture all operationally relevant defect types such as delamination, trailing-edge separation, and ice accumulation, which are documented failure modes not explicitly labeled in the current dataset [<xref ref-type="bibr" rid="ref-11">11</xref>]. Future work aims to expand the dataset with additional defect classes and diverse environmental conditions including varying weather patterns, lighting conditions, and seasonal effects to improve generalization across operational scenarios. Extension to instance segmentation or oriented bounding-box detection will enable precise defect localization and measurement critical for maintenance planning and structural assessment.</p>
<p>Overall, the integration of dual attention, multi-path feature extraction, and targeted data augmentation enables TurbineBladeDetNet to achieve the best performance among the compared methods on five-class wind turbine blade defect detection, with inference speeds practical for UAV deployment. The architecture handles diverse defect morphologies, from sharp geometric discontinuities to diffuse surface degradation, and shows balanced behavior across rare and common blade faults.</p>
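<p>The dual-attention integration summarized above follows the channel-then-spatial pattern of CBAM [<xref ref-type="bibr" rid="ref-30">30</xref>]. The NumPy sketch below illustrates that pattern on a single feature map of shape (C, H, W): a shared two-layer MLP over average- and max-pooled channel descriptors produces channel weights, and a k &#x00D7; k convolution over channel-wise mean and max maps produces spatial weights. It is illustrative only: the weights are random, biases are omitted, and the reduction ratio and kernel size are assumptions rather than the paper&#x2019;s configuration (CBAM&#x2019;s own defaults are a reduction ratio of 16 and a 7 &#x00D7; 7 kernel).</p>

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(f: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Channel weights in (0, 1), shape (C,), for a feature map f of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) form a shared MLP with a ReLU hidden layer.
    """
    def mlp(v: np.ndarray) -> np.ndarray:
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg_desc = f.mean(axis=(1, 2))  # global average pooling -> (C,)
    max_desc = f.max(axis=(1, 2))   # global max pooling -> (C,)
    return sigmoid(mlp(avg_desc) + mlp(max_desc))


def spatial_attention(f: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Spatial weights in (0, 1), shape (H, W); kernel has shape (2, k, k) with odd k."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero "same" padding
    _, h, w = pooled.shape
    logits = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Cross-correlation, as computed by deep-learning convolution layers.
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(logits)


def dual_attention(f, w1, w2, kernel):
    """Sequential refinement: F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    f_c = f * channel_attention(f, w1, w2)[:, None, None]
    return f_c * spatial_attention(f_c, kernel)[None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, r, k = 8, 6, 6, 4, 7  # assumed sizes for the demo
    x = rng.normal(size=(C, H, W))
    w1 = rng.normal(scale=0.1, size=(C // r, C))
    w2 = rng.normal(scale=0.1, size=(C, C // r))
    kernel = rng.normal(scale=0.1, size=(2, k, k))
    y = dual_attention(x, w1, w2, kernel)
    print(y.shape)  # (8, 6, 6)
```

<p>Because both attention maps lie in (0, 1), the refined features can only be re-weighted downward, never amplified; the module reshapes emphasis without changing tensor shape. In a trained network the attention would sit on top of backbone feature maps and be implemented with framework layers rather than Python loops.</p>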
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>This study introduces TurbineBladeDetNet, a dual-attention convolutional architecture combining channel-wise and spatial attention with pre-trained InceptionV3 for automated blade defect detection from UAV imagery. Across five defect categories (missing teeth, erosion, paint-off, lightning damage, and cracks), the proposed model achieves 97.14% accuracy, an F1-score of 98.66%, precision of 98.65%, and recall of 98.68%. Experimental evaluation on the DTU dataset shows improvements of approximately five percentage points in accuracy and more than eight points in F1-score relative to the strongest baseline (DenseNet121) and recent wind turbine inspection methods, with an inference time of 0.0110 s per image that supports near-real-time UAV deployment. The dual-attention mechanism enables discriminative feature extraction for small, low-contrast defects by sequentially recalibrating channel importance and highlighting spatially relevant regions, while the multi-path design captures defect morphologies at multiple scales. An Albumentations-based augmentation pipeline addresses the severe class imbalance of the original distribution by expanding each training class to approximately 183 images through photometric adjustments, geometric transformations, occlusion simulation, and noise injection; the resulting balanced training set improves detection performance across both rare and common defect categories. Class-specific analysis reveals uniformly high sensitivity and specificity across all defect types, with lightning damage achieving 99.80% sensitivity, precision, and F1-score, and crack attaining perfect precision and specificity with a 98.94% F1-score. The model&#x2019;s balanced performance profile minimizes both missed detections and false alarms, supporting reliable triage in operational inspection workflows.</p>
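<p>The class-specific sensitivity, specificity, precision, and F1-score values above follow the standard one-vs-rest definitions applied to a multi-class confusion matrix. A minimal Python sketch of those definitions follows; the confusion matrix in the example is hypothetical and is not the paper&#x2019;s test-set result.</p>

```python
from typing import Dict, List

Metrics = Dict[str, float]


def one_vs_rest(cm: List[List[int]], labels: List[str]) -> Dict[str, Metrics]:
    """Per-class metrics; cm[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    results: Dict[str, Metrics] = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                 # class-k samples predicted as another class
        fp = sum(row[k] for row in cm) - tp  # other-class samples predicted as class k
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        results[name] = {"sensitivity": sens, "specificity": spec,
                         "precision": prec, "f1": f1}
    return results


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of all samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def macro_average(results: Dict[str, Metrics], key: str) -> float:
    """Unweighted mean of one metric over classes."""
    return sum(m[key] for m in results.values()) / len(results)


if __name__ == "__main__":
    labels = ["missing_teeth", "erosion", "paint_off", "lightning", "crack"]
    cm = [  # hypothetical counts: rows = true class, columns = predicted class
        [10, 0, 0, 0, 0],
        [0, 9, 1, 0, 0],
        [0, 1, 9, 0, 0],
        [0, 0, 0, 10, 0],
        [0, 0, 1, 0, 9],
    ]
    for name, m in one_vs_rest(cm, labels).items():
        print(name, {key: round(val, 4) for key, val in m.items()})
    print("accuracy:", accuracy(cm))
```

<p>In this toy matrix, crack has perfect precision and specificity (no other class is predicted as crack) but imperfect sensitivity, the same qualitative pattern reported for the crack class above. Because each one-vs-rest specificity is computed against all remaining classes, its true-negative pool is large, so specificity tends to be high in multi-class problems.</p>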
<p>Future work includes integrating explainable AI methods such as LIME [<xref ref-type="bibr" rid="ref-41">41</xref>] to provide transparent defect rationales, extending the model to multi-label classification for composite faults, and incorporating temporal analysis to track defect progression. We also plan to validate the model on multi-location datasets representing different geographical regions and turbine manufacturers to assess cross-domain generalization, and to expand coverage to additional defect types (delamination, trailing-edge separation, ice accumulation). Parameter-efficient fine-tuning approaches (<inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msup><mml:mi>SA</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math></inline-formula>VP [<xref ref-type="bibr" rid="ref-42">42</xref>], visual prompt tuning [<xref ref-type="bibr" rid="ref-43">43</xref>,<xref ref-type="bibr" rid="ref-44">44</xref>]) may facilitate efficient adaptation and edge deployment on resource-constrained UAV hardware. This framework provides a foundation for operational wind farm inspection systems.</p>
</sec>
</body>
<back>
<ack>
<p>We thank Jubail Industrial College for supporting this research.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Conceptualization, Mubarak Alanazi and Junaid Rashid; methodology, Mubarak Alanazi and Junaid Rashid; software, Mubarak Alanazi; validation, Mubarak Alanazi and Junaid Rashid; formal analysis, Mubarak Alanazi; investigation, Mubarak Alanazi; resources, Mubarak Alanazi; data curation, Mubarak Alanazi; writing&#x2014;original draft preparation, Mubarak Alanazi; writing&#x2014;review and editing, Mubarak Alanazi and Junaid Rashid; visualization, Mubarak Alanazi; supervision, Mubarak Alanazi; project administration, Mubarak Alanazi. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>We use the DTU Wind Turbine Blade Inspection Dataset [<xref ref-type="bibr" rid="ref-11">11</xref>] (<ext-link ext-link-type="uri" xlink:href="https://data.mendeley.com/datasets/hd96prn3nc/2">https://data.mendeley.com/datasets/hd96prn3nc/2</ext-link>, accessed on 10 June 2025), which is publicly available via Mendeley Data.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ruiz</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mujica</surname> <given-names>LE</given-names></string-name>, <string-name><surname>Alf&#x00E9;rez</surname> <given-names>S</given-names></string-name>, <string-name><surname>Acho</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tutiv&#x00E9;n</surname> <given-names>C</given-names></string-name>, <string-name><surname>Vidal</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Wind turbine fault detection and classification by means of image texture analysis</article-title>. <source>Mech Syst Signal Process</source>. <year>2018</year>;<volume>107</volume>(<issue>3</issue>):<fpage>149</fpage>&#x2013;<lpage>67</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ymssp.2017.12.035</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Denhof</surname> <given-names>D</given-names></string-name>, <string-name><surname>Staar</surname> <given-names>B</given-names></string-name>, <string-name><surname>L&#x00FC;tjen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Freitag</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Automatic optical surface inspection of wind turbine rotor blades using convolutional neural networks</article-title>. <source>Procedia CIRP</source>. <year>2019</year>;<volume>81</volume>(<issue>9</issue>):<fpage>1166</fpage>&#x2013;<lpage>70</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.procir.2019.03.286</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dolinski</surname> <given-names>L</given-names></string-name>, <string-name><surname>Krawczuk</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Analysis of modal parameters using a statistical approach for condition monitoring of the wind turbine blade</article-title>. <source>Appl Sci</source>. <year>2020</year>;<volume>10</volume>(<issue>17</issue>):<fpage>5878</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app10175878</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ke</surname> <given-names>Y</given-names></string-name>, <string-name><surname>See</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Towards accurate image stitching for drone-based wind turbine blade inspection</article-title>. <source>Renew Energy</source>. <year>2023</year>;<volume>203</volume>(<issue>2</issue>):<fpage>267</fpage>&#x2013;<lpage>79</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.renene.2022.12.063</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gohar</surname> <given-names>I</given-names></string-name>, <string-name><surname>Halimi</surname> <given-names>A</given-names></string-name>, <string-name><surname>See</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yew</surname> <given-names>WK</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Slice-aided defect detection in ultra high-resolution wind turbine blade images</article-title>. <source>Machines</source>. <year>2023</year>;<volume>11</volume>(<issue>10</issue>):<fpage>953</fpage>. doi:<pub-id pub-id-type="doi">10.3390/machines11100953</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Spaji&#x0107;</surname> <given-names>M</given-names></string-name>, <string-name><surname>Talaji&#x0107;</surname> <given-names>M</given-names></string-name>, <string-name><surname>Peji&#x0107; Bach</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Harnessing convolutional neural networks for automated wind turbine blade defect detection</article-title>. <source>Designs</source>. <year>2024</year>;<volume>9</volume>(<issue>1</issue>):<fpage>2</fpage>. doi:<pub-id pub-id-type="doi">10.3390/designs9010002</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Qiu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Automatic visual defects inspection of wind turbine blades via YOLO-based small object detection approach</article-title>. <source>J Electron Imaging</source>. <year>2019</year>;<volume>28</volume>(<issue>4</issue>):<fpage>043023</fpage>. doi:<pub-id pub-id-type="doi">10.1117/1.jei.28.4.043023</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>C</given-names></string-name></person-group>. <article-title>SOD-YOLO: a small target defect detection algorithm for wind turbine blades based on improved YOLOv5</article-title>. <source>Adv Theory Simul</source>. <year>2022</year>;<volume>5</volume>(<issue>7</issue>):<fpage>2100631</fpage>. doi:<pub-id pub-id-type="doi">10.1002/adts.202100631</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>X</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>LE-YOLO: lightweight and efficient detection model for wind turbine blade defects based on improved YOLO</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>(<issue>16</issue>):<fpage>135985</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3463391</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Tong</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Attention mechanism based on deep learning for defect detection of wind turbine blade via multi-scale features</article-title>. <source>Meas Sci Technol</source>. <year>2024</year>;<volume>35</volume>(<issue>10</issue>):<fpage>105408</fpage>. doi:<pub-id pub-id-type="doi">10.1088/1361-6501/ad6024</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Shihavuddin</surname> <given-names>ASM</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name></person-group>. <source>DTU&#x02014;drone inspection images of wind turbine [Dataset]</source>. <comment>2018 [cited 2026 Jan 26]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://data.mendeley.com/datasets/hd96prn3nc/2">https://data.mendeley.com/datasets/hd96prn3nc/2</ext-link>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xia</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sheng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A comprehensive review on blade damage detection and prediction</article-title>. In: <conf-name>2021 International Conference on Sensing, Measurement &#x0026; Data Analytics in the Era of Artificial Intelligence (ICSMD)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Keras</collab></person-group>. <article-title>Keras applications [Internet]</article-title>. <comment>2025 [cited 2025 Jun 10]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://keras.io/api/applications/">https://keras.io/api/applications/</ext-link>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liao</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Intelligent defect diagnosis for oil and gas pipeline based on lightweight CBAM-Inception-Resnet</article-title>. <source>Chem Eng Res Des</source>. <year>2025</year>;<volume>218</volume>:<fpage>548</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cherd.2025.05.030</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Seibi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ward</surname> <given-names>Z</given-names></string-name>, <string-name><surname>AS</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Shekaramiz</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Locating and extracting wind turbine blade cracks using Haar-like features and clustering</article-title>. In: <conf-name>2022 Intermountain Engineering, Technology and Computing (IETC)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2022</year>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ietc54973.2022.9796823</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Reddy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Indragandhi</surname> <given-names>V</given-names></string-name>, <string-name><surname>Ravi</surname> <given-names>L</given-names></string-name>, <string-name><surname>Subramaniyaswamy</surname> <given-names>V</given-names></string-name></person-group>. <article-title>Detection of cracks and damage in wind turbine blades using artificial intelligence-based image analytics</article-title>. <source>Measurement</source>. <year>2019</year>;<volume>147</volume>(<issue>6</issue>):<fpage>106823</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.measurement.2019.07.051</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xiang</surname> <given-names>BJ</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Wind turbine blade inspection based on unmanned aerial vehicle (UAV) visual systems</article-title>. In: <conf-name>2019 IEEE 3rd Conference on Energy Internet and Energy System Integration (EI2)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>708</fpage>&#x2013;<lpage>13</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Image-based damage recognition of wind turbine blades</article-title>. In: <conf-name>2017 2nd International Conference on Advanced Robotics and Mechatronics (ICARM)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>161</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Ge</surname> <given-names>SS</given-names></string-name></person-group>. <article-title>Defect identification of wind turbine blades based on defect semantic features with transfer feature extractor</article-title>. <source>Neurocomputing</source>. <year>2020</year>;<volume>376</volume>(<issue>5&#x2013;6</issue>):<fpage>1</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2019.09.071</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Image recognition of wind turbine blade defects using attention-based MobileNetv1-YOLOv4 and transfer learning</article-title>. <source>Sensors</source>. <year>2022</year>;<volume>22</volume>(<issue>16</issue>):<fpage>6009</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s22166009</pub-id>; <pub-id pub-id-type="pmid">36015768</pub-id></mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Turhal</surname> <given-names>U</given-names></string-name>, <string-name><surname>Onal</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Turhal</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Enhanced fault detection and diagnosis in photovoltaic arrays using a hybrid NCA-CNN model</article-title>. <source>Comput Model Eng Sci</source>. <year>2025</year>;<volume>143</volume>(<issue>2</issue>):<fpage>2307</fpage>. doi:<pub-id pub-id-type="doi">10.32604/cmes.2025.064269</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Alhanaf</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Farsadi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Balik</surname> <given-names>HH</given-names></string-name></person-group>. <article-title>Fault detection and classification in ring power system with DG penetration using hybrid CNN-LSTM</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>59953</fpage>&#x2013;<lpage>75</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3394166</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kandil</surname> <given-names>T</given-names></string-name>, <string-name><surname>Harris</surname> <given-names>A</given-names></string-name>, <string-name><surname>Das</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Enhancing fault detection and classification in wind farm power generation using convolutional neural networks (CNN) by leveraging LVRT embedded in numerical relays</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>104828</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2025.3580052</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Mask-MRNet: a deep neural network for wind turbine blade fault detection</article-title>. <source>J Renew Sustain Energy</source>. <year>2020</year>;<volume>12</volume>(<issue>5</issue>):<fpage>53302</fpage>. doi:<pub-id pub-id-type="doi">10.1063/5.0014223</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cosma</surname> <given-names>G</given-names></string-name>, <string-name><surname>Watkins</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Image enhanced mask R-CNN: a deep learning pipeline with new evaluation measures for wind turbine blade defect detection and classification</article-title>. <source>J Imaging</source>. <year>2021</year>;<volume>7</volume>(<issue>3</issue>):<fpage>46</fpage>. doi:<pub-id pub-id-type="doi">10.3390/jimaging7030046</pub-id>; <pub-id pub-id-type="pmid">34460702</pub-id></mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Diaz</surname> <given-names>P</given-names></string-name>, <string-name><surname>Tittus</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Fast detection of wind turbine blade damage using Cascade Mask R-DSCNN-aided drone inspection analysis</article-title>. <source>Signal Image Video Process</source>. <year>2023</year>;<volume>17</volume>(<issue>5</issue>):<fpage>2333</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11760-022-02450-6</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Du</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cava</surname> <given-names>DG</given-names></string-name></person-group>. <article-title>A motion-blurred restoration method for surface damage detection of wind turbine blades</article-title>. <source>Measurement</source>. <year>2023</year>;<volume>217</volume>(<issue>9</issue>):<fpage>113031</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.measurement.2023.113031</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dwivedi</surname> <given-names>D</given-names></string-name>, <string-name><surname>Babu</surname> <given-names>KVSM</given-names></string-name>, <string-name><surname>Yemula</surname> <given-names>PK</given-names></string-name>, <string-name><surname>Chakraborty</surname> <given-names>P</given-names></string-name>, <string-name><surname>Pal</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Identification of surface defects on solar PV panels and wind turbine blades using attention based deep learning model</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>131</volume>(<issue>4</issue>):<fpage>107836</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2023.107836</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Ultralytics</collab></person-group>. <article-title>Albumentations integration [Internet]</article-title>. <comment>2025 [cited 2025 Jun 10]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://docs.ultralytics.com/integrations/albumentations/">https://docs.ultralytics.com/integrations/albumentations/</ext-link>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Woo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Park</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Kweon</surname> <given-names>IS</given-names></string-name></person-group>. <article-title>CBAM: convolutional block attention module</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV)</conf-name>. <publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2018</year>. p. <fpage>3</fpage>&#x2013;<lpage>19</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Khan</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Park</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Adaptive channel attention and multi-path convolutional architecture for brain tumor detection using MRI images</article-title>. <source>Multimed Tools Appl</source>. <year>2025</year>;<volume>84</volume>(<issue>35</issue>):<fpage>44515</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-025-20911-1</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Simonyan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zisserman</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <comment>arXiv:1409.1556. 2014</comment>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Howard</surname> <given-names>AG</given-names></string-name></person-group>. <article-title>MobileNets: efficient convolutional neural networks for mobile vision applications</article-title>. <comment>arXiv:1704.04861. 2017</comment>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deep residual learning for image recognition</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>770</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Szegedy</surname> <given-names>C</given-names></string-name>, <string-name><surname>Vanhoucke</surname> <given-names>V</given-names></string-name>, <string-name><surname>Ioffe</surname> <given-names>S</given-names></string-name>, <string-name><surname>Shlens</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wojna</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Rethinking the inception architecture for computer vision</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>2818</fpage>&#x2013;<lpage>26</lpage>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Szegedy</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ioffe</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vanhoucke</surname> <given-names>V</given-names></string-name>, <string-name><surname>Alemi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Inception-v4, Inception-ResNet and the impact of residual connections on learning</article-title>. In: <conf-name>AAAI&#x2019;17: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2017</year>. p. <fpage>4278</fpage>&#x2013;<lpage>84</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chollet</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Xception: deep learning with depthwise separable convolutions</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>1251</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Van Der Maaten</surname> <given-names>L</given-names></string-name>, <string-name><surname>Weinberger</surname> <given-names>KQ</given-names></string-name></person-group>. <article-title>Densely connected convolutional networks</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>4700</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Le</surname> <given-names>QV</given-names></string-name></person-group>. <article-title>EfficientNetV2: smaller models and faster training</article-title>. In: <conf-name>Proceedings of the 38th International Conference on Machine Learning (ICML)</conf-name>. <publisher-loc>London, UK</publisher-loc>: <publisher-name>PMLR</publisher-name>; <year>2021</year>. p. <fpage>10096</fpage>&#x2013;<lpage>106</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>10012</fpage>&#x2013;<lpage>22</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ribeiro</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Guestrin</surname> <given-names>C</given-names></string-name></person-group>. <article-title>&#x201C;Why should I trust you?&#x201D; Explaining the predictions of any classifier</article-title>. In: <conf-name>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</conf-name>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>ACM</publisher-name>; <year>2016</year>. p. <fpage>1135</fpage>&#x2013;<lpage>44</lpage>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pei</surname> <given-names>W</given-names></string-name>, <string-name><surname>Xia</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>G</given-names></string-name></person-group>. <article-title>SA2VP: spatially aligned-and-adapted visual prompt</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <publisher-loc>Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2024</year>. p. <fpage>4450</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Han</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>E<sup>2</sup>VPT: an effective and efficient approach for visual prompt tuning</article-title>. <comment>arXiv:2307.13770. 2023</comment>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Han</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Prompt learns prompt: exploring knowledge-aware generative prompt collaboration for video captioning</article-title>. In: <conf-name>Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); 2023 Aug 19&#x2013;25; Macao, China</conf-name>. p. <fpage>1622</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>