<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">34876</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.034876</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>RT-YOLO: A Residual Feature Fusion Triple Attention Network for Aerial Image Target Detection</article-title>
<alt-title alt-title-type="left-running-head">RT-YOLO: A Residual Feature Fusion Triple Attention Network for Aerial Image Target Detection</alt-title>
<alt-title alt-title-type="right-running-head">RT-YOLO: A Residual Feature Fusion Triple Attention Network for Aerial Image Target Detection</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Pan</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Deng</surname><given-names>Hongwei</given-names></name><email>dhwwhd@163.com</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Zhong</given-names></name></contrib>
<aff id="aff-1"><institution>College of Computer Science and Technology, Hengyang Normal University</institution>, <addr-line>Hengyang, 421002</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Hongwei Deng. Email: <email>dhwwhd@163.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>24</day><month>1</month><year>2023</year></pub-date>
<volume>75</volume>
<issue>1</issue>
<fpage>1411</fpage>
<lpage>1430</lpage>
<history>
<date date-type="received"><day>30</day><month>7</month><year>2022</year></date>
<date date-type="accepted"><day>14</day><month>12</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Zhang et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_34876.pdf"></self-uri>
<abstract><p>In recent years, target detection in unmanned aerial vehicle (UAV) aerial images has become one of the hottest research topics. However, detectors applied to UAV aerial images often produce false detections and missed detections. We propose a modified you only look once (YOLO) model to address the problems arising in object detection in UAV aerial images: (1) A new residual structure is designed to improve feature extraction by enhancing the fusion of the internal features of a single layer. At the same time, a triplet attention module is added to strengthen the connection between the spatial and channel dimensions and to better retain important feature information. (2) The feature information is enriched by improving the multi-scale feature pyramid structure and strengthening feature fusion at different scales. (3) A new loss function is created, and a diagonal penalty term for the anchor frame is introduced to improve training speed and inference accuracy. The proposed model is called residual feature fusion triple attention YOLO (RT-YOLO). Experiments show that RT-YOLO increases the mean average precision (mAP) from 57.2&#x0025; to 60.8&#x0025; on the vehicle detection in aerial image (VEDAI) dataset and by 1.7&#x0025; on the remote sensing object detection (RSOD) dataset. The results show that RT-YOLO outperforms other mainstream models in UAV aerial image object detection.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Attention mechanism</kwd>
<kwd>small target detection</kwd>
<kwd>YOLOv5s</kwd>
<kwd>RT-YOLO</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>With the continuous development of artificial intelligence technology and computer hardware, artificial intelligence has made progress in many fields, and society has gradually entered the era of intelligence. Driven by scientific and technological innovation, all sectors of society are exploring how to introduce artificial intelligence into their respective fields. In the unmanned aerial vehicle (UAV) field, aerial photography offers a wide range of potential applications in military and public domains, including military reconnaissance, emergency disaster relief, land surveying, mapping, and agricultural plant protection [<xref ref-type="bibr" rid="ref-1">1</xref>].</p>
<p>In the course of intelligent urban development, object detection, which identifies, classifies, and locates targets in aerial photographs, has become one of the most important tasks in UAV aerial photography. However, UAV aerial images have complex backgrounds, unevenly distributed targets, and many small and overlapping targets, so existing general-purpose algorithms do not detect them well. Therefore, researchers have conducted many studies on UAV aerial image detection methods.</p>
<p>Current UAV aerial object detection techniques are based on convolutional neural networks (CNNs) [<xref ref-type="bibr" rid="ref-2">2</xref>] and fall into two categories: two-stage approaches based on candidate regions and single-stage approaches based on regression. A two-stage technique searches for regions of interest, for example with a region proposal network (RPN), and then generates the category and location information for each region. R-CNN [<xref ref-type="bibr" rid="ref-3">3</xref>], Fast R-CNN [<xref ref-type="bibr" rid="ref-4">4</xref>], Faster R-CNN [<xref ref-type="bibr" rid="ref-5">5</xref>], and Mask R-CNN [<xref ref-type="bibr" rid="ref-6">6</xref>] are examples of such algorithms. A one-stage algorithm treats object detection as a regression problem: once the image is fed into the network, the output layer directly produces the target&#x2019;s position and category. The YOLO series [<xref ref-type="bibr" rid="ref-7">7</xref>&#x2013;<xref ref-type="bibr" rid="ref-11">11</xref>] and the single shot multibox detector (SSD) series [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>] are representative examples. Moreover, Bera&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-14">14</xref>] carried out a detailed performance analysis of four CNN models: 1D CNN, 2D CNN, 3D CNN, and feature fusion based on CNN (FFCNN).</p>
<p>In recent years, researchers have proposed many small object detection algorithms based on deep learning. Starting from general detectors, they have sought improvements in accuracy and speed through data augmentation, multi-scale learning, context learning, and combination with generative adversarial networks (GANs), among other methods. Sommer&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-15">15</xref>] used Fast R-CNN and Faster R-CNN for vehicle detection in aerial images, accommodating small targets by adjusting the size of the anchor frame and the resolution of the feature map. A simple and effective ratio-matching approach was described in [<xref ref-type="bibr" rid="ref-16">16</xref>]. Images are scaled and spliced throughout the training process, so that large targets in the dataset are transformed into medium-size targets and medium-size targets into small targets, which increases the number and quality of small-scale targets. Yang&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-17">17</xref>] introduced the attention mechanism into target detection, using a supervised multi-dimensional attention network (MDA-NET) to highlight the target features and weaken the background features. Ibrahim&#x00A0;et&#x00A0;al.&#x00A0;proposed an adaptive dynamic particle swarm algorithm [<xref ref-type="bibr" rid="ref-18">18</xref>]. Combined with the guided whale optimization algorithm (WOA), it improves prediction performance by optimizing the parameters of the long short-term memory (LSTM) classification method. Rao&#x00A0;et&#x00A0;al.&#x00A0;used a newly designed rectified linear unit (ReLU) to propose a new model [<xref ref-type="bibr" rid="ref-19">19</xref>], which inserts a ReLU layer before the convolution layer. This structure transfers semantic information more smoothly from the shallow layers to the deep layers.
It prevents network degradation and improves the performance of deep networks. Yang&#x00A0;et&#x00A0;al.&#x00A0;proposed the semi-supervised attention (SSA) model [<xref ref-type="bibr" rid="ref-20">20</xref>], which has a semi-supervised attention structure for different small target images. Using unlabeled data helps reduce intra-class variation and extract more discriminative features. Singh&#x00A0;et&#x00A0;al.&#x00A0;introduced scale normalization for image pyramids (SNIP) [<xref ref-type="bibr" rid="ref-21">21</xref>], a multi-scale training strategy that trains on each scale of the pyramid and efficiently employs all of the training data. Although it considerably improved the detection of small targets, the speed decreased. The high-resolution network (HRNet) [<xref ref-type="bibr" rid="ref-22">22</xref>] achieved considerable progress in accuracy by using a parallel structure to fuse feature maps of many scales and generate more robust multi-scale feature information. Xu&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-23">23</xref>] proposed a novel relational graph attention network that incorporates edge attributes. It applies top-k attention mechanisms to the edge attributes to learn hidden semantic context, which improves network performance. Chen&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-24">24</xref>] presented an improved YOLOv4 algorithm, which increases the dimension of the effective feature layer of the backbone network and introduces the cross stage partial (CSP) structure into the path aggregation network (PANet). The computational complexity of the model is reduced, and the aggregation efficiency of effective features at different scales is improved.
The authors of [<xref ref-type="bibr" rid="ref-25">25</xref>] proposed a multi-scale symbolic method, which combines symbolization and multi-scale technology with compression to enhance feature extraction. Su&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-26">26</xref>] combined UAV sensing, multispectral imaging, vegetation segmentation, and U-Net to design a spectrum-based classifier and conducted a systematic evaluation to improve performance in UAV visual perception. The work in [<xref ref-type="bibr" rid="ref-27">27</xref>] proposed the multi-objective artificial hummingbird algorithm (MOAHA). A non-dominated sorting strategy is merged with MOAHA to construct a solution update mechanism, which effectively refines Pareto optimal solutions and improves the convergence of the algorithm.</p>
<p>To solve the problem that the features of low-resolution small targets cannot be detected, some scholars have combined generative adversarial networks with detection models. They proposed methods such as Perceptual GAN [<xref ref-type="bibr" rid="ref-28">28</xref>], SOD-MTGAN [<xref ref-type="bibr" rid="ref-29">29</xref>], and CGAN [<xref ref-type="bibr" rid="ref-30">30</xref>]. However, the complexity of generative adversarial networks is too high to meet the needs of UAV aerial image target detection. Therefore, some researchers have advocated incorporating lightweight network models, such as the MobileNet [<xref ref-type="bibr" rid="ref-31">31</xref>&#x2013;<xref ref-type="bibr" rid="ref-33">33</xref>] series and the ShuffleNet [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>] series, for use in real-world applications.</p>
<p>Currently, commonly used object detection algorithms perform poorly on aerial image targets. Aerial image object detection mainly faces the following problems:
<list list-type="simple">
<list-item><label>(1)</label><p>In a UAV aerial image, the size range of the targets is vast, and small targets occupy a very small proportion of the image and provide limited resolution, which makes them difficult to detect.</p></list-item>
<list-item><label>(2)</label><p>Similar objects appear in the dense target regions of UAV aerial images, resulting in a higher incidence of missed detections and false alarms. Furthermore, a significant amount of background noise weakens or obscures the targets, making consistent and complete detection impossible.</p></list-item>
</list></p>
<p>To address these issues, we propose an improved YOLOv5 model named residual feature fusion triple attention YOLO (RT-YOLO). First, a newly designed residual module enhances the utilization of single-layer internal features, and an added triplet attention module establishes spatial and channel connections, improving the ability of the backbone network to extract features. Then, the feature pyramid structure is improved to integrate the feature maps at various scales and reduce the loss of feature information. Finally, we propose a new loss function that introduces a diagonal penalty term for the anchor frames, which improves training speed and inference accuracy. On the vehicle detection in aerial image (VEDAI) dataset [<xref ref-type="bibr" rid="ref-36">36</xref>] and the remote sensing object detection (RSOD) dataset [<xref ref-type="bibr" rid="ref-37">37</xref>], we compare RT-YOLO with other advanced object detection algorithms. Experimental results show that RT-YOLO is more suitable for object detection in UAV aerial images.</p>
<p>The rest of this paper is organized as follows. Section 2 introduces the related work, including the UAV aerial datasets, YOLOv5, and triplet attention module. Section 3 introduces the method set out in the present paper. Section 4 presents the details of the experiment, including the experimental datasets, experimental environment, and evaluation indicators. Section 5 introduces the relevant experiments and a discussion of the experimental results. Section 6 summarizes the work of this paper.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<sec id="s2_1"><label>2.1</label><title>Different Image Datasets</title>
<p>Because UAV aerial images differ from natural scene images, target detection algorithms trained on conventional image datasets are ineffective in UAV scene tasks. Researchers have proposed aerial image datasets to address this problem; the relevant datasets are shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Comparison of different aerial image datasets</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Dataset</th>
<th align="left">Year published</th>
<th align="left">Number of images</th>
<th align="left">Image size (pixels)</th>
<th align="left">Number of classes</th>
<th align="left">Total targets</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">VEDAI</td>
<td align="left">2015</td>
<td align="left">1210</td>
<td align="left">1024</td>
<td align="left">9</td>
<td align="left">3640</td>
</tr>
<tr>
<td align="left">RSOD</td>
<td align="left">2015</td>
<td align="left">976</td>
<td align="left">1044</td>
<td align="left">4</td>
<td align="left">6950</td>
</tr>
<tr>
<td align="left">UCAS-AOD</td>
<td align="left">2015</td>
<td align="left">910</td>
<td align="left">1280</td>
<td align="left">2</td>
<td align="left">6029</td>
</tr>
<tr>
<td align="left">DOTA-v1.0</td>
<td align="left">2017</td>
<td align="left">2806</td>
<td align="left">800&#x2013;4000</td>
<td align="left">15</td>
<td align="left">188 282</td>
</tr>
<tr>
<td align="left">DOTA-v1.5</td>
<td align="left">2019</td>
<td align="left">2806</td>
<td align="left">800&#x2013;4000</td>
<td align="left">16</td>
<td align="left">400 000</td>
</tr>
<tr>
<td align="left">VisDrone</td>
<td align="left">2019</td>
<td align="left">10209</td>
<td align="left">2000</td>
<td align="left">10</td>
<td align="left">89 777</td>
</tr>
<tr>
<td align="left">DroneVehicle</td>
<td align="left">2020</td>
<td align="left">31064</td>
<td align="left">840</td>
<td align="left">5</td>
<td align="left">441 642</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Images in the VEDAI dataset originate from Utah and are widely used for multi-class vehicle detection in aerial images. The dataset of object detection in aerial images (UCAS-AOD) has only two categories, aircraft and vehicle, and is used to detect these targets in aerial images. The dataset for object detection in aerial images (DOTA) is a large dataset whose images are acquired through different sensors and platforms. It includes target objects with different proportions, orientations, and shapes. The vision meets drone (VisDrone) dataset contains videos and images captured under various weather and light conditions. It can be used for four challenge tasks: UAV aerial image target detection, video target detection, single-target tracking, and multi-target tracking. The drone based vehicle detection (DroneVehicle) dataset contains red, green, and blue (RGB) images and infrared images for vehicle detection and vehicle counting tasks.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>YOLOv5</title>
<p>YOLOv5 [<xref ref-type="bibr" rid="ref-11">11</xref>], which was released in 2020, is a regression-based target detection algorithm that comes in four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s has the smallest network depth and feature map width. This paper selects YOLOv5s for training to minimize processing cost and memory and to keep the network lightweight. The exact structure is given in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>The structure of the YOLOv5 network</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-1.tif"/></fig>
<p>The input part of the YOLOv5 network performs data preprocessing, the backbone performs feature extraction, the neck performs feature fusion, and the output part performs object detection. CSPDarknet53 serves as the backbone network, and feature maps of various sizes are obtained via repeated convolutions and merging. The backbone network generates four levels of feature maps, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. When the input image is 640&#x2009;&#x00D7;&#x2009;640 pixels, the resulting feature maps are 160&#x2009;&#x00D7;&#x2009;160, 80&#x2009;&#x00D7;&#x2009;80, 40&#x2009;&#x00D7;&#x2009;40, and 20&#x2009;&#x00D7;&#x2009;20 pixels. The neck network uses the feature pyramid network (FPN) and path aggregation network (PAN). The FPN structure transmits semantic features from the top feature map to the bottom feature map, while the PAN structure transmits positioning features from the bottom feature map to the top feature map. Three feature fusion layers then fuse these different levels of feature maps to obtain more contextual information and generate three feature maps of various sizes. The output part detects and classifies objects on these feature maps, whose sizes are 80&#x2009;&#x00D7;&#x2009;80&#x2009;&#x00D7;&#x2009;255, 40&#x2009;&#x00D7;&#x2009;40&#x2009;&#x00D7;&#x2009;255, and 20&#x2009;&#x00D7;&#x2009;20&#x2009;&#x00D7;&#x2009;255, where 255 denotes the number of channels. The 80&#x2009;&#x00D7;&#x2009;80&#x2009;&#x00D7;&#x2009;255 map is used to detect small objects, and the 20&#x2009;&#x00D7;&#x2009;20&#x2009;&#x00D7;&#x2009;255 map is used to detect large objects.</p>
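<p>The sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the YOLOv5 code base: the function names are hypothetical, the strides 4, 8, 16, and 32 are inferred from the 640-pixel input and the listed map sizes, and the values of 3 anchors per cell and 80 classes are assumptions (common defaults that reproduce the 255 channels) rather than settings stated in this paper.</p>
<code language="python">
```python
def feature_map_sizes(input_size=640, strides=(4, 8, 16, 32)):
    """Side length of each backbone feature map for a square input.

    The strides are assumed; they are inferred from the sizes in the text.
    """
    return [input_size // stride for stride in strides]


def head_channels(num_classes=80, anchors_per_cell=3):
    """Channels of one detection map.

    Each anchor predicts 4 box offsets, 1 objectness score, and one score
    per class. Both default values are assumptions, not paper settings.
    """
    return anchors_per_cell * (4 + 1 + num_classes)


if __name__ == "__main__":
    print(feature_map_sizes())  # backbone map sizes for a 640 x 640 input
    print(head_channels())      # channel count of each detection map
```
</code>
<p>With a different number of classes, only the channel count changes; the spatial sizes depend solely on the input size and the strides.</p>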

<p>To clarify the architecture, we describe the basic modules of YOLOv5; the structure of each is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The CBL module, shown in <xref ref-type="fig" rid="fig-2">Fig. 2a</xref>, consists of a convolution layer, a batch normalization layer, and a Leaky ReLU activation function, and is the smallest component of the YOLO network. The Focus module, shown in <xref ref-type="fig" rid="fig-2">Fig. 2b</xref>, aids the backbone network in extracting features through slicing and concatenation operations. The spatial pyramid pooling (SPP) module, shown in <xref ref-type="fig" rid="fig-2">Fig. 2c</xref>, performs feature fusion by max pooling with kernels of different sizes. Cross stage partial (CSP) structures fall into two categories, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2d</xref>: CSP1_X for the backbone network and CSP2_X for the neck network. CSP1_X is made up of X residual units, while CSP2_X is made up of CBL modules. Through cross-layer connections, the CSP structure decreases model complexity and speeds up inference.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>Structure of each functional module of YOLOv5: (a) CBL; (b) Focus; (c) SPP; (d) CSP</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-2.tif"/></fig>
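<p>The slicing and concatenation performed by the Focus module can be illustrated without any deep learning library. The sketch below is an illustration under stated assumptions, not the original implementation: the function name <monospace>focus_slice</monospace> is hypothetical, the tensor is a nested list indexed as [channel][row][column], and the convolution that normally follows the slicing is left out.</p>
<code language="python">
```python
def focus_slice(x):
    """Rearrange a [C][H][W] nested list into [4C][H/2][W/2].

    Every second pixel is taken at four different (row, column) offsets and
    the four sub-images are stacked along the channel axis, so spatial
    resolution is halved without discarding any value.
    """
    height, width = len(x[0]), len(x[0][0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    out = []
    # Offsets (row, column) of the four interleaved sub-images.
    for row_off, col_off in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for channel in x:
            out.append([row[col_off::2] for row in channel[row_off::2]])
    return out


if __name__ == "__main__":
    image = [[[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11],
              [12, 13, 14, 15]]]
    for sub_image in focus_slice(image):
        print(sub_image)
```
</code>
<p>Applied to a 3-channel 640&#x2009;&#x00D7;&#x2009;640 input, the same rearrangement would give 12 channels of 320&#x2009;&#x00D7;&#x2009;320.</p>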
</sec>
<sec id="s2_3"><label>2.3</label><title>Triplet Attention</title>
<p>The triplet attention module [<xref ref-type="bibr" rid="ref-38">38</xref>] is used to make full use of small-target features. Through rotation operations and residual transformations, triplet attention builds relationships between dimensions, which improves the quality of the spatial and channel information. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> depicts the structure of the module.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>The module structure of triplet attention</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-3.tif"/></fig>
<p>Triplet attention has three branches: the channel attention calculation branch, the branch that captures the interaction between the channel dimension C and the spatial dimension W, and the branch that captures the interaction between the channel dimension C and the spatial dimension H. This cross-dimensional interaction avoids an indirect relationship between channels and weights.</p>
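<p>The three-branch idea can be sketched in plain Python. This is a simplified illustration, not the module used in RT-YOLO: each branch pools over one of the three axes (keeping a maximum map and a mean map, the Z-pool step of the original triplet attention design [<xref ref-type="bibr" rid="ref-38">38</xref>]), turns the pooled maps into a sigmoid gate, and rescales the input; the three results are averaged. The learned convolution and batch normalization that normally produce the gate are replaced here by a plain average of the two pooled maps, which is an assumption made only to keep the sketch dependency-free. All function names are hypothetical.</p>
<code language="python">
```python
import math


def shape(x):
    """Return (C, H, W) of a nested-list tensor x[c][h][w]."""
    return len(x), len(x[0]), len(x[0][0])


def z_pool(x, axis):
    """Pool over `axis`; return a max map and a mean map over the other two axes."""
    dims = shape(x)
    keep = [a for a in range(3) if a != axis]

    def values(i, j):
        idx = [0, 0, 0]
        idx[keep[0]], idx[keep[1]] = i, j
        collected = []
        for k in range(dims[axis]):
            idx[axis] = k
            collected.append(x[idx[0]][idx[1]][idx[2]])
        return collected

    rows, cols = dims[keep[0]], dims[keep[1]]
    max_map = [[max(values(i, j)) for j in range(cols)] for i in range(rows)]
    mean_map = [[sum(values(i, j)) / dims[axis] for j in range(cols)]
                for i in range(rows)]
    return max_map, mean_map


def branch(x, axis):
    """Gate x with a sigmoid weight derived from the Z-pool over `axis`."""
    max_map, mean_map = z_pool(x, axis)
    keep = [a for a in range(3) if a != axis]
    c_dim, h_dim, w_dim = shape(x)
    out = [[[0.0] * w_dim for _ in range(h_dim)] for _ in range(c_dim)]
    for c in range(c_dim):
        for h in range(h_dim):
            for w in range(w_dim):
                idx = (c, h, w)
                i, j = idx[keep[0]], idx[keep[1]]
                # Assumption: a plain average stands in for the learned convolution.
                z = 0.5 * (max_map[i][j] + mean_map[i][j])
                out[c][h][w] = x[c][h][w] / (1.0 + math.exp(-z))
    return out


def triplet_attention(x):
    """Average the three gated branches (pooling over C, H, and W in turn)."""
    b0, b1, b2 = (branch(x, axis) for axis in range(3))
    c_dim, h_dim, w_dim = shape(x)
    return [[[(b0[c][h][w] + b1[c][h][w] + b2[c][h][w]) / 3.0
              for w in range(w_dim)] for h in range(h_dim)]
            for c in range(c_dim)]
```
</code>
<p>In this sketch, pooling over H yields a C&#x2009;&#x00D7;&#x2009;W gate and pooling over W yields a C&#x2009;&#x00D7;&#x2009;H gate, which correspond to the two channel-spatial interaction branches described above; pooling over C yields an H&#x2009;&#x00D7;&#x2009;W gate. The output always has the same shape as the input.</p>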
</sec>
</sec>
<sec id="s3"><label>3</label><title>The Proposed Methods</title>
<p>We designed a residual feature fusion triple attention network to improve detection performance on small objects in aerial images. Three improvements are made to the original YOLOv5 algorithm, as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Network structure of RT-YOLO method</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-4.tif"/></fig>
<p>Small targets dominate UAV images, and the limited resolution they provide makes them difficult to detect. First, we design the Res2T residual module and add triplet attention. It expresses multi-scale features at a fine-grained level while capturing cross-dimensional interaction information, which is more conducive to extracting the internal feature information of a single layer. To reduce the number of parameters in the backbone network, the convolution layer on the short branch of the original CSP structure is removed, and the Res-unit on the long branch is replaced with Res2T. We name the resulting structure CSP_RT. Second, in line with the characteristics of the YOLO neck network, we improve the spatial pyramid pooling module: fusing more receptive fields and using smaller max pooling kernels makes the algorithm pay more attention to local information and enhances its ability to extract small-target features. Finally, we propose RIOU_Loss as the loss function for bounding box regression, which takes into account the aspect ratio, center point distance, and diagonal length. This effectively handles the case where the prediction box lies inside the target box and the prediction boxes have the same size, improving positioning accuracy and speeding up network convergence.</p>
<sec id="s3_1"><label>3.1</label><title>Feature Extraction Enhancement Module</title>
<p>Small targets occupy only a small fraction of a UAV aerial image, and the resolution they provide is limited, which makes feature extraction challenging. To overcome this problem, we designed the Res2T single-layer feature fusion module and improved the CSP feature extraction structure.</p>
<sec id="s3_1_1"><label>3.1.1</label><title>Res2T</title>
<p>In the backbone network of the original YOLOv5 algorithm, the CSP feature extraction module takes advantage of the residual structure. Although this mitigates the vanishing gradient problem as the network deepens, the CSP structure still represents features with a layer-wise multi-scale representation, leaving the internal features of a single layer underutilized, and the YOLOv5 algorithm gives equal attention to each channel feature. This architecture restricts the algorithm&#x2019;s detection performance to some extent. In response to these issues, we drew on the Res2Net module proposed by Gao&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-39">39</xref>], combined it with the triplet attention mechanism, and designed the Res2T module. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> depicts the specific structure.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>The structure of Res2T</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-5.tif"/></fig>
<p>In each Res2T structure, after a 1&#x2009;&#x00D7;&#x2009;1 convolutional layer, the input feature map is evenly divided into S sub-feature maps (S &#x003D; 4 is selected in this article). All sub-feature maps have the same spatial size, but each has 1/S of the channels of the input feature map. Each sub-feature map K<sub>i</sub> has a corresponding 3&#x2009;&#x00D7;&#x2009;3 convolution, whose output is Y<sub>i</sub>. For i &#x2265; 2, the sub-feature map K<sub>i</sub> is added to the output Y<sub>i&#x2212;1</sub> of the previous group, and the sum is the input to the 3&#x2009;&#x00D7;&#x2009;3 convolution corresponding to K<sub>i</sub>. To reduce the number of parameters, the 3&#x2009;&#x00D7;&#x2009;3 convolution of K<sub>1</sub> is omitted. The computation is expressed as formula <xref ref-type="disp-formula" rid="eqn-1">(1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mtext>Y</mml:mtext><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:msub><mml:mtext>K</mml:mtext><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mtext>Conv</mml:mtext><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mtext>K</mml:mtext><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mtext>Y</mml:mtext><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mn>2</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>S</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>S is a control parameter that sets the number of groups into which the input channels are divided. The larger S is, the stronger the multi-scale capability of Res2T. Different values of S yield outputs with receptive fields of different sizes.</p>
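<p>The recurrence in formula (1) can be traced with a small, dependency-free sketch. It is only an illustration under stated assumptions: each channel is reduced to a single number, the hypothetical helper names <monospace>split_channels</monospace> and <monospace>res2t_fuse</monospace> are not from the paper, and the 3&#x2009;&#x00D7;&#x2009;3 convolution is replaced by a caller-supplied stand-in function. The triplet attention step of Res2T is not included.</p>
<code language="python">
```python
def split_channels(channels, s=4):
    """Split a flat list of channels evenly into s groups K_1 ... K_s."""
    if len(channels) % s != 0:
        raise ValueError("number of channels must be divisible by s")
    width = len(channels) // s
    return [channels[i * width:(i + 1) * width] for i in range(s)]


def res2t_fuse(groups, conv):
    """Apply formula (1): Y_1 = K_1 and Y_i = conv(K_i + Y_{i-1}) for i >= 2.

    `conv` is a stand-in for the 3 x 3 convolution; it maps one group
    (a list of numbers here) to a group of the same length.
    """
    outputs = [list(groups[0])]  # the first group is passed through unchanged
    for k in groups[1:]:
        summed = [a + b for a, b in zip(k, outputs[-1])]
        outputs.append(conv(summed))
    return outputs


if __name__ == "__main__":
    def halve(group):
        # Toy stand-in for a learned convolution.
        return [0.5 * value for value in group]

    k_groups = split_channels([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], s=4)
    for index, y in enumerate(res2t_fuse(k_groups, halve), start=1):
        print("Y", index, y)
```
</code>
<p>Because each Y<sub>i</sub> depends on Y<sub>i&#x2212;1</sub>, later groups pass through more stand-in convolutions than earlier ones, which is how a single layer accumulates receptive fields of several sizes.</p>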
</sec>
<sec id="s3_1_2"><label>3.1.2</label><title>CSP_RT</title>
<p>In the original YOLOv5 algorithm, the feature information of small targets becomes weaker as the network deepens, which results in missed detections and false detections in aerial small target detection tasks. During network optimization, the CSP structure of the backbone network eliminates redundant gradient information. However, its many convolution kernels increase the number of parameters as the network depth increases. To address these issues, this article improves the CSP structure by deleting the convolutional layer on the short branch of the original module, directly connecting the input feature map of the CSP module with the output feature map of the long branch, and replacing the residual unit of the CSP long branch with the Res2T module. The improved feature extraction module is called CSP_RT, and its structure is shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>CSP_RT module structure diagram</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-6.tif"/></fig>
<p>The CSP_RT structure, when compared to the original CSP structure, can extract shallow feature information more effectively, fuse crucial spatial and channel data in the feature map, and effectively increase the detection effect of small targets without increasing the number of parameters. Furthermore, multi-scale feature extraction improves the algorithm&#x2019;s semantic representation.</p>
</sec>
</sec>
<sec id="s3_2"><label>3.2</label><title>Feature Pyramid Module</title>
<p>The YOLOv5 algorithm uses a three-layer feature map detection design. For an input image of 640&#x2009;&#x00D7;&#x2009;640, feature maps downsampled by factors of 8, 16, and 32 serve as the feature layers for detecting targets of various sizes. To prevent the loss of small-target information, each pixel of the output feature maps should respond to the image region that corresponds to a small target. We therefore modified the pooling module of the spatial pyramid pooling (SPP) structure, using four groups of maximum pooling layers of different sizes to improve the fusion of multiple receptive fields. To match the structure of the YOLO outputs, the module with maximum pooling filters of 1&#x2009;&#x00D7;&#x2009;1, 3&#x2009;&#x00D7;&#x2009;3, 5&#x2009;&#x00D7;&#x2009;5, 7&#x2009;&#x00D7;&#x2009;7, and 9&#x2009;&#x00D7;&#x2009;9 is named SPP1, and the module with maximum pooling filters of 3&#x2009;&#x00D7;&#x2009;3, 5&#x2009;&#x00D7;&#x2009;5, 7&#x2009;&#x00D7;&#x2009;7, and 9&#x2009;&#x00D7;&#x2009;9 is named SPP2. These structures are shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Improved spatial pyramid pooling micro structures: (a) SPP1; (b) SPP2</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-7.tif"/></fig>
<p>In <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, CBL is a combined module that consists of a convolution layer, a BN layer, and an activation function layer. Adding smaller maximum pooling layers improves the fusion of multiple receptive fields and makes the algorithm pay more attention to local information, which improves the detection accuracy of small targets. In this paper, the SPP1 module replaces the SPP module of the original backbone network. In the neck network, an SPP module is added after the first upsampling and an SPP2 module after the second upsampling. Using a pyramid pooling module matched to the characteristics of each output detection layer strengthens the feature representation of that layer and achieves a better detection effect.</p>
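<p>The parallel pooling branches described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes stride-1 pooling that preserves the spatial size, works on a single-channel map stored as nested lists, and omits the CBL layers. The function names and the toy feature map are hypothetical; only the kernel sizes come from the text.</p>

```python
def max_pool_same(grid, k):
    """Stride-1 max pooling with a k x k window (k odd); borders are handled by
    clipping the window, which matches padding with minus infinity."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                grid[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def spp(grid, kernel_sizes):
    """Return one pooled map per kernel size; a real SPP block would
    concatenate these maps along the channel axis."""
    return [max_pool_same(grid, k) for k in kernel_sizes]


SPP1_KERNELS = (1, 3, 5, 7, 9)  # sizes listed for SPP1 in the text
SPP2_KERNELS = (3, 5, 7, 9)     # sizes listed for SPP2 in the text

# Toy 6 x 6 single-channel feature map (hypothetical values).
feature_map = [[(3 * y + 5 * x) % 11 for x in range(6)] for y in range(6)]
branches = spp(feature_map, SPP1_KERNELS)
print(len(branches), len(branches[0]), len(branches[0][0]))  # 5 6 6
```

<p>In this sketch the 1&#x2009;&#x00D7;&#x2009;1 branch of SPP1 returns the input unchanged, so it behaves like an identity path, while the larger kernels summarize progressively wider neighborhoods of the same map.</p>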
</sec>
<sec id="s3_3"><label>3.3</label><title>Loss Function</title>
<p>The original loss function of the YOLOv5 algorithm is shown in formula <xref ref-type="disp-formula" rid="eqn-2">(2)</xref>. Two cross-entropy loss functions are used for the confidence loss and the class loss. For the position loss, generalized intersection over union (GIOU) is used, and GIOU_Loss is shown in formula <xref ref-type="disp-formula" rid="eqn-3">(3)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>GIOU</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>conf</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>class</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mtext>GIOU</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>IOU</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>Q</mml:mtext></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula>where C is the area of the smallest rectangle enclosing the predicted box and the ground-truth box, and Q is the part of that rectangle not covered by the union of the two boxes. However, like intersection over union (IOU), GIOU considers only the degree of overlap between the two boxes. When one box contains the other, as shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, Q is empty and GIOU_Loss degrades to the IOU loss, so the position of the inner box cannot be optimized further.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>Situations in which the GIOU loss degrades to the IOU loss</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-8.tif"/></fig>
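<p>The degradation of GIOU_Loss under containment can be checked numerically with a short sketch of formula (3). The box format (x1, y1, x2, y2), the function names, and the sample coordinates are assumptions made for illustration, and both boxes are assumed to have positive area.</p>

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou_loss(pred, truth):
    """Return (1 - IOU, GIOU_Loss) for two boxes given as (x1, y1, x2, y2)."""
    inter = area((max(pred[0], truth[0]), max(pred[1], truth[1]),
                  min(pred[2], truth[2]), min(pred[3], truth[3])))
    union = area(pred) + area(truth) - inter
    iou = inter / union
    # C: area of the smallest rectangle enclosing both boxes.
    c = area((min(pred[0], truth[0]), min(pred[1], truth[1]),
              max(pred[2], truth[2]), max(pred[3], truth[3])))
    # |Q|: part of the enclosing rectangle not covered by the union.
    q = c - union
    return 1.0 - iou, 1.0 - iou + q / c


truth = (0, 0, 10, 10)
# Two predictions of equal size that both lie inside the ground-truth box.
print(iou_and_giou_loss((1, 1, 5, 5), truth))
print(iou_and_giou_loss((4, 4, 8, 8), truth))
# A prediction that does not overlap the ground-truth box at all.
print(iou_and_giou_loss((20, 0, 30, 10), truth))
```

<p>For the two contained predictions the penalty term is zero, so both loss values coincide and do not depend on where the inner box sits. Only the non-overlapping case receives an extra penalty.</p>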
<p>To address these problems, we designed the RIOU_Loss function. RIOU_Loss considers three factors: the overlap area, the distance between center points, and the diagonal length. It is defined in formulas <xref ref-type="disp-formula" rid="eqn-4">(4)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-6">(6)</xref> below.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:mtext>RIOU</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>Loss</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>IOU</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="normal">&#x03C1;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:msup><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>R</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>r</mml:mtext></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>R</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>r</mml:mtext></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mtext>R</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:msqrt></mml:math></disp-formula>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mtext>r</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:msqrt></mml:math></disp-formula></p>
<p>As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, &#x03C1;(p, g) is the Euclidean distance between the center points of the predicted box and the labeled box, p and g are the center points of the two boxes, and C is the diagonal length of the minimum rectangle enclosing the two boxes. w<sup>g</sup> and h<sup>g</sup> are the width and height of the labeled box, and w<sup>p</sup> and h<sup>p</sup> are the width and height of the predicted box. R is the diagonal length of the labeled box, and r is the diagonal length of the predicted box.</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>RIOU_Loss for bounding box regression</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-9.tif"/></fig>
<p>Adding constraints on the center point distance and the diagonal length avoids the problem that GIOU_Loss produces a larger enclosing box and a larger loss value when the two boxes are far apart, so the function converges faster. In addition, when one box contains the other, the value of C is unchanged, but the values of &#x03C1;(p, g) and r still change, which drives the predicted box toward the real box. Algorithm 1 shows the detailed steps of RIOU_Loss in our method:
</p>
<fig id="fig-12">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-12.tif"/>
</fig>
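<p>Algorithm 1 is available here only as an image. As a rough companion, the following sketch evaluates formulas (4)&#x2013;(6) directly. It is not a transcription of Algorithm 1: the box format (x1, y1, x2, y2), the function name, and the sample boxes are assumptions, and both boxes are assumed to have positive area.</p>

```python
import math


def riou_loss(pred, truth):
    """RIOU_Loss of formulas (4)-(6) for boxes given as (x1, y1, x2, y2)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    inter = area((max(pred[0], truth[0]), max(pred[1], truth[1]),
                  min(pred[2], truth[2]), min(pred[3], truth[3])))
    iou = inter / (area(pred) + area(truth) - inter)

    # rho^2(p, g): squared distance between the two centre points.
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (truth[0] + truth[2]) / 2, (truth[1] + truth[3]) / 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # C: diagonal length of the minimum enclosing rectangle.
    c = math.hypot(max(pred[2], truth[2]) - min(pred[0], truth[0]),
                   max(pred[3], truth[3]) - min(pred[1], truth[1]))

    # R and r: diagonal lengths of the labeled and predicted boxes.
    big_r = math.hypot(truth[2] - truth[0], truth[3] - truth[1])
    small_r = math.hypot(pred[2] - pred[0], pred[3] - pred[1])
    diag_gap = abs(big_r - small_r)

    return 1.0 - iou + rho2 / c ** 2 + diag_gap / (c + diag_gap)


truth = (0, 0, 10, 10)
centered = riou_loss((3, 3, 7, 7), truth)    # inner box, same centre
off_center = riou_loss((1, 1, 5, 5), truth)  # inner box, shifted centre
print(centered, off_center)
```

<p>Both predictions have the same IOU with the labeled box, yet the shifted one receives the larger loss because of the center distance term. This is the case in which GIOU_Loss cannot distinguish the two predictions.</p>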
</sec>
</sec>
<sec id="s4"><label>4</label><title>Evaluation Metrics</title>
<sec id="s4_1"><label>4.1</label><title>Experimental Datasets</title>
<p>To test the generalization ability of the model, two aerial small-target datasets, VEDAI and RSOD, are selected for this experiment, and the improved algorithm RT-YOLO is evaluated by quantitative comparison.</p>
<sec id="s4_1_1"><label>4.1.1</label><title>VEDAI Dataset</title>
<p>We adopt the public aerial image dataset VEDAI, proposed in 2014, which contains 1210 RGB images with a resolution of 1024&#x2009;&#x00D7;&#x2009;1024 (or 512&#x2009;&#x00D7;&#x2009;512) pixels. The dataset includes 3640 instances of vehicles, ships, and aircraft in 9 categories: &#x201C;car&#x201D;, &#x201C;truck&#x201D;, &#x201C;camping car&#x201D;, &#x201C;tractor&#x201D;, &#x201C;aircraft&#x201D;, &#x201C;ship&#x201D;, &#x201C;pick up&#x201D;, &#x201C;van&#x201D;, and &#x201C;others&#x201D;. The number of targets in each category is shown in <xref ref-type="table" rid="table-2">Table 2</xref>. The targets are distributed over rich backgrounds such as fields, grasslands, mountains, and urban areas. There are on average 5.5 vehicles per image, illumination strongly affects the images, and the orientation of the vehicle targets is random. These properties make VEDAI a challenging public dataset for small object detection.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Number of each target in the VEDAI dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Class</th>
<th align="left">Boat</th>
<th align="left">Camping</th>
<th align="left">Car</th>
<th align="left">Pickup</th>
<th align="left">Tractors</th>
<th align="left">Truck</th>
<th align="left">Vans</th>
<th align="left">Airplane</th>
<th align="left">Others</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Number</td>
<td align="left">170</td>
<td align="left">390</td>
<td align="left">1340</td>
<td align="left">950</td>
<td align="left">190</td>
<td align="left">300</td>
<td align="left">100</td>
<td align="left">47</td>
<td align="left">200</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_1_2"><label>4.1.2</label><title>RSOD Dataset</title>
<p>The RSOD dataset was released by Wuhan University in 2015 and is mainly used for object detection. It contains 976 remote-sensing images whose targets vary in scale, orientation, and condition. The dataset is annotated with the locations of 6950 targets in 4 categories: aircraft, oil tank, overpass, and playground. Before the experiment, we removed 40 images of the playground category from the RSOD dataset. The remaining 936 images were divided into training, validation, and test sets at a ratio of 2:1:1, containing 468, 234, and 234 small-target images, respectively. <xref ref-type="table" rid="table-3">Table 3</xref> lists the number of targets of each category in the training, validation, and test sets.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Number of targets in the RSOD dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Class</th>
<th align="left">Train set</th>
<th align="left">Validation set</th>
<th align="left">Test set</th>
<th align="left">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Aircraft</td>
<td align="left">2145</td>
<td align="left">1248</td>
<td align="left">1100</td>
<td align="left">4993</td>
</tr>
<tr>
<td align="left">Oil tank</td>
<td align="left">834</td>
<td align="left">362</td>
<td align="left">390</td>
<td align="left">1586</td>
</tr>
<tr>
<td align="left">Overpass</td>
<td align="left">88</td>
<td align="left">46</td>
<td align="left">46</td>
<td align="left">180</td>
</tr>
<tr>
<td align="left">Playground</td>
<td align="left">97</td>
<td align="left">47</td>
<td align="left">47</td>
<td align="left">191</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s4_2"><label>4.2</label><title>Experiment Environment</title>
<p>We carry out experiments on two aerial small-target datasets, VEDAI and RSOD, and compare our method with other state-of-the-art object detection algorithms. The experimental environment is as follows: Python 3.8, PyTorch 1.7.0, and CUDA 11.0 on the Ubuntu operating system, with an i7-7700K CPU and an NVIDIA GeForce RTX 3080 graphics card. The model is pre-trained on the COCO dataset, which is commonly used in object detection tasks. For training, the input image size is 640&#x2009;&#x00D7;&#x2009;640, the initial learning rate is 0.001, and the batch size is 32. The optimizer is Adam, and training runs for 300 epochs. The initialization parameters are listed in <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
<table-wrap id="table-4"><label>Table 4</label><caption><title>The initialization parameters of training</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Input size</th>
<th align="left">Batch size</th>
<th align="left">Momentum</th>
<th align="left">Learning rate</th>
<th align="left">Epoch</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">32</td>
<td align="left">0.9</td>
<td align="left">0.001&#x2013;0.00001</td>
<td align="left">300</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3"><label>4.3</label><title>Evaluation Index</title>
<p>Two types of performance indicators are widely accepted for object detection algorithms: those that evaluate detection accuracy and those that assess detection speed. Precision (P), recall (R), average precision (AP), mean average precision (mAP), and related metrics evaluate the algorithm&#x2019;s detection ability, while detection speed is measured primarily in frames per second (FPS). This paper uses these standard indicators.</p>
<p>For a detection algorithm that can identify C classes of objects, suppose that <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> image frames are detected as containing the i-th class during the detection task, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:mi mathvariant="normal">i</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The N frames with the highest confidence are selected (<inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mrow><mml:mtext>N</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>M</mml:mtext></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>), and for each of them the intersection over union between the bounding box predicted by the algorithm and the actual area of the target is calculated. Frames whose intersection over union is greater than a given threshold are counted as true positives (TP), that is, accurate predictions.
Their number is denoted by <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, the number of false positive (FP) frames with a wrong prediction is denoted by <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the number of false negatives (FN), that is, frames that contain the i-th class but are missed. Precision (P) is defined as the proportion of true positives among the first N frames detected as class i:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mi>N</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Recall (R) is the ratio of true positives to the total number of image frames that contain the i-th class. Assuming that K image frames contain objects of the i-th class, recall is calculated as follows:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mi>K</mml:mi></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
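<p>As a small worked example of formulas (7) and (8), the sketch below computes precision and recall for one class from its counts. Here K is written as the number of true positives plus the number of missed objects of that class. The function name and the counts are hypothetical.</p>

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of one class from its detection counts."""
    precision = tp / (tp + fp)   # formula (7): TP / N with N = TP + FP
    recall = tp / (tp + fn)      # formula (8): TP / K with K = TP + missed
    return precision, recall


# Hypothetical counts: 100 detections kept, 80 correct, 120 labeled objects.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)
```

<p>With these counts, precision is 80 out of 100 kept detections and recall is 80 out of 120 labeled objects, which shows that the two metrics can differ substantially for the same detector.</p>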
<p>Average precision (AP) is also often used to quantitatively assess the performance of detection algorithms. Intuitively, AP is the area under the P-R curve. Generally speaking, the better the classifier, the higher the AP value. The most important indicator for object detection algorithms, mAP, is obtained by averaging the AP of all categories. The value of mAP lies in the interval [0, 1], and the higher it is, the better the overall accuracy of the algorithm. AP and mAP are calculated as follows:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mtext>AP</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>R</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>d</mml:mi><mml:mi>R</mml:mi></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mtext>mAP</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>AP</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>C</mml:mi></mml:math></disp-formula>where AP is the average precision of a single target category, and C represents the number of categories.</p>
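<p>Formulas (9) and (10) can be approximated numerically from sampled points of the P-R curve. The sketch below uses the trapezoidal rule, which is an assumption made here for simplicity; common evaluation toolkits instead use interpolated precision, so their values can differ. The function names and the sample points are hypothetical.</p>

```python
def average_precision(recalls, precisions):
    """Area under a P-R curve by the trapezoidal rule; points must be sorted
    by increasing recall."""
    ap = 0.0
    for k in range(1, len(recalls)):
        width = recalls[k] - recalls[k - 1]
        ap += width * (precisions[k] + precisions[k - 1]) / 2
    return ap


def mean_average_precision(ap_per_class):
    """Formula (10): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical P-R points for two classes.
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
ap_b = average_precision([0.0, 1.0], [1.0, 0.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))  # 0.875 0.5 0.6875
```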
<p>Detection speed is another important indicator for evaluating an algorithm. For time-sensitive systems that require real-time detection, it is critical that the algorithm can process enough video frames or images in time. Frames per second (FPS), the number of frames on which a detection system completes the object detection task per second, is the standard metric for the algorithm&#x2019;s detection speed.</p>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Results and Discussion</title>
<p>We analyze the backbone network replacement, the ablation of each method, and the comparison with different models to demonstrate the effectiveness of the improved methods.</p>
<sec id="s5_1"><label>5.1</label><title>Experiment Comparison of the Improved Backbone</title>
<p>The original YOLOv5 algorithm often produces missed and false detections on UAV aerial image datasets. This may be because the backbone network extracts the features of small targets poorly and the detection scales fail to match the sizes of the small targets in the image. On the one hand, the original CSP structure exploits features in a layer-wise multi-scale manner and does not fully use the internal features of a single layer. We design Res2T to enhance the fine-grained use of features within a single layer. On the other hand, the original CSP structure treats the features of every channel equally. We add triplet attention modules to establish relationships across dimensions and improve the quality of spatial and channel information. To limit the computational load of the algorithm, we remove the convolutional layers on the short branches of the original module. To verify the effectiveness of the modified backbone network, we compare it on the VEDAI dataset with Darknet53, CSPDarknet53, ResNet50, ResNet101, VGG16, and RepVGG. The experimental results are given in <xref ref-type="table" rid="table-5">Table 5</xref>.</p>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Comparison experiment of the backbone network</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="left">Backbone</th>
<th align="left">Input size</th>
<th align="left">mAP (&#x0025;)</th>
<th align="left">FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">YOLOv5</td>
<td align="left">Darknet53</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">48.6</td>
<td align="left">29.5</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">CSPDarknet53</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">57.2</td>
<td align="left">31.2</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">ResNet50</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">41.5</td>
<td align="left">31.1</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">ResNet101</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">43.4</td>
<td align="left">19.8</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">VGG16</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">50.8</td>
<td align="left">16.8</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">RepVGG</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">54.5</td>
<td align="left">49.2</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">Ours</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">59.1</td>
<td align="left">30.3</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>With the same input size, the mAP of our method on the test set exceeds that of the original CSPDarknet53 network, is 10.5 percentage points higher than that of Darknet53, and is 17.6 and 15.7 percentage points higher than those of the two residual networks ResNet50 and ResNet101, respectively. Compared with the recently proposed RepVGG, our method also achieves a better detection effect. The experimental results show that the improved backbone network is effective: strengthening the use of internal features and the connections between channels improves the extraction of small-target features, improves the detection performance of the algorithm for small targets, and reduces false detections of small targets in UAV aerial image tasks.</p>
</sec>
<sec id="s5_2"><label>5.2</label><title>Ablation Experiment</title>
<p>To verify the impact of each improvement on the detection performance of the YOLOv5 algorithm, we perform ablation experiments on the VEDAI dataset. The results are shown in <xref ref-type="table" rid="table-6">Table 6</xref>, where B represents the improved backbone network, N represents the modified SPP pooling in the neck network, and L represents the improved loss function.</p>
<table-wrap id="table-6"><label>Table 6</label><caption><title>Ablation experiment of the improved module</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="left">B</th>
<th align="left">N</th>
<th align="left">L</th>
<th align="left">Input size</th>
<th align="left">mAP (&#x0025;)</th>
<th align="left">FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">YOLOv5</td>
<td align="left"/>
<td align="left"/>
<td align="left"/>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">57.2</td>
<td align="left">31.2</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">&#x221A;</td>
<td align="left"/>
<td align="left"/>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">59.1</td>
<td align="left">29.7</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"/>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">60.5</td>
<td align="left">29.4</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">640&#x2009;&#x00D7;&#x2009;640</td>
<td align="left">60.8</td>
<td align="left">30.3</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>After the improved backbone is added, the mAP of the model increases from 57.2&#x0025; to 59.1&#x0025;. This demonstrates that the improved backbone network enhances the model&#x2019;s use of small-target features. In addition, using SPP structures with different pooling sizes in the neck network improves the fusion of receptive fields and raises the mAP on the test set from 59.1&#x0025; to 60.5&#x0025;. Finally, the improved RIOU loss function accelerates network convergence and improves the detection accuracy by a further 0.3 percentage points. The experimental results show that the improved RT-YOLO algorithm effectively improves the detection of tiny targets in UAV aerial images.</p>
</sec>
<sec id="s5_3"><label>5.3</label><title>Comparative Experiment of Different Models</title>
<p>To validate the improved method, we compare it with other state-of-the-art algorithms on the VEDAI and RSOD datasets.</p>
<sec id="s5_3_1"><label>5.3.1</label><title>VEDAI Dataset Comparison Experiments</title>
<p>In the experiment on the VEDAI dataset, the data are divided into a training set and a test set at a ratio of 4:1: 994 images are selected as training samples, and the remaining 248 images are used as test samples. We compare RT-YOLO with several mainstream object detection frameworks, and the experimental results are shown in <xref ref-type="table" rid="table-7">Table 7</xref>.</p>
<table-wrap id="table-7"><label>Table 7</label><caption><title>The VEDAI dataset experimental results</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="2">Model</th>
<th align="left" colspan="10">AP (&#x0025;)</th>
<th align="left" rowspan="2">FPS</th>
</tr>
<tr>
<th align="left">Boat</th>
<th align="left">Camping</th>
<th align="left">Car</th>
<th align="left">Pickup</th>
<th align="left">Tractor</th>
<th align="left">Truck</th>
<th align="left">Van</th>
<th align="left">Airplane</th>
<th align="left">Other</th>
<th align="left">mAP</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Faster RCNN</td>
<td align="left">56.6</td>
<td align="left">75.4</td>
<td align="left">73.9</td>
<td align="left">69.7</td>
<td align="left">73.4</td>
<td align="left">46.4</td>
<td align="left">83.9</td>
<td align="left">77.0</td>
<td align="left">43.1</td>
<td align="left">65.7</td>
<td align="left">6.3</td>
</tr>
<tr>
<td align="left">SSD</td>
<td align="left">31.6</td>
<td align="left">51.3</td>
<td align="left">67.6</td>
<td align="left">50.4</td>
<td align="left">43.5</td>
<td align="left">45.1</td>
<td align="left">42.7</td>
<td align="left">61.5</td>
<td align="left">22.3</td>
<td align="left">46.1</td>
<td align="left">32.0</td>
</tr>
<tr>
<td align="left">YOLOv3</td>
<td align="left">64.5</td>
<td align="left">54.7</td>
<td align="left">69.1</td>
<td align="left">52.8</td>
<td align="left">52.5</td>
<td align="left">32.8</td>
<td align="left">51.3</td>
<td align="left">51.6</td>
<td align="left">31.4</td>
<td align="left">51.3</td>
<td align="left">29.5</td>
</tr>
<tr>
<td align="left">YOLOv4</td>
<td align="left">77.1</td>
<td align="left">71.9</td>
<td align="left">86.7</td>
<td align="left">75.2</td>
<td align="left">72.7</td>
<td align="left">68.1</td>
<td align="left">52.0</td>
<td align="left">81.0</td>
<td align="left">43.0</td>
<td align="left">72.5</td>
<td align="left">25.7</td>
</tr>
<tr>
<td align="left">YOLOv5s</td>
<td align="left">76.9</td>
<td align="left">57.9</td>
<td align="left">72.1</td>
<td align="left">56.8</td>
<td align="left">48.2</td>
<td align="left">42.5</td>
<td align="left">51.3</td>
<td align="left">51.2</td>
<td align="left">33.8</td>
<td align="left">57.2</td>
<td align="left">31.2</td>
</tr>
<tr>
<td align="left">RT-YOLO</td>
<td align="left">76.6</td>
<td align="left">54.2</td>
<td align="left">75.2</td>
<td align="left">61.2</td>
<td align="left">52.1</td>
<td align="left">48.2</td>
<td align="left">60.6</td>
<td align="left">63.5</td>
<td align="left">44.1</td>
<td align="left">60.8</td>
<td align="left">30.3</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-7">Table 7</xref> shows that, relative to the YOLOv5s baseline, RT-YOLO raises the mAP on the VEDAI dataset from 57.2&#x0025; to 60.8&#x0025;, which we attribute to the better backbone network and the higher-resolution input images. Detection of the Tractor, Van, Truck, Airplane, and Other categories improves markedly; for example, the AP of Airplane rises from 51.2&#x0025; to 63.5&#x0025;, a gain of 12.3 percentage points. This indicates that the improved algorithm substantially strengthens aerial image target detection. The accuracy for Boat and Camping drops slightly (from 76.9&#x0025; to 76.6&#x0025; and from 57.9&#x0025; to 54.2&#x0025;, respectively). We attribute this to Triple Attention: while it reduces the number of compute operations, it also weakens the expressive ability of the convolution to some extent, which lowers the accuracy for categories that are insensitive to changes in single-layer features. Overall, the comparison shows that the improved algorithm detects better than the YOLOv5s baseline. <xref ref-type="fig" rid="fig-10">Fig. 10</xref> compares the detection results on the VEDAI dataset and shows that RT-YOLO detects smaller objects more effectively.</p>
<fig id="fig-10"><label>Figure 10</label><caption><title>Detection results on the VEDAI dataset: (a) and (c) are the detection results of YOLOv5; (b) and (d) are the detection results of RT-YOLO</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-10.tif"/></fig>
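<p>The per-category comparison between YOLOv5s and RT-YOLO can be reproduced directly from the two corresponding rows of <xref ref-type="table" rid="table-7">Table 7</xref>. The sketch below copies those AP values and reports each difference in percentage points; the helper name <monospace>ap_gains</monospace> is hypothetical and introduced only for illustration.</p>
<code language="python">
```python
# Per-category AP values (percent) copied from the YOLOv5s and RT-YOLO rows of Table 7.
CLASSES = ["Boat", "Camping", "Car", "Pickup", "Tractor", "Truck", "Van", "Airplane", "Other"]
YOLOV5S = [76.9, 57.9, 72.1, 56.8, 48.2, 42.5, 51.3, 51.2, 33.8]
RT_YOLO = [76.6, 54.2, 75.2, 61.2, 52.1, 48.2, 60.6, 63.5, 44.1]


def ap_gains(baseline, improved, names):
    """Return the AP change per category, in percentage points."""
    return {n: round(i - b, 1) for n, b, i in zip(names, baseline, improved)}


gains = ap_gains(YOLOV5S, RT_YOLO, CLASSES)
print(gains["Airplane"])  # 12.3

# List the categories from the largest gain to the largest loss.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:9s} {gain:+.1f}")
```
</code>
<p>Reporting the change as a difference of two percentages (percentage points) avoids confusing it with a relative improvement.</p>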
</sec>
<sec id="s5_3_2"><label>5.3.2</label><title>RSOD Dataset Comparison Experiments</title>
<p><xref ref-type="table" rid="table-8">Table 8</xref> lists the object detection accuracy of the improved algorithm on the RSOD dataset. In terms of overall mAP, our method is higher than all of the other algorithms except YOLOv4, whose mAP is comparable (95.2&#x0025; versus 95.4&#x0025;). In terms of single-category accuracy, our method outperforms the original YOLOv5s algorithm on Airplane, Oil tank, and Overpass. The comparison indicates that our method performs better in small object detection.</p>
<table-wrap id="table-8"><label>Table 8</label><caption><title>The comparative results of different categories in the RSOD dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="2">Model</th>
<th align="left" colspan="5">AP (&#x0025;)</th>
<th align="left" rowspan="2">FPS</th>
</tr>
<tr>
<th align="left">Airplane</th>
<th align="left">Oil tank</th>
<th align="left">Overpass</th>
<th align="left">Playground</th>
<th align="left">mAP</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Faster RCNN</td>
<td align="left">83.5</td>
<td align="left">98.1</td>
<td align="left">88.6</td>
<td align="left">97.8</td>
<td align="left">92.0</td>
<td align="left">9.6</td>
</tr>
<tr>
<td align="left">SSD</td>
<td align="left">71.8</td>
<td align="left">90.7</td>
<td align="left">90.2</td>
<td align="left">98.5</td>
<td align="left">87.8</td>
<td align="left">42.2</td>
</tr>
<tr>
<td align="left">YOLOv3</td>
<td align="left">89.7</td>
<td align="left">96.5</td>
<td align="left">80.9</td>
<td align="left">96.8</td>
<td align="left">91.6</td>
<td align="left">29.7</td>
</tr>
<tr>
<td align="left">YOLOv4</td>
<td align="left">92.3</td>
<td align="left">98.9</td>
<td align="left">86.9</td>
<td align="left">99.5</td>
<td align="left">95.2</td>
<td align="left">28.2</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">93.6</td>
<td align="left">98.5</td>
<td align="left">83.8</td>
<td align="left">98.7</td>
<td align="left">93.7</td>
<td align="left">52.6</td>
</tr>
<tr>
<td align="left">RT-YOLO</td>
<td align="left">94.5</td>
<td align="left">99.4</td>
<td align="left">86.3</td>
<td align="left">97.1</td>
<td align="left">95.4</td>
<td align="left">49.4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-11">Fig. 11</xref> shows the detection results of our method on the RSOD dataset. Comparing each original image with the corresponding detection result shows that our method detects the targets well.</p>
<fig id="fig-11"><label>Figure 11</label><caption><title>Detection results on the RSOD dataset: (a), (c), (e), and (g) are the original images of the RSOD dataset; (b), (d), (f), and (h) are the detection results of RT-YOLO</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_34876-fig-11.tif"/></fig>
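<p>The trade-off between accuracy and speed against YOLOv5 in <xref ref-type="table" rid="table-8">Table 8</xref> can be made concrete with a little arithmetic. The sketch below assumes that the reported FPS is the reciprocal of the average per-frame processing time, which the paper does not state explicitly; the helper name <monospace>latency_ms</monospace> is hypothetical.</p>
<code language="python">
```python
# (mAP in percent, frames per second), copied from Table 8.
RSOD = {"YOLOv5": (93.7, 52.6), "RT-YOLO": (95.4, 49.4)}


def latency_ms(fps):
    """Average time spent on one frame, in milliseconds (assumes FPS = 1 / frame time)."""
    return 1000.0 / fps


base_map, base_fps = RSOD["YOLOv5"]
new_map, new_fps = RSOD["RT-YOLO"]

map_gain = round(new_map - base_map, 1)                # change in percentage points
extra_ms = latency_ms(new_fps) - latency_ms(base_fps)  # added time per frame

print(f"mAP gain: {map_gain:+.1f} points, extra latency: {extra_ms:.2f} ms per frame")
```
</code>
<p>Under this assumption, the 1.7-point mAP gain on RSOD costs roughly 1.2 ms of additional processing time per frame.</p>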
</sec>
</sec>
</sec>
<sec id="s6"><label>6</label><title>Conclusion and Prospect</title>
<p>To improve the detection accuracy of small objects in aerial images, this paper proposes a new small object detection algorithm, RT-YOLO. We design a new feature extraction structure, CSP_RT, which integrates the triplet attention mechanism to improve the backbone network. To improve the sensitivity of RT-YOLO to small targets, we design new spatial pyramid pooling modules, SPP1 and SPP2, which optimize receptive field fusion. Considering the overlap area, center point distance, and aspect ratio, we propose the RIOU_Loss loss function. Using the VEDAI dataset, we study and verify that the attention mechanism improves the performance of the small object detection algorithm, and we find that adding the SPP module to the network neck benefits the extraction of small-target feature information. Experiments show that RT-YOLO effectively improves the detection accuracy of small objects in aerial images: compared with YOLOv5s, the mAP@0.5 increases by 3.6 percentage points on the VEDAI test set and by 1.7 percentage points on the RSOD dataset.</p>
<p>However, integrating the attention module and adding the improved SPP modules increases the number of parameters and floating-point operations, which reduces the real-time performance of detection. In future work, we will consider compressing and pruning the model to make the network lighter and to improve real-time performance while maintaining detection accuracy.</p>
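<p>The three geometric factors named above for RIOU_Loss (overlap area, center point distance, and aspect ratio) are the same factors combined by the widely used CIoU loss. The sketch below implements plain CIoU for a single pair of axis-aligned boxes, purely to illustrate how such terms can be combined; it is not the RIOU_Loss defined in this paper, and the function name <monospace>ciou_loss</monospace> and the <monospace>(x1, y1, x2, y2)</monospace> box layout are assumptions made for the example.</p>
<code language="python">
```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """Standard CIoU loss for two boxes given as (x1, y1, x2, y2). Not RIOU_Loss."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap area term: plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union

    # Center distance term: squared distance between the box centers,
    # normalised by the squared diagonal of the smallest enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    diag2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0

    # Aspect ratio term: penalises a mismatch between width/height ratios.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / diag2 + alpha * v


# Two disjoint unit squares: IoU is 0, so only the distance term adds to 1.
print(round(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)), 3))  # 1.444
```
</code>
<p>For identical boxes all three penalty terms vanish and the loss is approximately zero, which is the behaviour expected of any loss built from these factors.</p>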
</sec>
</body>
<back>
<ack>
<p>The authors gratefully acknowledge the support of the Scientific Research Project of the Hunan Provincial Department of Education and the Science and Technology Plan Project of Hunan Province.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work was supported in part by the <funding-source>Scientific Research Project of Hunan Provincial Department of Education</funding-source> under Grants <award-id>18A332</award-id> and <award-id>19A066</award-id>, authors HW.D and Z.C, <ext-link ext-link-type="uri" xlink:href="http://kxjsc.gov.hnedu.cn/">http://kxjsc.gov.hnedu.cn/</ext-link>; and in part by the <funding-source>Science and Technology Plan Project of Hunan Province</funding-source> under Grant <award-id>2016TP1020</award-id>, author HW.D, <ext-link ext-link-type="uri" xlink:href="http://kjt.hunan.gov.cn/">http://kjt.hunan.gov.cn/</ext-link>.</p></sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p></sec>
<ref-list content-type="authoryear"><title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Hu</surname></string-name> and <string-name><given-names>G. H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>Image-based geo-localization using satellite imagery</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>128</volume>, no. <issue>5</issue>, pp. <fpage>1205</fpage>&#x2013;<lpage>1219</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Tan</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Small sample classification of hyperspectral image using model-agnostic meta-learning algorithm and convolutional neural network</article-title>,&#x201D; <source>International Journal of Remote Sensing</source>, vol. <volume>42</volume>, no. <issue>8</issue>, pp. <fpage>3090</fpage>&#x2013;<lpage>3122</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Darrell</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Malik</surname></string-name></person-group>, &#x201C;<article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>,&#x201D; in <conf-name>2014 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Columbus, OH, USA</conf-loc>, pp. <fpage>580</fpage>&#x2013;<lpage>587</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name></person-group>, &#x201C;<article-title>Fast R-CNN</article-title>,&#x201D; in <conf-name>2015 IEEE Int. Conf. on Computer Vision (ICCV)</conf-name>, <conf-loc>Santiago, Chile</conf-loc>, pp. <fpage>1440</fpage>&#x2013;<lpage>1448</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>39</volume>, no. <issue>6</issue>, pp. <fpage>1137</fpage>&#x2013;<lpage>1149</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Gkioxari</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Doll&#x00E1;r</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name></person-group>, &#x201C;<article-title>Mask R-CNN</article-title>,&#x201D; in <conf-name>2017 IEEE Int. Conf. on Computer Vision (ICCV)</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>2961</fpage>&#x2013;<lpage>2969</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Divvala</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name></person-group>, &#x201C;<article-title>You only look once: Unified, real-time object detection</article-title>,&#x201D; in <conf-name>2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, pp. <fpage>779</fpage>&#x2013;<lpage>788</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name></person-group>, &#x201C;<article-title>YOLO9000: Better, faster, stronger</article-title>,&#x201D; in <conf-name>2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>7263</fpage>&#x2013;<lpage>7271</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Chai</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Jin</surname></string-name></person-group>, &#x201C;<article-title>Vehicle detection in UAV aerial images based on improved YOLOv3</article-title>,&#x201D; in <conf-name>2020 IEEE Int. Conf. on Networking, Sensing and Control (ICNSC)</conf-name>, <conf-loc>Nanjing, China</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. H.</given-names> <surname>Sejr</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Schneiderkamp</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Ayoub</surname></string-name></person-group>, &#x201C;<article-title>Surrogate object detection explainer (SODEx) with YOLOv4 and LIME</article-title>,&#x201D; <source>Machine Learning and Knowledge Extraction</source>, vol. <volume>3</volume>, no. <issue>3</issue>, pp. <fpage>662</fpage>&#x2013;<lpage>671</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Yuan</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>DP-YOLOv5: Computer vision-based risk behavior detection in power grids</article-title>,&#x201D; in <conf-name>2021 Chinese Conf. on Pattern Recognition and Computer Vision (PRCV)</conf-name>, <conf-loc>Zhuhai, Guangdong, China</conf-loc>, pp. <fpage>318</fpage>&#x2013;<lpage>328</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Anguelov</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Reed</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>SSD: Single shot multibox detector</article-title>,&#x201D; in <conf-name>2016 European Conf. on Computer Vision (ECCV)</conf-name>, <conf-loc>Amsterdam, Netherlands</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>37</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>Y. J.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>L. J.</given-names> <surname>Xu</surname></string-name></person-group>, &#x201C;<article-title>A real-time recognition method of static gesture based on DSSD</article-title>,&#x201D; <source>Multimedia Tools and Applications</source>, vol. <volume>79</volume>, no. <issue>4</issue>, pp. <fpage>17445</fpage>&#x2013;<lpage>17461</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Bera</surname></string-name>, <string-name><given-names>V. K.</given-names> <surname>Shrivastava</surname></string-name> and <string-name><given-names>S. C.</given-names> <surname>Satapathy</surname></string-name></person-group>, &#x201C;<article-title>Advances in hyperspectral image classification based on convolutional neural networks: A review</article-title>,&#x201D; <source>Computer Modeling in Engineering &#x0026; Sciences</source>, vol. <volume>133</volume>, no. <issue>2</issue>, pp. <fpage>219</fpage>&#x2013;<lpage>250</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L. W.</given-names> <surname>Sommer</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Schuchert</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Beyerer</surname></string-name></person-group>, &#x201C;<article-title>Fast deep vehicle detection in aerial images</article-title>,&#x201D; in <conf-name>2017 IEEE Winter Conf. on Applications of Computer Vision (WACV)</conf-name>, <conf-loc>Santa Rosa, CA, USA</conf-loc>, pp. <fpage>311</fpage>&#x2013;<lpage>319</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Gong</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Ye</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Han</surname></string-name></person-group>, &#x201C;<article-title>Scale match for tiny person detection</article-title>,&#x201D; in <conf-name>2020 IEEE Winter Conf. on Applications of Computer Vision (WACV)</conf-name>, <conf-loc>Santa Rosa, CA, USA</conf-loc>, pp. <fpage>1246</fpage>&#x2013;<lpage>1254</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Guo</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>SCRDet: Towards more robust detection for small, cluttered and rotated objects</article-title>,&#x201D; in <conf-name>2019 IEEE Int. Conf. on Computer Vision (ICCV)</conf-name>, <conf-loc>Seoul, Korea (South)</conf-loc>, pp. <fpage>8232</fpage>&#x2013;<lpage>8241</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Ibrahim</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mirjalili</surname></string-name>, <string-name><given-names>M.</given-names> <surname>El-Said</surname></string-name>, <string-name><given-names>S. S. M.</given-names> <surname>Ghoneim</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Al-Harthi</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Wind speed ensemble forecasting based on deep learning using adaptive dynamic optimization algorithm</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>9</volume>, pp. <fpage>125787</fpage>&#x2013;<lpage>125804</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Rao</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Mu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>B-PesNet: Smoothly propagating semantics for robust and reliable multi-scale object detection for secure systems</article-title>,&#x201D; <source>Computer Modeling in Engineering &#x0026; Sciences</source>, vol. <volume>132</volume>, no. <issue>3</issue>, pp. <fpage>1039</fpage>&#x2013;<lpage>1054</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhan</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>A semi-supervised attention model for identifying authentic sneakers</article-title>,&#x201D; <source>Big Data Mining and Analytics</source>, vol. <volume>3</volume>, no. <issue>1</issue>, pp. <fpage>29</fpage>&#x2013;<lpage>40</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Singh</surname></string-name> and <string-name><given-names>L. S.</given-names> <surname>Davis</surname></string-name></person-group>, &#x201C;<article-title>An analysis of scale invariance in object detection SNIP</article-title>,&#x201D; in <conf-name>2018 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, pp. <fpage>3578</fpage>&#x2013;<lpage>3587</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yuan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Lite-HRNet: A lightweight high-resolution network</article-title>,&#x201D; in <conf-name>2021 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Nashville, TN, USA</conf-loc>, pp. <fpage>10440</fpage>&#x2013;<lpage>10450</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Xuan</surname></string-name></person-group>, &#x201C;<article-title>Event temporal relation extraction with attention mechanism and graph neural network</article-title>,&#x201D; <source>Tsinghua Science and Technology</source>, vol. <volume>27</volume>, no. <issue>1</issue>, pp. <fpage>79</fpage>&#x2013;<lpage>90</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Pan</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Tan</surname></string-name></person-group>, &#x201C;<article-title>Safety helmet wearing detection in aerial images using improved YOLOv4</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>72</volume>, no. <issue>2</issue>, pp. <fpage>3159</fpage>&#x2013;<lpage>3174</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Yin</surname></string-name></person-group>, &#x201C;<article-title>Multi-scale symbolic Lempel-Ziv: An effective feature extraction approach for fault diagnosis of railway vehicle systems</article-title>,&#x201D; <source>IEEE Transactions on Industrial Informatics</source>, vol. <volume>17</volume>, no. <issue>1</issue>, pp. <fpage>199</fpage>&#x2013;<lpage>208</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yi</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Mi</surname></string-name> and <string-name><given-names>W. H.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Aerial visual perception in smart farming: Field study of wheat yellow rust monitoring</article-title>,&#x201D; <source>IEEE Transactions on Industrial Informatics</source>, vol. <volume>17</volume>, no. <issue>3</issue>, pp. <fpage>2242</fpage>&#x2013;<lpage>2249</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mirjalili</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Khodadadi</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>An effective multi-objective artificial hummingbird algorithm with dynamic elimination-based crowding distance for solving engineering design problems</article-title>,&#x201D; <source>Computer Methods in Applied Mechanics and Engineering</source>, vol. <volume>398</volume>, no. <issue>15</issue>, pp. <fpage>115</fpage>&#x2013;<lpage>223</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Feng</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Perceptual generative adversarial networks for small object detection</article-title>,&#x201D; in <conf-name>2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>1951</fpage>&#x2013;<lpage>1959</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Ding</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Ghanem</surname></string-name></person-group>, &#x201C;<article-title>SOD-MTGAN: Small object detection via multi-task generative adversarial network</article-title>,&#x201D; in <conf-name>2018 Proc. of the European Conf. on Computer Vision (ECCV)</conf-name>, <conf-loc>Munich, Germany</conf-loc>, pp. <fpage>206</fpage>&#x2013;<lpage>221</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. K.</given-names> <surname>Das</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Shit</surname></string-name>, <string-name><given-names>D. N.</given-names> <surname>Ray</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Majumder</surname></string-name></person-group>, &#x201C;<article-title>CGAN: Closure-guided attention network for salient object detection</article-title>,&#x201D; <source>The Visual Computer</source>, vol. <volume>38</volume>, no. <issue>11</issue>, pp. <fpage>3803</fpage>&#x2013;<lpage>3817</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yi</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zeng</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Feng</surname></string-name></person-group>, &#x201C;<article-title>MobileNet-YOLO based wildlife detection model: A case study in Yunnan Tongbiguan Nature Reserve, China</article-title>,&#x201D; <source>Journal of Intelligent &#x0026; Fuzzy Systems</source>, vol. <volume>41</volume>, no. <issue>1</issue>, pp. <fpage>2171</fpage>&#x2013;<lpage>2181</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Badawi</surname></string-name> and <string-name><given-names>A. E.</given-names> <surname>Cetin</surname></string-name></person-group>, &#x201C;<article-title>Fourier domain pruning of MobileNet-v2 with application to video based wildfire detection</article-title>,&#x201D; in <conf-name>2020 25th Int. Conf. on Pattern Recognition (ICPR)</conf-name>, <conf-loc>Milan, Italy</conf-loc>, pp. <fpage>1015</fpage>&#x2013;<lpage>1022</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Bhaskara</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Levinshtein</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Tsogkas</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Jepson</surname></string-name></person-group>, &#x201C;<article-title>Efficient super-resolution using MobileNetV3</article-title>,&#x201D; in <conf-name>2020 European Conf. on Computer Vision (ECCV)</conf-name>, <conf-loc>Glasgow, UK</conf-loc>, pp. <fpage>87</fpage>&#x2013;<lpage>102</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>ShuffleNet: An extremely efficient convolutional neural network for mobile devices</article-title>,&#x201D; in <conf-name>2018 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, pp. <fpage>6848</fpage>&#x2013;<lpage>6856</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yuan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhong</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>An efficient semantic segmentation method using pyramid ShuffleNet V2 with vortex pooling</article-title>,&#x201D; in <conf-name>2019 31st IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI)</conf-name>, <conf-loc>Portland, OR, USA</conf-loc>, pp. <fpage>1214</fpage>&#x2013;<lpage>1220</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Razakarivony</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Jurie</surname></string-name></person-group>, &#x201C;<article-title>Vehicle detection in aerial imagery: A small target detection benchmark</article-title>,&#x201D; <source>Journal of Visual Communication and Image Representation</source>, vol. <volume>34</volume>, no. <issue>1</issue>, pp. <fpage>187</fpage>&#x2013;<lpage>203</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Long</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Gong</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Xiao</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Accurate object localization in remote sensing images based on convolutional neural networks</article-title>,&#x201D; <source>IEEE Transactions on Geoscience and Remote Sensing</source>, vol. <volume>55</volume>, no. <issue>5</issue>, pp. <fpage>2486</fpage>&#x2013;<lpage>2498</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Ye</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Enhancing geometric factors in model learning and inference for object detection and instance segmentation</article-title>,&#x201D; <source>IEEE Transactions on Cybernetics</source>, vol. <volume>52</volume>, no. <issue>8</issue>, pp. <fpage>8574</fpage>&#x2013;<lpage>8586</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>M. M.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Yang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Res2Net: A new multi-scale backbone architecture</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>43</volume>, no. <issue>2</issue>, pp. <fpage>652</fpage>&#x2013;<lpage>662</lpage>, <year>2021</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>