<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">44284</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.044284</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Interactive Transformer for Small Object Detection</article-title>
<alt-title alt-title-type="left-running-head">Interactive Transformer for Small Object Detection</alt-title>
<alt-title alt-title-type="right-running-head">Interactive Transformer for Small Object Detection</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Wei</surname><given-names>Jian</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wang</surname><given-names>Qinzhao</given-names></name><email>airy_snow@outlook.com</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Zixu</given-names></name></contrib>
<aff id="aff-1"><institution>Department of Weaponry and Control, Army Academy of Armored Forces</institution>, <addr-line>Beijing, 100071</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Qinzhao Wang. Email: <email>airy_snow@outlook.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>29</day><month>11</month><year>2023</year></pub-date>
<volume>77</volume>
<issue>2</issue>
<fpage>1699</fpage>
<lpage>1717</lpage>
<history>
<date date-type="received"><day>26</day><month>7</month><year>2023</year></date>
<date date-type="accepted"><day>15</day><month>9</month><year>2023</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Wei et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Wei et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_44284.pdf"></self-uri>
<abstract>
<p>The detection of large objects has achieved high accuracy, but small object detection does not enjoy similar success, owing to the low peak signal-to-noise ratio (PSNR), fewer distinguishing features, and the ease with which small objects are occluded by their surroundings. To address this problem, this paper proposes an attention mechanism based on cross-Key values. Building on the traditional transformer, we first improve feature processing with a convolution module, effectively preserving the local semantic context in the middle layer and significantly reducing the number of model parameters. Then, to enhance the effectiveness of the attention mask, two Key values are computed simultaneously along the Query and Value branches via dual-branch parallel processing, strengthening the way attention is acquired and improving the coupling of key information. Finally, focusing on the feature maps of different channels, the multi-head attention mechanism is applied to the channel attention mask to improve the utilization of middle-layer features. Comparisons on three small object datasets show that the plug-and-play interactive transformer (IT-transformer) module we designed effectively improves the detection results of the baseline.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Small object detection</kwd>
<kwd>attention</kwd>
<kwd>transformer</kwd>
<kwd>plug-and-play</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>Object detection models have achieved fruitful research results and are widely used in production, daily life, and other fields, significantly improving efficiency. However, these models still struggle with small object detection. As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, they produce more false and missed detections on small objects. There are three main reasons for this: first, small objects lack distinguishable, salient features; second, small objects are easily obliterated by the surrounding environment; third, in deep neural networks, pooling, normalization, label matching, and other modules attenuate the features of small objects layer by layer, leaving little relevant information at the detection head [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. The combined effect of these factors leads to poor results when traditional models are applied to small objects.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Small object detection. The traditional first-stage and second-stage object detection models cannot effectively deal with unfavorable factors such as object occlusion, environmental interference, and small object size, resulting in easy misdetection and missed detection. The improved model with the addition of the IT-transformer effectively overcomes these challenges</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-1.tif"/></fig>
<p>To solve this problem, models such as the multi-scale pyramid [<xref ref-type="bibr" rid="ref-3">3</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>] and the feature pyramid [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>] process object features at different scales, improving small object detection accuracy through hierarchical processing and late fusion. Another approach is to use larger feature maps [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>]; for example, [<xref ref-type="bibr" rid="ref-1">1</xref>] adds P2-layer features, which lose little feature information, to the neck module, effectively increasing the features available for small objects, although larger feature maps slow down inference. Focus [<xref ref-type="bibr" rid="ref-10">10</xref>] proposed a slicing method that retains as many small object features as possible without compressing the input image size. The you only look once (YOLO) models [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>] add data augmentation strategies such as mosaic, diversifying images over a wider range to increase the contribution of small objects to the training loss. In [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>], deformable convolution shifts the positions of the convolution kernel, guiding it to extract features from more accurate locations. Other studies add an attention mechanism [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-17">17</xref>]: an attention mask representing the importance of each region improves the model&#x2019;s focus on different regions during processing and effectively suppresses noise from irrelevant regions. At present, attention models represented by the transformer [<xref ref-type="bibr" rid="ref-18">18</xref>] shine in many image processing tasks [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-21">21</xref>] and have received increasing attention for their distinctive feature processing approach.</p>
<p>In summary, starting from the actual task requirements and building on the transformer attention mechanism, we propose the IT-transformer attention mechanism to fully construct the global and local semantic context of the image and thereby solve the small object detection problem. Specifically, the traditional transformer is built on fully connected layers, which leads to a heavy parameter count, extremely high hardware requirements, and, because of its serialized data processing, insufficient local semantic features. Second, in the multi-head attention mechanism, the query (Q), key (K), and value (V) are obtained separately; the poorly explored relationship between Q and K weakens the effectiveness of the attention mask. To solve these two problems, we design a plug-and-play interactive transformer module. In detail, building on previous research, we first replace the fully connected layer with a 2D convolution module: its weight sharing reduces the overall parameter count, making the model lightweight, while its local field of view improves the local context between associated regions. Then, to further enhance the feature representation ability of the middle layer and improve the accuracy of the attention mask, we propose a feature processing method based on cross-fused K: fusing the K values of the Q and V branches highlights the coupling relationships within the middle-layer features and improves the model&#x2019;s attention to detailed information. Finally, unlike the fully connected layer, which computes interactions between individual pixels, we focus on the features of different channels to maintain the consistency of the global spatial position relationships; applying channel-level multi-head attention to the middle-layer features effectively improves the feature representation of objects at every scale.</p>
<p>In summary, our main contributions are:</p>
<p>1. The object detection model based on the IT-transformer is proposed. From the perspective of improving the utilization efficiency of features in the middle layer, the dual-branch model is used to extract the key values of features and provide more effective comparison features for the attention module through cross-fusion. At the same time, to suppress the interference of noise channels, the multi-head attention mechanism is applied to the generation and optimization of channel attention masks, which significantly improves the differentiation of the characteristics of the middle layer.</p>
<p>2. A new small object detection dataset was collected and organized. In existing small object detection datasets, the object types are mostly common objects, and the acquisition angles and scenes are simple. At the same time, to expand the application of intelligent detection algorithms in the military field, we collected and organized an Armored Vehicle dataset with diverse viewing angles, variable distances, and complex scenes through web collection and unmanned aerial vehicle (UAV) photography, and conducted small object detection experiments on it.</p>
<p>3. Extensive experimental comparisons and self-ablation experiments were carried out to verify the effectiveness of the module. The results show that the proposed IT-transformer can realize plug-and-play in the first-stage and second-stage detection models, which can effectively improve the detection accuracy of the baseline model. In the three datasets of Armored Vehicle, Guangdong University of Technology-Hardhat Wearing Detection (GDUT-HWD), and Visdrone-2019, the mAP was improved by 2.7, 1.1, and 1.3 compared with the baseline, respectively.</p>
</sec>
<sec id="s2"><label>2</label><title>Structure</title>
<sec id="s2_1"><label>2.1</label><title>Object Detection</title>
<p>Object detection models based on deep learning have developed rapidly and fall into four main branches. The first comprises first-stage detection models such as YOLO [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>,<xref ref-type="bibr" rid="ref-23">23</xref>], the single shot multibox detector (SSD) [<xref ref-type="bibr" rid="ref-24">24</xref>], and Retina [<xref ref-type="bibr" rid="ref-25">25</xref>]; they integrate region of interest (ROI) generation with final result prediction and offer faster inference. The second comprises second-stage detection models, represented by the Faster region-based convolutional network method (Faster RCNN) [<xref ref-type="bibr" rid="ref-26">26</xref>] and Cascade RCNN [<xref ref-type="bibr" rid="ref-27">27</xref>]; their main feature is a separate module for more accurate ROI extraction, and the addition of ROI alignment significantly improves detection accuracy. The third comprises transformer-based detection models such as the vision transformer (ViT) [<xref ref-type="bibr" rid="ref-19">19</xref>], the detection transformer (DETR) [<xref ref-type="bibr" rid="ref-21">21</xref>], and DETR with improved denoising anchor box (DINO) [<xref ref-type="bibr" rid="ref-28">28</xref>]; by integrating transformers into object detection tasks, these models broke the previous dominance of convolutional modules in the image field, and with the transformer&#x2019;s unique attention mechanism their detection accuracy quickly caught up with a series of traditional state-of-the-art (SOTA) models. The fourth is the detection architecture based on the diffusion model [<xref ref-type="bibr" rid="ref-29">29</xref>&#x2013;<xref ref-type="bibr" rid="ref-31">31</xref>]: it treats object localization as an iterative diffusion process from a random noise vector to the ground truth and completes detection through cascade alignment. In this paper, we first take the second-stage detection model Cascade RCNN as the benchmark to make full use of its distributed model structure. To further improve performance, we also integrate the transformer attention mechanism, achieving an organic fusion of the two. Guided by the plug-and-play idea, we design an interactive attention module that adapts to existing first-stage and second-stage detection models and effectively improves the detection performance of the baseline model.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>Small Object Detection</title>
<p>Small object detection is an important part of computer vision. According to the definition of the COCO dataset, when an object is smaller than 32&#x2009;&#x00D7;&#x2009;32 pixels it provides very limited feature information, which increases detection difficulty. Four main ideas address this problem. The first is to increase the input image size [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>] so that the features remain relatively stable; however, an overly large input significantly slows inference, which is unsuitable for scenarios with strict real-time requirements. The second is the data augmentation strategy [<xref ref-type="bibr" rid="ref-32">32</xref>,<xref ref-type="bibr" rid="ref-33">33</xref>], represented by mosaic and the generative adversarial network (GAN). With mosaic in the data preprocessing stage, controllable parameter adjustment increases the proportion of small objects among all training instances, counteracting the parameter updates previously dominated by large objects; in [<xref ref-type="bibr" rid="ref-34">34</xref>], GAN-based synthesis and diversification of small objects increases the number of positive samples during training. The third is the multi-scale training and testing strategy [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>,<xref ref-type="bibr" rid="ref-36">36</xref>], which varies the input image size over a large range to improve the model&#x2019;s consistency in detecting objects at every scale. The fourth is to add an attention mechanism [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>], which improves the model&#x2019;s attention to specific regions and objects by additionally computing attention masks that indicate the importance of pixels. Starting from the perspective of improving the model&#x2019;s attention, this paper proposes an interactive attention mechanism: with the help of the IT-transformer, the model can effectively represent feature importance under a single-scale training strategy, thereby improving accuracy on small objects.</p>
</sec>
<sec id="s2_3"><label>2.3</label><title>Transformer</title>
<p>The transformer [<xref ref-type="bibr" rid="ref-18">18</xref>] was originally used to process serialized text data and made its mark in natural language processing (NLP). ViT [<xref ref-type="bibr" rid="ref-19">19</xref>] was the first to convert image data into serialized data composed of multiple pixel blocks and then perform image classification and detection in the transformer manner, opening the way for transformers to expand into the image field. Many excellent models have since emerged on the transformer architecture, such as DETR [<xref ref-type="bibr" rid="ref-21">21</xref>] and the Swin transformer [<xref ref-type="bibr" rid="ref-20">20</xref>]. The transformer&#x2019;s main feature is feature processing based on mutual attention between tokens, so that a single position covers global semantic information, which greatly improves inference accuracy. However, under a single-scale setting, although the transformer controls the parameter count by dividing the input into a specified number of tokens, it still produces significantly more parameters than a convolution module, and serializing the image breaks the semantic relationship between adjacent tokens. Experiments show that when the dataset is small, transformer-based models struggle to learn an effective interrelationship matrix, resulting in low performance. This paper uses the attention mechanism of the transformer and improves it with a cross-K value that fuses middle-layer features; furthermore, by integrating a convolution module, we strengthen the semantic correlation between tokens to improve performance on smaller datasets.</p>
</sec>
</sec>
<sec id="s3"><label>3</label><title>Method</title>
<p>In this section, we first briefly review the traditional transformer and then describe the structure and optimization indicators of the IT-transformer in detail.</p>
<sec id="s3_1"><label>3.1</label><title>Revisiting Transformer</title>
<p>The transformer is a deep neural network model based on an encoder and a decoder, and its core is the construction of the attention mechanism. Thanks to globally encoded tokens, the transformer uses fully connected modules to give each token a broad field of view and a full range of connections, which ensures strong performance in advanced visual tasks such as object detection and segmentation. The transformer attention mechanism is a matrix computation: Q, K, and V are calculated from the middle-layer features, and then Q is multiplied by the transpose of K to obtain the interrelationship matrix reflecting the importance of each token, that is, the attention mask. The structure of a traditional transformer is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>The traditional transformer</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-2.tif"/></fig>
<p>Suppose the input features are <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. In the traditional transformer calculation process, X is first flattened and normalized into a two-dimensional matrix (<inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msup><mml:mi>X</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), which is then multiplied with three weight matrices (<inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), representing fully connected operations, to obtain Q, K, and V, where Q is calculated by:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>X</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>K and V are calculated similarly. In particular, to keep the extracted features unbiased, the bias term is set to zero in these fully connected operations.</p>
<p>Then, multiplying Q by the transpose of K yields the correlation matrix between the two. Finally, the softmax activation function normalizes it to the range (0&#x2013;1), producing the spatial attention mask that reflects the importance of each token. The calculation process is:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Softmax</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>Q</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mi>K</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>Under multi-head attention, several such attention matrices are calculated at the same time and then integrated by splitting and concatenation. Finally, a skip connection performs a weighted fusion with the input features to obtain the feature map optimized by the attention mask, which is sent to the subsequent detection module. Its calculation formula is:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>X</mml:mi><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>X</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
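<p>As a concrete illustration of Eqs. (1)&#x2013;(3), the following is a minimal NumPy sketch of this attention computation on a toy feature map; the sizes, random weights, and the exact multiplication order applying the mask to X&#x2032; are illustrative assumptions chosen so the shapes are consistent, not the paper&#x2019;s configuration.</p>

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax, normalizing each row into (0, 1)
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy feature map X in R^{C x H x W} (the paper's setting uses C=1024, H=W=64)
C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((C, H, W))

# Flatten to X' in R^{C x HW}
Xf = X.reshape(C, H * W)

# Fully connected projections with zero bias, as in Eq. (1)
d = H * W
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
Q = Xf @ W_Q   # Eq. (1): Q = X' x W_Q
K = Xf @ W_K   # K is computed analogously

# Eq. (2): attention mask from the Q-K correlation matrix
attention = softmax(Q @ K.T)   # shape (C, C), each row sums to 1

# Eq. (3): skip connection, applying the mask to X' and reshaping back
X_out = X + (attention @ Xf).reshape(C, H, W)
```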
<p>In this process, since Q, K, and V are computed with fully connected layers, the parameter count is <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>H</mml:mi><mml:mi>W</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. These extra parameters reduce training efficiency, increase energy consumption, and cause other problems. Therefore, many works must control the overall parameter count of the model when designing and deploying transformer models.</p>
<p>In addition, it is worth noting that the calculation of Q, K, and V is the core of the transformer and directly determines the effectiveness of the attention mask. However, traditional transformers obtain them only through three separate fully connected layers. Since Q, K, and V are the basis for calculating the attention mask, their processing deserves deeper exploration to improve the mask&#x2019;s accuracy.</p>
</sec>
<sec id="s3_2"><label>3.2</label><title>IT-Transformer</title>
<p>The overall structure of the IT-transformer is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. Research shows that the transformer differs from traditional convolution-based models: lacking a local field of view, a transformer-based architecture cannot acquire local semantic context, which significantly hurts detection performance when the training dataset is small. In addition, as introduced earlier, transformers widely use fully connected layers to compute middle-layer features, resulting in a large parameter count. Referring to existing research that combines convolutions and transformers, and to balance the parameter count against the needs of the attention mechanism, we design convolution-based calculation methods for Q, K, and V. Through weight sharing, the convolution module can effectively exploit the local semantic context between adjacent pixels, that is, the local field of view of the convolution kernel, while also significantly reducing the parameters.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>The IT-transformer</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-3.tif"/></fig>
<p>Taking the calculation of Q as an example, with the traditional transformer middle-layer features denoted as X, and C, H, and W equal to 1024, 64, and 64, respectively, the parameter count is: <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:mi>P</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. In the IT-transformer, when a 3&#x2009;&#x00D7;&#x2009;3 convolution module is used to obtain Q, K, and V, the parameters are: <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mrow><mml:mi>P</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>m</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>I</mml:mi><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1024</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>9</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. 
It can be seen that through this lightweight design, the number of parameters of the IT-transformer module has nothing to do with the size of the middle layer features, and the number of parameters is reduced by a factor of <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mo stretchy="false">(</mml:mo><mml:mn>64</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>64</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>9</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> compared with the fully connected method in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The number of parameters is greatly compressed, which helps to improve the efficiency of model training and reduce the hardware requirements of the model.</p>
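<p>The two parameter counts above and their ratio can be verified with a few lines of arithmetic; this sketch simply reproduces the paper&#x2019;s expressions for the stated configuration.</p>

```python
# Parameter counts for producing Q in the Section 3.2 setting:
# C=1024, H=W=64, and a 3x3 convolution kernel (k*k = 9 weights per channel).
C, H, W, k = 1024, 64, 64, 3

fc_params = (C * H * W) ** 2      # fully connected projection: (CHW)^2
conv_params = (C * k * k) ** 2    # 3x3 convolution: (1024 x 9)^2

# The reduction factor equals (HW / 9)^2 and does not grow with the
# fully connected layer's dependence on feature-map size.
reduction = fc_params / conv_params
print(f"FC: {fc_params:.2e}  conv: {conv_params:.2e}  reduction: {reduction:.0f}x")
```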
<p>At the same time, to further strengthen the connection between Q, K, and V, we calculate them synchronously. As can be seen from <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, Q and K<sub>1</sub> are produced by one convolution module and K<sub>2</sub> and V by another; channel splitting then yields Q, K, and V with tighter coupling.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>K</mml:mi><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>dim</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>K</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>dim</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>By setting up the dual branches, we obtain a K rooted in both Q and V; the extracted features <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> therefore contain more explicit cross-coupling information, which provides richer sampling information for the attention calculation. When calculating attention, following the unified interface of the transformer architecture, we obtain the crossed key representation, namely:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>I</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
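A minimal NumPy sketch of Eqs. (4)-(6) is given below. It uses a 1 &#x00D7; 1 channel projection as a stand-in for the paper's 3 &#x00D7; 3 convolution modules, since the chunk along the channel dimension and the additive fusion of K<sub>1</sub> and K<sub>2</sub> are the parts being illustrated; all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
X = rng.standard_normal((C, H, W))

# Stand-ins for Conv_KQ and Conv_KV: each maps C channels to 2C channels.
W_kq = rng.standard_normal((2 * C, C))
W_kv = rng.standard_normal((2 * C, C))

def project(weight, feat):
    # 1x1 channel projection (placeholder for the 3x3 convolutions).
    return np.einsum("oc,chw->ohw", weight, feat)

Q, K1 = np.split(project(W_kq, X), 2, axis=0)   # Eq. (4): chunk along channels
K2, V = np.split(project(W_kv, X), 2, axis=0)   # Eq. (5)
K_IT = K1 + K2                                  # Eq. (6): interactive key
```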
<p>We also use multi-head attention to carry out the analysis along multiple dimensions and make full use of the middle-layer characteristics, thereby improving the effectiveness of the attention mask. First, the contribution of different channel feature maps differs: some accurately capture the decisive features, while others may introduce noise. If every channel is given the same weight, the final judgment of the model will inevitably suffer. Therefore, given that the convolution modules have already extracted the local semantic context of the feature map, we focus on which channel features occupy a more important position in the multi-head attention. Thus, unlike the standard transformer module, which concentrates on spatial attention, the IT-transformer attends over channels. Under the multi-head attention mechanism, we divide Q, K, and V into subsets according to the number of heads <italic>N</italic>, where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mfrac><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>I</mml:mi><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mfrac><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mfrac><mml:mi>C</mml:mi><mml:mi>N</mml:mi></mml:mfrac><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p>The computational focus of the attention thus shifts to obtaining a channel-level attention mask, which is calculated as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>I</mml:mi><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>s</mml:mi><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>We obtain the mask reflecting the attention of each group of channels through parallel computation, and then concatenate the groups to obtain an attention mask covering all the middle-layer features, where <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Finally, the input of the module is added back through a skip connection, which further strengthens the middle-layer feature representation and yields the enhanced feature map.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>X</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
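One plausible NumPy reading of the channel-wise multi-head attention in Eqs. (7) and (8) is sketched below. The orientation of the matrix products, and applying the (C/N) &#x00D7; (C/N) channel mask to V by left-multiplication, are our assumptions, since the text leaves them implicit; Q, K<sub>IT</sub>, and V are taken as given from the previous step, with toy dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W, N = 8, 4, 4, 2          # channels, spatial size, number of heads
X = rng.standard_normal((C, H, W))
Q, K_IT, V = (rng.standard_normal((C, H, W)) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for q, k, v in zip(np.split(Q, N), np.split(K_IT, N), np.split(V, N)):
    q, k, v = (t.reshape(C // N, H * W) for t in (q, k, v))
    mask = softmax(k @ q.T)      # channel-level attention mask, (C/N, C/N)
    heads.append(mask @ v)       # re-weight the value channels
attn = np.concatenate(heads).reshape(C, H, W)   # Eq. (7): Cat over heads
X_out = X + attn                                # Eq. (8): skip connection
```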
</sec>
<sec id="s3_3"><label>3.3</label><title>Loss Function</title>
<p>The structure and working process of the IT-transformer have been detailed above. In small object detection tasks, to improve the overall detection accuracy of the model, we add the P2-level feature map following [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>] and detect small objects on the large-size feature map. Using Cascade RCNN as the baseline, we design an IT-transformer-enhanced small object detection model, the overall structure of which is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>The improved Cascade RCNN with IT-transformer</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-4.tif"/></fig>
<p>It can be seen that the IT-transformer can be plugged directly into the back end of the feature pyramid network (FPN), which also means that the IT-transformer can achieve a plug-and-play effect. In this regard, we conducted experimental verification in <xref ref-type="sec" rid="s4_6">Section 4.6</xref>, showing the wide utility and effectiveness of the IT-transformer.</p>
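The plug-and-play placement can be pictured as a thin wrapper around the neck: every FPN output level passes through the IT-transformer before reaching the detection heads. The sketch below uses placeholder functions (a pass-through `fpn` and an identity-plus-one `enhance`) purely to show the wiring, not the real mmdetection modules.

```python
def fpn(feats):
    # Placeholder FPN: returns the multi-level feature list unchanged.
    return feats

def enhance(level):
    # Placeholder for the IT-transformer applied to one pyramid level.
    return [x + 1 for x in level]

def neck(feats):
    # Plug-and-play: IT-transformer on the back end of every FPN level.
    return [enhance(level) for level in fpn(feats)]

levels = neck([[0, 1], [2, 3]])
```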
<p>As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, this paper selects Cascade RCNN [<xref ref-type="bibr" rid="ref-27">27</xref>] as the baseline model and builds the object detection model by inserting the IT-transformer into it, so its loss function consists mainly of two parts, calculated as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>RPN</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>ROI</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>Among them, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>RPN</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> covers the initial judgment of object presence and position on the feature map; it is composed of a binary classification loss <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>j</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and a location regression loss <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>loc</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, specifically:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>j</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>i</italic> represents the serial number of the anchor, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the probability that the <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>i</mml:mi></mml:math></inline-formula>-th anchor contains an object, <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi></mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> is the label assigned to the corresponding anchor (1 when it contains an object, otherwise 0), <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the total number of valid object boxes currently predicted by the model, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the coordinates of the object position predicted by the <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>i</mml:mi></mml:math></inline-formula>-th anchor; similarly, <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msubsup><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> are the ground-truth coordinates assigned to the <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>i</mml:mi></mml:math></inline-formula>-th anchor containing the object, and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is the adjustment coefficient of the loss, which is set to 1.0 by default according to mmdetection<xref ref-type="fn" rid="fn1"><sup>1</sup></xref><fn id="fn1"><label>1</label><p><ext-link ext-link-type="uri" xlink:href="https://github.com/open-mmlab/mmdetection">https://github.com/open-mmlab/mmdetection</ext-link></p></fn>.</p>
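Eq. (10) reduces to the familiar binary cross-entropy: the bracketed term equals p<sub>i</sub> when the label is 1 and 1 &#x2212; p<sub>i</sub> when the label is 0. A small NumPy check with made-up probabilities:

```python
import numpy as np

def rpn_objectness_loss(p, p_star, eps=1e-12):
    # Eq. (10): -log[p*p' + (1 - p)(1 - p')], where p' is the 0/1 label.
    inner = p * p_star + (1.0 - p) * (1.0 - p_star)
    return -np.log(inner + eps)

p = np.array([0.9, 0.2, 0.6])        # predicted objectness per anchor
p_star = np.array([1.0, 0.0, 1.0])   # assigned labels
losses = rpn_objectness_loss(p, p_star)
# Positive anchors are penalized by -log(p), negatives by -log(1 - p).
```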
<p>So far, we obtain a series of ROIs. Then, the <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mn>1</mml:mn></mml:math></inline-formula> loss is used to fine-tune the object location box, which is calculated as:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>loc</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>f</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the positional regression function, which is used to regress the candidate bounding box <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to the object bounding box <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
Because the regression is fine-tuned in a cascade, <italic>f</italic> is composed of staged progressive functions, specifically:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2218;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2218;</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>in this paper <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula>, i.e., the regressed position is fine-tuned over three cascaded stages.</p>
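The cascade in Eq. (13) is simply function composition: each stage re-regresses the box produced by the previous one. The toy sketch below uses a hypothetical stage that moves the box halfway toward a fixed target, only to show the T = 3 composition; the real stages are learned regressors with increasing IOU thresholds.

```python
TARGET = (10.0, 10.0, 50.0, 50.0)    # illustrative ground-truth box

def make_stage(alpha):
    # Hypothetical stage regressor: moves each coordinate a fraction
    # alpha of the remaining distance toward TARGET.
    def stage(box):
        return tuple(b + alpha * (t - b) for b, t in zip(box, TARGET))
    return stage

stages = [make_stage(0.5) for _ in range(3)]   # T = 3 cascaded stages

def f(box):
    # Eq. (13): f = f_T o ... o f_1, applied stage by stage.
    for stage in stages:
        box = stage(box)
    return box

refined = f((0.0, 0.0, 40.0, 40.0))  # each stage tightens the box
```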
<p>Further, in the <italic>t</italic>-th stage, the position loss based on <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is calculated as follows:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>reg</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi>I</mml:mi><mml:mi>O</mml:mi><mml:mi>U</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003E;</mml:mo><mml:mi>u</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the object class vector predicted by the anchor when the intersection over union (IOU) exceeds the threshold.</p>
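The IOU gate in Eq. (14) can be written directly. The helper below computes IOU for axis-aligned (x1, y1, x2, y2) boxes and zeroes the contribution of anchors at or below the stage threshold u; the function names are illustrative.

```python
def iou(box_a, box_b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def gated_contribution(c_i, box, gt, u):
    # Eq. (14): keep the anchor's class term c_i only when IOU(box, gt) > u.
    return c_i if iou(box, gt) > u else 0.0
```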
<p>Finally, we use the cross-entropy loss function <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>class</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> to calculate the category loss of the object, and then the total loss function is as follows:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>j</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mtext>class</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula></p>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experiments</title>
<p>For small object detection tasks, GDUT-HWD<xref ref-type="fn" rid="fn2"><sup>2</sup></xref><fn id="fn2"><label>2</label><p><ext-link ext-link-type="uri" xlink:href="https://github.com/wujixiu/helmet-detection/tree/master/hardhatwearing-detection">https://github.com/wujixiu/helmet-detection/tree/master/hardhatwearing-detection</ext-link></p></fn>, Visdrone-2019<xref ref-type="fn" rid="fn3"><sup>3</sup></xref><fn id="fn3"><label>3</label><p><ext-link ext-link-type="uri" xlink:href="https://github.com/VisDrone/VisDrone-Dataset">https://github.com/VisDrone/VisDrone-Dataset</ext-link></p></fn>, etc., are available public benchmark datasets. To fully verify the effectiveness of the IT-transformer, we compare 8 typical algorithms on these two datasets. In addition, we built our own dataset of ground objects in the military domain and conducted comparative experiments on it. The distribution of object scales in the three datasets is shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>; it can be seen that the Armored Vehicle dataset we collected and curated has instance distribution characteristics similar to the other two, all being dominated by small and medium-sized objects, which makes detection highly difficult.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>The distribution of three used datasets</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-5.tif"/></fig>
<sec id="s4_1"><label>4.1</label><title>Datasets</title>
<p>Armored Vehicle: We collected, organized, and annotated 4975 images through online searches and local shooting. The dataset contains 10250 labeled boxes; we use 3920 images as the training set, which contains 8022 instances, and the remaining 1057 as the validation set, containing 2210 instances. There is only one object class in the dataset, and its size distribution is shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. We label the data in COCO format to ensure that it can be used directly with multiple training architectures. Compared with the public benchmarks, in the Armored Vehicle dataset the object&#x2019;s viewing angle, distance, environment, scale, weather, and other characteristics are more varied, making small objects more difficult to detect.</p>
<p>GDUT-HWD [<xref ref-type="bibr" rid="ref-38">38</xref>]: This is a very common hard hat detection dataset in industrial scenarios, containing 3174 training images, consisting of 5 types of labeled boxes, which is a lightweight benchmark for small object detection.</p>
<p>Visdrone-2019 [<xref ref-type="bibr" rid="ref-9">9</xref>]: This is a small object dataset of large scenes from an aerial perspective, consisting of 10209 images and 2.6 million annotation boxes; it can be used to test both the detection performance of a model on small objects and its inference efficiency. Due to the large image size, we divide each image into 4 non-overlapping sub-images following [<xref ref-type="bibr" rid="ref-39">39</xref>].</p>
</sec>
<sec id="s4_2"><label>4.2</label><title>Metrics</title>
<p>We select mean average precision (mAP), APs, APm, and APl, which are commonly used in object detection tasks, as evaluation indicators; precision and recall are the basis for calculating each value. AP is the area under the precision-recall (P-R) curve, and its calculation formula is:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x222B;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:math></disp-formula></p>
<p>For datasets with multiple object classes, mAP is the average of AP over all classes, expressed as:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mi>m</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>c</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:munderover><mml:mi>A</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
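Eqs. (16) and (17) amount to numerically integrating each class's P-R curve and averaging. A minimal NumPy sketch with two made-up curves, where trapezoidal integration stands in for the integral:

```python
import numpy as np

def average_precision(recall, precision):
    # Eq. (16): area under the P-R curve, via trapezoidal integration.
    r, p = np.asarray(recall), np.asarray(precision)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((p[1:] + p[:-1]) * np.diff(r)) / 2.0)

def mean_ap(per_class_curves):
    # Eq. (17): mAP is the mean of the per-class AP values.
    return float(np.mean([average_precision(r, p) for r, p in per_class_curves]))

curves = [
    ([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]),   # class 1 (synthetic)
    ([0.0, 0.5, 1.0], [1.0, 0.6, 0.2]),   # class 2 (synthetic)
]
result = mean_ap(curves)   # per-class APs of 0.8 and 0.6 average to 0.7
```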
<p>APs refers to objects smaller than 32&#x2009;&#x00D7;&#x2009;32; similarly, APm and APl correspond to 96&#x2009;&#x00D7;&#x2009;96 and 128&#x2009;&#x00D7;&#x2009;128, respectively. In the experiments, following the practice of [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>], we also report mAP50, i.e., the mAP calculated at IOU&#x2009;&#x003D;&#x2009;0.5.</p>
</sec>
<sec id="s4_3"><label>4.3</label><title>Settings</title>
<p>All experiments in this paper are based on the mmdetection framework, which ensures the fairness and reproducibility of the tests. We adopt a single-scale training strategy: the input image size is uniformly limited to 640&#x2009;&#x00D7;&#x2009;640 (1280&#x2009;&#x00D7;&#x2009;960 for the Visdrone-2019 dataset), and only random flipping is used for data augmentation in the preprocessing stage. The learning rate and the number of heads are determined through a grid search, as described in <xref ref-type="sec" rid="s4_6">Section 4.6</xref>; in the following experiments, the learning rate (lr) is 4E-2 and <italic>N</italic> is 8. Other parameters follow the default settings of mmdetection.</p>
</sec>
<sec id="s4_4"><label>4.4</label><title>Results in the Armored Vehicle Dataset</title>
<p>The experimental results are shown in <xref ref-type="table" rid="table-1">Table 1</xref>, from which it can be seen that the improved IT-Cascade-RCNN model achieves higher detection accuracy. The comparison shows that IT-Cascade-RCNN improves by 14.8&#x2005;mAP over the typical one-stage detection model YOLOx and by 12.8&#x2005;mAP over the typical two-stage model Sparse RCNN. In particular, IT-Cascade-RCNN also achieves better results than DINO and the diffusion-based DiffusionDET, exceeding them by 4.5 and 4.4&#x2005;mAP, respectively. Furthermore, the IT-transformer surpasses another attention-based model, adaptive training sample selection (ATSS) [<xref ref-type="bibr" rid="ref-40">40</xref>], by 6&#x2005;mAP. It is worth noting that although DINO and DiffusionDET achieve higher results under AP50, this advantage does not extend to other threshold conditions, and they fail to balance early-warning restrictions, object detection accuracy, and false alarm rate. In contrast, IT-Cascade-RCNN provides better results across the various IOU thresholds. Further, we find that the accuracy of the IT-transformer on large objects decreases: because the IT-transformer fuses global and local features in the middle layers, the local semantic features introduce some environmental interference, resulting in a smaller APl.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>The metrics in armored vehicle dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="left">mAP</th>
<th align="left">AP50</th>
<th align="left">APs</th>
<th align="left">APm</th>
<th align="left">APl</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Retina</td>
<td align="left">44.5</td>
<td align="left">84.1</td>
<td align="left">24.2</td>
<td align="left">51.2</td>
<td align="left">54.2</td>
</tr>
<tr>
<td align="left">Sparse RCNN</td>
<td align="left">44.6</td>
<td align="left">81.6</td>
<td align="left">30.1</td>
<td align="left">50.4</td>
<td align="left">54.8</td>
</tr>
<tr>
<td align="left">YOLOv3</td>
<td align="left">42.7</td>
<td align="left">82.4</td>
<td align="left">38</td>
<td align="left">49.9</td>
<td align="left">59</td>
</tr>
<tr>
<td align="left">YOLOx</td>
<td align="left">42.6</td>
<td align="left">80.1</td>
<td align="left">29.5</td>
<td align="left">49.6</td>
<td align="left">47.9</td>
</tr>
<tr>
<td align="left">ATSS</td>
<td align="left">51.4</td>
<td align="left">87.3</td>
<td align="left">34.4</td>
<td align="left">57.1</td>
<td align="left">69.1</td>
</tr>
<tr>
<td align="left">DINO</td>
<td align="left">52.9</td>
<td align="left">89</td>
<td align="left">36.1</td>
<td align="left">58.8</td>
<td align="left">69.7</td>
</tr>
<tr>
<td align="left">DiffusionDET</td>
<td align="left">53</td>
<td align="left">89</td>
<td align="left">40.2</td>
<td align="left">57</td>
<td align="left">62.1</td>
</tr>
<tr>
<td align="left">Cascade RCNN</td>
<td align="left">54.7</td>
<td align="left">87.3</td>
<td align="left">38.5</td>
<td align="left">60.1</td>
<td align="left">70.6</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">57.4</td>
<td align="left">88</td>
<td align="left">41</td>
<td align="left">63.5</td>
<td align="left">68.4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> visualizes the detection results of each model on the Armored Vehicle dataset and shows the behavior of the five models clearly. In the first row, Retina, YOLOx, and DINO produce serious false alarms, identifying empty regions as objects, while Faster RCNN fails to detect the objects at all; the improved model with the cross-attention mechanism detects the objects correctly. In the second row, Retina, Faster RCNN, and YOLOx miss detections; DINO detects all objects, but with lower accuracy than the improved model. In the third row, where the object is partially occluded, the first three models localize the object correctly but with limited accuracy, and DINO unfortunately misses an object. The fourth row shows the gap between the models most vividly: when the object is obscured by smoke and dust and its features are disturbed, Retina, YOLOx, and DINO fail to detect it, Faster RCNN produces an inaccurate detection, and the improved Cascade RCNN model gives an accurate result.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>Detection results (green circles indicate missed detections and yellow circles indicate false detections)</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-6.tif"/></fig>
</sec>
<sec id="s4_5"><label>4.5</label><title>Results in the GDUT-HWD Dataset</title>
<p>We also perform experiments on the lightweight GDUT-HWD dataset to test the ability of the IT-transformer to handle small object detection in industrial scenarios; the results are shown in <xref ref-type="table" rid="table-2">Table 2</xref>. IT-Cascade-RCNN again shows a clear performance advantage, improving on the typical one-stage detector YOLOx by 14.1&#x2005;mAP, on the two-stage detector Sparse RCNN by 16.9&#x2005;mAP, and on DINO by 13.3&#x2005;mAP. On the more challenging small-object subset, IT-Cascade-RCNN also achieves the highest accuracy of 34.3 APs, which is 2.1 higher than the Cascade-RCNN baseline. In summary, the results show that the IT-transformer effectively improves the detection performance of the model.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>The metrics in GDUT-HWD dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="left">mAP</th>
<th align="left">AP50</th>
<th align="left">APs</th>
<th align="left">APm</th>
<th align="left">APl</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Retina</td>
<td align="left">34.9</td>
<td align="left">59.1</td>
<td align="left">18.2</td>
<td align="left">48.5</td>
<td align="left">58</td>
</tr>
<tr>
<td align="left">Sparse RCNN</td>
<td align="left">33.4</td>
<td align="left">56.5</td>
<td align="left">20.3</td>
<td align="left">44.5</td>
<td align="left">53.4</td>
</tr>
<tr>
<td align="left">YOLOx</td>
<td align="left">36.2</td>
<td align="left">67.9</td>
<td align="left">22.2</td>
<td align="left">47.9</td>
<td align="left">52.8</td>
</tr>
<tr>
<td align="left">YOLOv5</td>
<td align="left">40.5</td>
<td align="left">75.6</td>
<td align="left">26.8</td>
<td align="left">52</td>
<td align="left">57.1</td>
</tr>
<tr>
<td align="left">YOLOv7</td>
<td align="left">34.5</td>
<td align="left">70.8</td>
<td align="left">24.1</td>
<td align="left">47.1</td>
<td align="left">37.2</td>
</tr>
<tr>
<td align="left">DINO</td>
<td align="left">37</td>
<td align="left">69.8</td>
<td align="left">19.3</td>
<td align="left">50.1</td>
<td align="left">65.6</td>
</tr>
<tr>
<td align="left">Cascade RCNN</td>
<td align="left">49.2</td>
<td align="left">79.7</td>
<td align="left">32.2</td>
<td align="left">62.5</td>
<td align="left">70.8</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">50.3</td>
<td align="left">80.8</td>
<td align="left">34.3</td>
<td align="left">63.1</td>
<td align="left">70.6</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-7">Fig. 7</xref> visualizes some of the detection results. Retina, Faster RCNN, YOLOx, and DINO all suffer serious missed detections: none of them detects the object marked by the green circle in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>. Retina and Faster RCNN also produce false detections, misjudging the category of the object marked by the yellow circle. Finally, Faster RCNN exhibits duplicate detection, detecting the object marked by the blue circle twice. Among the detected objects, the improved Cascade RCNN model yields higher confidence scores. Overall, the model improved by the cross-transformer shows better performance and effectively improves the detection accuracy for small objects.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Detection results (green circles indicate missed detections, yellow circles indicate false detections and blue circles indicate retests)</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-7.tif"/></fig>
</sec>
<sec id="s4_6"><label>4.6</label><title>Ablation Experiments</title>
<p>The IT-transformer we design is mainly affected by factors such as the learning rate, the normalization layer, and the number of attention heads. To test their impact on detection accuracy more reliably, we carry out ablation experiments on each of them in this part.</p>
<sec id="s4_6_1"><label>4.6.1</label><title>The Impact of lr</title>
<p>We use grid search to test the influence of different learning rates on the detection accuracy of the model. We sample 15 learning rates from 1E-3 to 5E-2 and run experiments on the GDUT-HWD dataset; the results are shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>The impact of lr</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-8.tif"/></fig>
<p>The results in <xref ref-type="fig" rid="fig-8">Fig. 8</xref> confirm that the learning rate has a considerable impact on detection accuracy: as the learning rate increases, accuracy trends upward, and the model achieves its best result of 48.9&#x2005;mAP when lr is set to 4E-2. We therefore set lr to 4E-2 throughout this paper.</p>
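The sweep described above can be sketched as a simple grid search. This is an illustrative skeleton, not the paper's training code; `evaluate` stands in for a full train-and-validate run that returns mAP for a given learning rate:

```python
def grid_search_lr(evaluate, lrs):
    """Return (best_lr, best_map) over candidate learning rates.

    `evaluate` is a hypothetical callable mapping lr -> validation mAP;
    in practice it would train the detector and score it on GDUT-HWD.
    """
    results = {lr: evaluate(lr) for lr in lrs}
    best_lr = max(results, key=results.get)
    return best_lr, results[best_lr]

# 15 evenly spaced candidates spanning 1e-3 to 5e-2, as in the sweep above
lrs = [1e-3 + i * (5e-2 - 1e-3) / 14 for i in range(15)]
```

With a real `evaluate`, the loop would select the lr near 4E-2 that Fig. 8 identifies as optimal.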
</sec>
<sec id="s4_6_2"><label>4.6.2</label><title>The Impact of Head Number</title>
<p>The multi-head attention mechanism determines from how many perspectives the interrelationships between features are extracted, and the number of attention heads is not simply the more the better, nor the fewer the better. Our ablation experiments confirm this. As shown in <xref ref-type="table" rid="table-3">Table 3</xref>, when the number of heads is too small, an effective attention mask cannot be generated and the interaction between feature maps is limited, so the detection head receives less useful feature information; when the number of heads is too large, too much redundant information is introduced, which also weakens the expressive ability of the features. The experimental results show that the model performs best with 8 heads.</p>
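The trade-off above follows from how multi-head attention partitions the embedding: each head attends within a subspace of `embed_dim / num_heads` channels, so more heads mean more relational views but narrower per-head subspaces. A minimal sketch of that constraint (the 256-d embedding width is an assumption for illustration, not taken from the paper):

```python
def head_dim(embed_dim, num_heads):
    """Per-head channel width in multi-head attention.

    embed_dim must divide evenly among the heads; each head then
    attends within its own embed_dim // num_heads-channel subspace.
    """
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads

# e.g. for a hypothetical 256-d feature, 8 heads give 32 channels each;
# 12 heads would not divide 256 evenly and is rejected
```

This is why head counts are usually swept over divisors of the embedding width, as in Table 3.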
<table-wrap id="table-3"><label>Table 3</label><caption><title>The results of different numbers of head</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Head</th>
<th align="left">mAP</th>
<th align="left">AP50</th>
<th align="left">APs</th>
<th align="left">APm</th>
<th align="left">APl</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">2</td>
<td align="left">48.1</td>
<td align="left">79.2</td>
<td align="left">32.1</td>
<td align="left">61.2</td>
<td align="left">68.3</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">48.1</td>
<td align="left">79.2</td>
<td align="left">31.8</td>
<td align="left">61.2</td>
<td align="left">69.3</td>
</tr>
<tr>
<td align="left">6</td>
<td align="left">48</td>
<td align="left">79.5</td>
<td align="left">31.4</td>
<td align="left">61</td>
<td align="left">69.2</td>
</tr>
<tr>
<td align="left">8</td>
<td align="left">48.9</td>
<td align="left">79.8</td>
<td align="left">33</td>
<td align="left">62</td>
<td align="left">69.1</td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">48.3</td>
<td align="left">79.3</td>
<td align="left">32</td>
<td align="left">61.6</td>
<td align="left">69</td>
</tr>
<tr>
<td align="left">12</td>
<td align="left">48.6</td>
<td align="left">79.6</td>
<td align="left">33.2</td>
<td align="left">61</td>
<td align="left">68.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="table-4">Table 4</xref>, we also experiment with the normalization layer in the transformer. The results show that the model performs better when no normalization layer is used. A possible reason is that normalization affects the representation of the middle-layer features: it forcibly concentrates the features around certain priors, weakens the model's ability to induce an effective bias on its own, and drowns out the middle-layer features that directly influence the detection results, causing detection accuracy to decline. Conversely, by reducing the constraints that prior knowledge places on the learning process and relying more on self-guided learning, the model can learn the common characteristics of different object features more effectively and thus detect more accurately.</p>
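The "forcible concentration" described above is visible in the normalization arithmetic itself: LayerNorm rescales every feature vector to zero mean and unit variance, discarding the vector's original scale. A minimal sketch (affine parameters omitted; the `None` branch mirrors the best-performing no-normalization configuration in Table 4):

```python
import math

def layer_norm(x, eps=1e-5):
    """Minimal LayerNorm over a 1-D feature vector, without the
    learnable affine parameters: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def maybe_normalize(x, norm=None):
    """norm is None (no normalization, as in the best run here),
    or a callable such as layer_norm."""
    return x if norm is None else norm(x)
```

Whatever magnitude information the middle-layer features carried is erased by the rescaling, which is consistent with the explanation given above for the accuracy drop.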
<table-wrap id="table-4"><label>Table 4</label><caption><title>The impact of normalized layer</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Norm</th>
<th align="left">mAP</th>
<th align="left">AP50</th>
<th align="left">APs</th>
<th align="left">APm</th>
<th align="left">APl</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">LN</td>
<td align="left">48.8</td>
<td align="left">79.9</td>
<td align="left">32.9</td>
<td align="left">61.9</td>
<td align="left">69.1</td>
</tr>
<tr>
<td align="left">BN</td>
<td align="left">48.4</td>
<td align="left">79.8</td>
<td align="left">32.2</td>
<td align="left">61.5</td>
<td align="left">68.8</td>
</tr>
<tr>
<td align="left">None</td>
<td align="left">48.9</td>
<td align="left">79.8</td>
<td align="left">33</td>
<td align="left">62</td>
<td align="left">69.1</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_6_3"><label>4.6.3</label><title>The Impact of Kernel Size</title>
<p>The IT-transformer integrates the ability of convolutional modules to capture local semantic features. Local semantic features provide additional context and reference information for identifying small objects and help achieve accurate classification and localization. To determine a suitable receptive field, we conduct a comparative experiment on the convolution kernel size on the Armored Vehicle dataset; the results are shown in <xref ref-type="table" rid="table-5">Table 5</xref>.</p>
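The (size, stride, padding) triples in Table 5 are chosen so the output feature map keeps its spatial size: at stride 1, an odd kernel k needs padding (k − 1)/2. A small sketch of the standard convolution size arithmetic:

```python
def conv_out_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution:
    floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

def same_padding(k):
    """Padding that preserves spatial size at stride 1 for odd kernel k."""
    return (k - 1) // 2

# reproduces Table 5's triples: (1,1,0), (3,1,1), (5,1,2), (7,1,3), (9,1,4)
triples = [(k, 1, same_padding(k)) for k in (1, 3, 5, 7, 9)]
```

Holding the output size fixed isolates the receptive-field effect, so the accuracy differences in Table 5 come from the kernel size alone.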
<table-wrap id="table-5"><label>Table 5</label><caption><title>The impact of kernel size (ensure that the size of the output feature map remains unchanged)</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="center" colspan="3">Kernel size</th>
<th align="left" rowspan="2">mAP</th>
<th align="left" rowspan="2">AP50</th>
<th align="left" rowspan="2">APs</th>
<th align="left" rowspan="2">APm</th>
<th align="left" rowspan="2">APl</th>
<th align="left" rowspan="2">Parameters (G)</th>
<th align="left" rowspan="2">TFLOPs</th>
</tr>
<tr>
<th align="left">Size</th>
<th align="left">Stride</th>
<th align="left">Padding</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">1</td>
<td align="left">0</td>
<td align="left">55.5</td>
<td align="left">87.3</td>
<td align="left">39.1</td>
<td align="left">61.8</td>
<td align="left">71.9</td>
<td align="left">0.0774</td>
<td align="left">0.114</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">1</td>
<td align="left">1</td>
<td align="left">56.9</td>
<td align="left">87.1</td>
<td align="left">40.1</td>
<td align="left">63.1</td>
<td align="left">70</td>
<td align="left">0.0879</td>
<td align="left">0.171</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">1</td>
<td align="left">2</td>
<td align="left">56.9</td>
<td align="left">87.1</td>
<td align="left">40.7</td>
<td align="left">62.9</td>
<td align="left">69.4</td>
<td align="left">0.109</td>
<td align="left">0.285</td>
</tr>
<tr>
<td align="left">7</td>
<td align="left">1</td>
<td align="left">3</td>
<td align="left">57.4</td>
<td align="left">88</td>
<td align="left">41</td>
<td align="left">63.5</td>
<td align="left">68.4</td>
<td align="left">0.14</td>
<td align="left">0.457</td>
</tr>
<tr>
<td align="left">9</td>
<td align="left">1</td>
<td align="left">4</td>
<td align="left">56.6</td>
<td align="left">87</td>
<td align="left">40.3</td>
<td align="left">62.7</td>
<td align="left">65.7</td>
<td align="left">0.182</td>
<td align="left">0.686</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It can be seen from <xref ref-type="table" rid="table-5">Table 5</xref> that the convolution kernel size has a significant impact on the IT-transformer. As the kernel size grows, the receptive field of the intermediate feature fusion also grows, improving the usefulness of the middle-layer features; this is reflected in a steady improvement across the metrics, peaking at a kernel size of 7 with 57.4&#x2005;mAP and 41 APs. However, when the kernel size increases further, e.g., to 9, too many environmental features are fused into the middle-layer features, interfering with their utilization and causing detection accuracy to decline. At the same time, a larger kernel inevitably increases the model's parameter count and computational cost, but the gain in detection accuracy justifies the expense.</p>
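The parameter growth in Table 5 follows directly from the convolution parameter count, which is quadratic in the kernel size. A sketch of the arithmetic (the 256 channel counts are assumed for illustration; they are not the paper's actual widths):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k conv layer:
    c_out * (c_in * k * k + 1 if bias else c_in * k * k)."""
    return c_out * (c_in * k * k + (1 if bias else 0))

# quadratic growth in k, matching the monotone rise of the
# Parameters column in Table 5 (channel widths are illustrative)
growth = [conv2d_params(256, 256, k) for k in (1, 3, 5, 7, 9)]
```

The FLOPs column grows the same way, since each output position multiplies the per-position cost by the same k-squared factor.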

</sec>
<sec id="s4_6_4"><label>4.6.4</label><title>The Result of Plug-and-Play</title>
<p>As mentioned earlier, the IT-transformer is plug-and-play and can significantly improve accuracy. We select typical one-stage and two-stage detection models, namely Retina, Faster RCNN, and Cascade RCNN, and run experiments on the GDUT-HWD, Armored Vehicle, and Visdrone-2019 datasets; the results are shown in <xref ref-type="table" rid="table-6">Table 6</xref>. With the IT-transformer added, each baseline model achieves a clear performance gain: on GDUT-HWD, the mAP of Faster RCNN and Cascade RCNN increases by 8.8 and 1.1, respectively; on the Armored Vehicle dataset, the accuracy of Retina improves by 4.1&#x2005;mAP, with APs up by 20.56&#x0025;, compared with 6.18&#x0025; for APm and 0.95&#x0025; for APl. The IT-transformer&#x2019;s effect on model performance is also reflected on Visdrone-2019.</p>
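The plug-and-play property amounts to inserting the module between a detector's feature extractor and its head without modifying either. A structural sketch only; the class and method names (`extract_features`, `head`, `it_module`) are hypothetical and do not come from the paper's code:

```python
class WithITModule:
    """Illustrative plug-and-play wrapper: the inserted module refines
    the mid-level features and leaves the baseline detector's own
    extractor and head untouched."""

    def __init__(self, detector, it_module):
        self.detector = detector      # any baseline (Retina, Cascade RCNN, ...)
        self.it_module = it_module    # the inserted feature-refinement module

    def forward(self, image):
        feats = self.detector.extract_features(image)
        feats = self.it_module(feats)  # the only change to the pipeline
        return self.detector.head(feats)
```

Because the wrapper changes neither the extractor nor the head, the same module can be dropped into one-stage and two-stage baselines alike, which is what Table 6 exercises.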
<table-wrap id="table-6"><label>Table 6</label><caption><title>The results of plug-and-play</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Dataset</th>
<th align="left">Model</th>
<th align="left">mAP</th>
<th align="left">AP50</th>
<th align="left">APs</th>
<th align="left">APm</th>
<th align="left">APl</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="4">GDUT-HWD</td>
<td align="left">Faster RCNN</td>
<td align="left">40.1</td>
<td align="left">73</td>
<td align="left">23.2</td>
<td align="left">53.6</td>
<td align="left">62.4</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">48.9</td>
<td align="left">79.4</td>
<td align="left">33</td>
<td align="left">62</td>
<td align="left">69.1</td>
</tr>
<tr>
<td align="left">Cascade RCNN</td>
<td align="left">49.2</td>
<td align="left">79.7</td>
<td align="left">32.2</td>
<td align="left">62.5</td>
<td align="left">70.8</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">50.3</td>
<td align="left">80.8</td>
<td align="left">34.3</td>
<td align="left">63.1</td>
<td align="left">70.6</td>
</tr>
<tr>
<td align="left" rowspan="6">Armored vehicle</td>
<td align="left">Faster RCNN</td>
<td align="left">51.7</td>
<td align="left">86.4</td>
<td align="left">35.6</td>
<td align="left">57.8</td>
<td align="left">67.2</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">52.8</td>
<td align="left">86.3</td>
<td align="left">36</td>
<td align="left">59.1</td>
<td align="left">68.1</td>
</tr>
<tr>
<td align="left">Retina</td>
<td align="left">44.7</td>
<td align="left">82.8</td>
<td align="left">24.8</td>
<td align="left">51.8</td>
<td align="left">63.1</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">48.8</td>
<td align="left">85.4</td>
<td align="left">29.9</td>
<td align="left">55</td>
<td align="left">63.7</td>
</tr>
<tr>
<td align="left">Cascade RCNN</td>
<td align="left">54.7</td>
<td align="left">87.3</td>
<td align="left">38.5</td>
<td align="left">60.1</td>
<td align="left">70.6</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">57.4</td>
<td align="left">88</td>
<td align="left">41</td>
<td align="left">63.5</td>
<td align="left">68.4</td>
</tr>
<tr>
<td align="left" rowspan="4">Visdrone-2019</td>
<td align="left">Faster RCNN</td>
<td align="left">18.7</td>
<td align="left">32.8</td>
<td align="left">9.7</td>
<td align="left">30.7</td>
<td align="left">48.6</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">19.8</td>
<td align="left">33.6</td>
<td align="left">10.6</td>
<td align="left">32.4</td>
<td align="left">48.5</td>
</tr>
<tr>
<td align="left">Cascade RCNN</td>
<td align="left">20.7</td>
<td align="left">34</td>
<td align="left">10.6</td>
<td align="left">33.5</td>
<td align="left">49</td>
</tr>
<tr>
<td align="left">&#x002B;IT-transformer</td>
<td align="left">22</td>
<td align="left">35.5</td>
<td align="left">11.9</td>
<td align="left">35.1</td>
<td align="left">49.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Experimental results show that the IT-transformer designed in this paper is indeed plug-and-play and can be used directly with many types of baseline models. <xref ref-type="fig" rid="fig-9">Fig. 9</xref> shows the results on the Visdrone-2019 dataset, where we compare the detection results of the Cascade RCNN model before and after adding the cross-transformer. The cross-transformer significantly reduces both the false detections and the missed detections of Cascade RCNN.</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>The result in Visdrone-2019 (yellow circle indicates false detection, green circle represents missed detection)</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_44284-fig-9.tif"/></fig>
</sec>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>For the challenging small object detection task, we first analyze and organize the existing solution ideas and summarize them into four basic methods. Combining these with the current mainstream attention mechanisms and building on the traditional transformer model, with the goals of compressing the number of model parameters and strengthening the coupling of middle-layer features, we design a cross-K-value transformer model with a double-branch structure and apply the idea of multi-head attention to the processing of the channel attention mask. Experiments on the self-built Armored Vehicle dataset and two additional benchmarks verify the improved Cascade RCNN model based on the cross-transformer, which achieves a higher detection level. Finally, by combining the cross-transformer with existing one-stage and two-stage detection models, the ablation experiments confirm that the cross-transformer has good plug-and-play performance and effectively improves the detection results of each baseline. In addition, we collect and collate an Armored Vehicle dataset containing one class of military ground objects, providing data support for related research.</p>
</sec>
</body>
<back>
<ack>
<p>None.</p>
</ack>
<sec><title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p></sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Qinzhao Wang; data collection, analysis, and interpretation of results: Jian Wei; draft manuscript preparation: Zixu Zhao. All authors reviewed the results and approved the final version of the manuscript.</p></sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>Data will be available on request.</p></sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p></sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Hua</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>DKTNet: Dual-key transformer network for small object detection</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>525</volume>, pp. <fpage>29</fpage>&#x2013;<lpage>41</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ding</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Yu</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Dynamic coarse-to-fine learning for oriented tiny object detection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Vancouver, Canada</conf-loc>, pp. <fpage>7318</fpage>&#x2013;<lpage>7328</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Fan</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Multi background island bird detection based on faster R-CNN</article-title>,&#x201D; <source>Cybernetics and Systems</source>, vol. <volume>52</volume>, pp. <fpage>26</fpage>&#x2013;<lpage>35</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Xiang</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Pan</surname></string-name></person-group>, &#x201C;<article-title>AugFPN: Improving multi-scale feature learning for object detection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, pp. <fpage>12595</fpage>&#x2013;<lpage>12604</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Ghiasi</surname></string-name>, <string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>NASFPN: Learning scalable feature pyramid architecture for object detection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Long Beach, CA, USA</conf-loc>, pp. <fpage>7036</fpage>&#x2013;<lpage>7045</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Ou</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>ScaleKD: Distilling scale-aware knowledge in small object detector</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Vancouver, Canada</conf-loc>, pp. <fpage>19723</fpage>&#x2013;<lpage>19732</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Ge</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Yu</surname></string-name></person-group>, &#x201C;<article-title>GraphFPN: Graph feature pyramid network for object detection</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>2763</fpage>&#x2013;<lpage>2772</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Fang</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>A<sup>2</sup>-FPN: Attention aggregation-based feature pyramid network for instance segmentation</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Nashville, TN, USA</conf-loc>, pp. <fpage>15343</fpage>&#x2013;<lpage>15352</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>QueryDet: Cascaded sparse query for accelerating high-resolution small object detection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, pp. <fpage>13668</fpage>&#x2013;<lpage>13677</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Goyal</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Doll&#x00E1;r</surname></string-name></person-group>, &#x201C;<article-title>Focal loss for dense object detection</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>2980</fpage>&#x2013;<lpage>2988</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name></person-group>, &#x201C;<article-title>YOLOv3: An incremental improvement</article-title>,&#x201D; <comment>arXiv preprint arXiv:1804.02767</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Ge</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>YOLOX: Exceeding YOLO series in 2021</article-title>,&#x201D; <comment>arXiv preprint arXiv:2107.08430</comment>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xiong</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Zhang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deformable convolutional networks</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>764</fpage>&#x2013;<lpage>773</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Dai</surname></string-name></person-group>, &#x201C;<article-title>Deformable ConvNets V2: More deformable, better results</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Long Beach, CA, USA</conf-loc>, pp. <fpage>9308</fpage>&#x2013;<lpage>9316</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Adaptive graph convolutional network with attention graph clustering for co-saliency detection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, pp. <fpage>9050</fpage>&#x2013;<lpage>9059</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Attentional pyramid pooling of salient visual residuals for place recognition</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>865</fpage>&#x2013;<lpage>874</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lyu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>2778</fpage>&#x2013;<lpage>2788</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>30</volume>, pp. <fpage>34</fpage>&#x2013;<lpage>45</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Dosovitskiy</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Beyer</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kolesnikov</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Weissenborn</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhai</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>An image is worth 16 &#x00D7; 16 words: Transformers for image recognition at scale</article-title>,&#x201D; <comment>arXiv preprint arXiv:2010.11929</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wei</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>10012</fpage>&#x2013;<lpage>10022</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deformable DETR: Deformable transformers for end-to-end object detection</article-title>,&#x201D; <comment>arXiv preprint arXiv:2010.04159</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Bochkovskiy</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>H. M.</given-names> <surname>Liao</surname></string-name></person-group>, &#x201C;<article-title>YOLOv4: Optimal speed and accuracy of object detection</article-title>,&#x201D; <comment>arXiv preprint arXiv:2004.10934</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name></person-group>, &#x201C;<article-title>YOLO9000: Better, faster, stronger</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>7263</fpage>&#x2013;<lpage>7271</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Anguelov</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Reed</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>SSD: Single shot multibox detector</article-title>,&#x201D; in <conf-name>Proc. of ECCV</conf-name>, <conf-loc>Amsterdam, Netherlands</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>37</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Goyal</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Doll&#x00E1;r</surname></string-name></person-group>, &#x201C;<article-title>Focal loss for dense object detection</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>2980</fpage>&#x2013;<lpage>2988</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>28</volume>, pp. <fpage>91</fpage>&#x2013;<lpage>99</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Vasconcelos</surname></string-name></person-group>, &#x201C;<article-title>Cascade R-CNN: High quality object detection and instance segmentation</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>43</volume>, pp. <fpage>1483</fpage>&#x2013;<lpage>1498</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Su</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>DINO: DETR with improved denoising anchor boxes for end-to-end object detection</article-title>,&#x201D; <comment>arXiv preprint arXiv:2203.03605</comment>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Song</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Luo</surname></string-name></person-group>, &#x201C;<article-title>DiffusionDet: Diffusion model for object detection</article-title>,&#x201D; <comment>arXiv preprint arXiv:2211.09788</comment>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>I. E.</given-names> <surname>Hamamci</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Er</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Simsar</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sekuboyina</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Gundogar</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Diffusion-based hierarchical multi-label object detection to analyze panoramic dental X-rays</article-title>,&#x201D; <comment>arXiv preprint arXiv:2303.06500</comment>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Nag</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Song</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Xiang</surname></string-name></person-group>, &#x201C;<article-title>DiffTAD: Temporal action detection with proposal denoising diffusion</article-title>,&#x201D; <comment>arXiv preprint arXiv:2303.14863</comment>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>H. Y.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>J. G.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>C. T.</given-names> <surname>Zhao</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>A novel data augmentation scheme for pedestrian detection with attribute preserving GAN</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>401</volume>, pp. <fpage>123</fpage>&#x2013;<lpage>132</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhen</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Hua</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Garg</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Edge YOLO: Real-time intelligent object detection system based on edge-cloud cooperation in autonomous vehicles</article-title>,&#x201D; <source>IEEE Transactions on Intelligent Transportation Systems</source>, vol. <volume>23</volume>, pp. <fpage>25345</fpage>&#x2013;<lpage>25360</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Bosquet</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Cores</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Seidenari</surname></string-name>, <string-name><given-names>V. M.</given-names> <surname>Brea</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Mucientes</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>A full data augmentation pipeline for small object detection based on generative adversarial networks</article-title>,&#x201D; <source>Pattern Recognition</source>, vol. <volume>133</volume>, pp. <fpage>108998</fpage>&#x2013;<lpage>109007</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Ling</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Han</surname></string-name></person-group>, &#x201C;<article-title>An improved algorithm for small object detection based on YOLO v4 and multi-scale contextual information</article-title>,&#x201D; <source>Computers and Electrical Engineering</source>, vol. <volume>105</volume>, pp. <fpage>108490</fpage>&#x2013;<lpage>108499</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J. R.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J. C.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>T. F.</given-names> <surname>Zhang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>SCRDet: Towards more robust detection for small, cluttered and rotated objects</article-title>,&#x201D; in <conf-name>Proc. of ICCV</conf-name>, <conf-loc>Seoul, Korea (South)</conf-loc>, pp. <fpage>8232</fpage>&#x2013;<lpage>8241</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Vidit</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Salzmann</surname></string-name></person-group>, &#x201C;<article-title>Attention-based domain adaptation for single-stage detectors</article-title>,&#x201D; <source>Machine Vision and Applications</source>, vol. <volume>33</volume>, pp. <fpage>65</fpage>&#x2013;<lpage>74</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset</article-title>,&#x201D; <source>Automation in Construction</source>, vol. <volume>106</volume>, pp. <fpage>102894</fpage>&#x2013;<lpage>102915</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Fang</surname></string-name></person-group>, &#x201C;<article-title>HRDNet: High-resolution detection network for small objects</article-title>,&#x201D; in <conf-name>Proc. of ICME</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Chi</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Lei</surname></string-name> and <string-name><given-names>S. Z.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection</article-title>,&#x201D; in <conf-name>Proc. of CVPR</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, pp. <fpage>9759</fpage>&#x2013;<lpage>9768</lpage>, <year>2020</year>.</mixed-citation></ref>
</ref-list>
</back></article>