<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">46068</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.046068</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Real-Time Small Target Vehicle Detection Algorithm with an Improved YOLOv5m Network Model</article-title>
<alt-title alt-title-type="left-running-head">A Real-Time Small Target Vehicle Detection Algorithm with an Improved YOLOv5m Network Model</alt-title>
<alt-title alt-title-type="right-running-head">A Real-Time Small Target Vehicle Detection Algorithm with an Improved YOLOv5m Network Model</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Du</surname><given-names>Yaoyao</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Jiang</surname><given-names>Xiangkui</given-names></name><email>jiangxiangkui@xupt.edu.cn</email></contrib>
<aff><institution>School of Automation, Xi&#x2019;an University of Posts and Telecommunications</institution>, <addr-line>Xi&#x2019;an, 710121</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Xiangkui Jiang. Email: <email>jiangxiangkui@xupt.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>30</day>
<month>1</month>
<year>2024</year></pub-date>
<volume>78</volume>
<issue>1</issue>
<fpage>303</fpage>
<lpage>327</lpage>
<history>
<date date-type="received">
<day>17</day>
<month>9</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>11</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Du and Jiang</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Du and Jiang</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_46068.pdf"></self-uri>
<abstract>
<p>To address the challenges of high complexity, poor real-time performance, and low detection rates for small target vehicles in existing vehicle object detection algorithms, this paper proposes a real-time lightweight architecture based on You Only Look Once (YOLO) v5m. Firstly, a lightweight upsampling operator called Content-Aware Reassembly of Features (CARAFE) is introduced in the feature fusion layer of the network to maximize the extraction of deep-level features for small target vehicles, reducing the missed detection rate and false detection rate. Secondly, a new prediction layer for tiny targets is added, and the feature fusion network is redesigned to enhance the detection capability for small targets. Finally, this paper applies L1 regularization to train the improved network, followed by pruning and fine-tuning operations to remove redundant channels, reducing computational and parameter complexity and enhancing the detection efficiency of the network. Training is conducted on the VisDrone2019-DET dataset. The experimental results show that the proposed algorithm reduces parameters and computation by 63.8% and 65.8%, respectively. The average detection accuracy improves by 5.15%, and the detection speed reaches 47 images per second, satisfying real-time requirements. Compared with existing approaches, including YOLOv5m and classical vehicle detection algorithms, our method achieves higher accuracy and faster speed for real-time detection of small target vehicles in edge computing.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Vehicle detection</kwd>
<kwd>YOLOv5m</kwd>
<kwd>small target</kwd>
<kwd>channel pruning</kwd>
<kwd>CARAFE</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>General Project of Key Research and Development Plan of Shaanxi Province</funding-source>
<award-id>2022NY-087</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>In recent years, urban traffic has increasingly been transformed into smart systems, driven by the accelerating construction of smart cities and the explosive growth of vehicle ownership. Object detection is essential in computer vision and is a prerequisite technology for many practical problems, such as traffic scene analysis, intelligent driving, and security monitoring [<xref ref-type="bibr" rid="ref-1">1</xref>]. Therefore, vehicle object detection is recognized as a key technology and core component in research on intelligent vehicle systems and intelligent transportation systems, and it has significant research value and practical importance.</p>
<p>Vehicle object detection has developed through two main stages: traditional algorithms and deep learning-based methods [<xref ref-type="bibr" rid="ref-2">2</xref>&#x2013;<xref ref-type="bibr" rid="ref-4">4</xref>]. Traditional vehicle detection algorithms mostly rely on sliding windows and manually designed, complex feature representations to accomplish the detection task, but their performance is often limited by the quality and quantity of the hand-crafted features, and they are computationally expensive and less robust in complex scenarios. In comparison, deep learning-based vehicle detection algorithms learn high-level feature representations and achieve better performance across a wide range of computer vision problems, offering high accuracy, fast speed, and strong robustness under complex conditions. With the advancement of convolutional neural networks, the current trend in vehicle detection is to build deeper and more complex networks to achieve higher accuracy. However, improving accuracy often comes at a cost: existing vehicle detection algorithms exhibit high complexity, large parameter counts, and heavy computational requirements, rendering them unsuitable for deployment on mobile and terminal devices with limited hardware resources. Moreover, real-world scenarios pose challenges for vehicle detection, including small target sizes, high speeds, complex scenes, limited extractable features, and significant scale variations. These issues render existing detection algorithms inadequate for detecting small vehicle targets. Therefore, based on an analysis and comparison of existing deep learning object detection models, this study proposes an optimized vehicle detection model that specifically addresses the shortcomings of previous studies, such as high detection cost, poor detection rates, and inaccurate detection of small vehicle targets [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
<p>The remainder of this paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> reviews related research on vehicle object detection. <xref ref-type="sec" rid="s3">Section 3</xref> presents the benchmark detection model selected in this article. <xref ref-type="sec" rid="s4">Section 4</xref> describes the proposed improvements in detail. The experimental settings, results, analysis, and comparisons with other models are presented in <xref ref-type="sec" rid="s5">Section 5</xref>. The final section presents the conclusion.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>Currently, vehicle object detection algorithms based on deep learning can be grouped into two types according to their detection methods. The first type is the two-stage detection algorithm based on candidate boxes, with RCNN as its typical representative [<xref ref-type="bibr" rid="ref-7">7</xref>]. It predicts the final target box by generating a set of candidate boxes and then classifying and regressing these boxes. The second type is the one-stage detection algorithm based on regression, with YOLO as its typical representative [<xref ref-type="bibr" rid="ref-8">8</xref>&#x2013;<xref ref-type="bibr" rid="ref-11">11</xref>]. It directly convolves and pools the image to generate candidate boxes and performs classification and regression at the same time to detect the vehicle object. Two-stage detection algorithms are more accurate but have slower detection speed and higher computational complexity. In contrast, one-stage detection algorithms focus more on the balance between detection speed and accuracy and are therefore widely used for vehicle detection.</p>
<p>Yang et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] proposed an improved vehicle detection system that combines the YOLOv2 detection algorithm with the long short-term memory (LSTM) model. Initially, all vehicle categories were amalgamated and low-level features were removed. The detected targets were then enhanced by a dual-layer LSTM model (dLSTM) to improve the accuracy of detecting vehicle objects. However, the improved model is somewhat unwieldy and does not effectively reduce computational load. Stuparu et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed a high-performance single-stage vehicle detection model based on the RetinaNet architecture and the Cars Overhead With Context (COWC) dataset. The model&#x2019;s accuracy improved to 0.7232, and the detection time is approximately 300 ms. Zhang et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a vehicle detection network (SGMFNet) using self-attention. This network adds the Global-Local Feature Guidance (GLFG) module, the Parallel Sample Feature Fusion (PSFF) module, and the Inverted-residual Feature Enhancement (IFE) module to improve the feature extraction capability and multi-scale feature fusion effect for small vehicle targets. However, the network is too large to be deployed on embedded devices. Zhao et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed an attention-based inverted residual block structure called ESGBlock to replace the original backbone network in the YOLOv5 detection algorithm. This method effectively reduces parameters and computation. The GSConv module is introduced in the feature fusion layer with knowledge distillation to address the problems of high-dimensional feature information loss and complexity. Although the improved algorithm better meets the requirements of lightweight deployment on embedded devices, it sacrifices part of the detection accuracy and does not improve the detection of small objects. Mao et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] introduced the Spatial Pyramid Pooling (SPP) module into the YOLOv3 detection algorithm and combined it with Soft Non-Maximum Suppression (Soft-NMS) and inverted residual technology to better detect small and occluded vehicles. However, the network structure of the model is too complex, which makes training difficult. Cheng et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] optimized the YOLOv4 detection algorithm by introducing lightweight backbone networks such as MobileNetV3, as well as the Multiscale-PANet and soft-merge modules, which improved the mAP index to 90.62% while achieving 54 FPS with 11.54 M parameters. These optimizations simplified the model and accelerated detection. However, the model&#x2019;s detection accuracy did not improve significantly. Liu et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed a lightweight feature extraction network, Light CarNet, based on the YOLOv4 detection algorithm. According to the characteristics of detecting vehicle targets at different scales, a four-scale bidirectional weighted feature fusion module was designed for classification and regression. The experiments demonstrated a 1.14% increase in mAP and improved detection of small vehicle targets while maintaining real-time detection. However, this method also increased the complexity and number of parameters of the model.</p>
<p>The above-mentioned methods have contributed to vehicle object detection, but three important issues still urgently need to be addressed.</p>
<p>(1) Detection performance for small and blurred vehicle targets is poor.</p>
<p>(2) The complex network structures require substantial computing resources and make model training costly.</p>
<p>(3) In practical applications, it is essential not only to meet the requirement for high accuracy but also to consider real-time performance.</p>
<p>To address these issues, this paper proposes a real-time and efficient small target vehicle detection algorithm based on the YOLOv5m detection algorithm, with lightweight improvements that enhance the detection of small target vehicles. In summary, the contributions of this paper are as follows.</p>
<p>(1) To enhance the model&#x2019;s utilization of deep semantic information, this paper proposes the use of a lightweight upsampling operator instead of the traditional nearest-neighbor interpolation operator. By reorganizing contextual features, it effectively improves the detection of vehicle objects in complex scenes.</p>
<p>(2) To improve the detection of small target vehicles, this paper introduces a dedicated small target prediction layer within the model&#x2019;s output prediction layers. This enhances the model&#x2019;s focus on small target vehicles.</p>
<p>(3) This paper proposes an enhanced FPN &#x002B; PANet architecture to improve the fusion capability of the model for small target vehicle features.</p>
<p>(4) Based on the contribution of model channels to model performance, this paper applies channel pruning and compression operations to the improved YOLOv5m network model. It removes redundant channels, thereby reducing model complexity while further improving network detection performance.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>YOLOv5m Algorithm</title>
<p>The YOLO series of network models has become one of the top-performing models in object detection due to its balance of speed and accuracy. YOLOv5 is the fifth generation of the YOLO series algorithm [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-21">21</xref>], proposed by Ultralytics in May 2020. There are five pre-trained models of YOLOv5, which differ by the width and depth parameters and are named YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. <xref ref-type="table" rid="table-1">Table 1</xref> shows the width and depth parameters and the corresponding model sizes for each pre-trained model.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Summary of the network model parameters for the YOLOv5 algorithm</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Parameter categories</th>
<th>YOLOv5n</th>
<th>YOLOv5s</th>
<th>YOLOv5m</th>
<th>YOLOv5l</th>
<th>YOLOv5x</th>
</tr>
</thead>
<tbody>
<tr>
<td>Width</td>
<td>0.25</td>
<td>0.50</td>
<td>0.75</td>
<td>1.0</td>
<td>1.25</td>
</tr>
<tr>
<td>Depth</td>
<td>0.33</td>
<td>0.33</td>
<td>0.67</td>
<td>1.0</td>
<td>1.33</td>
</tr>
<tr>
<td>Model size</td>
<td>3.87 MB</td>
<td>14.1 MB</td>
<td>40.8 MB</td>
<td>89.3 MB</td>
<td>166 MB</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Among the five pre-trained models, detection speed decreases as model size and detection accuracy increase. Therefore, balancing detection speed, accuracy, and model size, this research selects the YOLOv5m-6.0 architecture as the basis for optimizing and improving the small target vehicle detection algorithm. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> depicts the network structure of YOLOv5m-6.0, which comprises four main parts: input, backbone, neck, and prediction. The training images have a size of 640 pixels by 640 pixels and consist of three color channels.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>YOLOv5m-6.0 algorithm network model structure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-1.tif"/>
</fig>
<p>The input network applies mosaic data augmentation to the input vehicle images by simultaneously reading in four images and randomly scaling, cropping, and arranging them before stitching them together. By incorporating information from these diverse images, the training data are enriched, improving the model&#x2019;s robustness and enhancing small target vehicle detection in this paper. Moreover, the network employs an adaptive anchor box calculation method to set optimal initial anchor boxes, facilitating iterative optimization of network parameters during training. Additionally, the original image is preprocessed to a size of <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mn>640</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>640</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> using adaptive image scaling, which reduces redundant information and improves model inference speed compared to traditional resize operations.</p>
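<p>As an illustration of the adaptive image scaling step described above, the following is a minimal Python sketch of letterbox-style preprocessing (assuming OpenCV and NumPy are available); it is a simplified illustration under these assumptions, not the exact YOLOv5 implementation, which additionally aligns the padding to the network stride.</p>
<preformat preformat-type="code">
# Minimal sketch of adaptive image scaling (letterbox padding).
# Assumptions: 3-channel input image, target size 640, grey padding value 114.
import cv2
import numpy as np

def letterbox(img, new_size=640, pad_value=114):
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)                 # keep the aspect ratio
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh))                     # cv2 expects (width, height)
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized          # centre the image, pad the rest
    return canvas, scale, (left, top)
</preformat>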
<p>The backbone network of YOLOv5m-6.0 has three primary components: the CBS (kernel size (k) &#x003D; 6, stride (s) &#x003D; 2, padding (p) &#x003D; 2) module, CSPDarkNet53, and the Spatial Pyramid Pooling Fast (SPPF) module. In this context, k &#x003D; 6 represents a convolution kernel size of 6 &#x00D7; 6, s &#x003D; 2 indicates that the kernel slides 2 pixels at a time, and p &#x003D; 2 signifies zero-padding of 2 pixels around the input image. Compared with the old version, YOLOv5m-6.0 replaces the Focus module with the 6 &#x00D7; 6 CBS convolution described above for ease of model export and employs multiple small pooling kernels in the SPPF module, rather than a single large kernel as in the SPP module. This enhances the network&#x2019;s ability to recognize fuzzy small targets in complex backgrounds while also improving its computational speed. Earlier versions of YOLOv5 used the Leaky Rectified Linear Unit (Leaky ReLU) as the activation function, with its calculation formula and derivative shown in <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref> and <xref ref-type="disp-formula" rid="eqn-2">(2)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mtext>Leaky ReLU</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mtext>Leaky ReLU</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>&#x2032;</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <italic>&#x03B1;</italic> is generally set to 0.01, and its corresponding function and derivative function curves are shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Leaky ReLU activation function and its derivatives</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-2.tif"/>
</fig>
<p>Because Leaky ReLU is a piecewise function, it does not treat positive and negative input values in the same way, which can lead to unstable performance, and its derivative contains points of discontinuity.</p>
<p>Therefore, YOLOv5m-6.0 uses the Sigmoid Weighted Linear Unit (SiLU) instead of the old version of Leaky ReLU as the new activation function. Its calculation formula and derivatives are given by <xref ref-type="disp-formula" rid="eqn-3">Eqs. (3)</xref> and <xref ref-type="disp-formula" rid="eqn-4">(4)</xref>, and their corresponding function and derivative function curves are shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>SiLU activation function and its derivatives</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-3.tif"/>
</fig>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mtext>SiLU</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:mtext>SiLU</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>&#x2032;</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:msup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mfrac></mml:math></disp-formula></p>
<p>The SiLU activation function is better suited to deep models because it is differentiable everywhere and its derivative is continuous, smooth, and non-monotonic.</p>
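<p>For reference, the two activation functions and the SiLU derivative of <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-4">(4)</xref> can be evaluated numerically with the short sketch below (a NumPy illustration, assuming &#x03B1; &#x003D; 0.01).</p>
<preformat preformat-type="code">
# Numerical sketch of Leaky ReLU (Eq. (1)), SiLU (Eq. (3)) and its derivative (Eq. (4)).
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x otherwise
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

def silu(x):
    # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_grad(x):
    e = np.exp(-x)
    return (1.0 + e + x * e) / (1.0 + e) ** 2

x = np.linspace(-6.0, 6.0, 13)
print(leaky_relu(x))
print(silu(x))
print(silu_grad(x))
</preformat>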
<p>The neck network resides between the backbone network and the prediction network and consists of the Feature Pyramid Network (FPN) and the Path Aggregation Network (PANet), as shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. FPN is the top-down propagation path [<xref ref-type="bibr" rid="ref-22">22</xref>]. Firstly, a 2-fold upsampling operation is performed on the small-sized feature map. Secondly, the result is spliced with the laterally connected feature map of the same size. The fused feature map then undergoes a <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolution operation to remove the aliasing effect caused by upsampling. This operation is repeated stage by stage, transmitting strong deep semantic information to the shallow layers. PANet is the bottom-up propagation path [<xref ref-type="bibr" rid="ref-23">23</xref>]. Firstly, the large-sized feature map is downsampled 2-fold; it is then spliced with the laterally connected feature map of the same size and convolved. Repeating this operation stage by stage transfers strong localization information from the shallow layers to the deep layers and further enhances the network&#x2019;s ability to extract and fuse features. This structure enhances the model&#x2019;s ability to detect targets at different scales, increases feature diversity, and improves overall robustness.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>FPN and PANet structure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-4.tif"/>
</fig>
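<p>The fusion pattern described above can be summarized in the short PyTorch-style sketch below, given only as an illustration; the channel counts and module names are assumptions and do not correspond exactly to the YOLOv5m configuration.</p>
<preformat preformat-type="code">
# Schematic sketch of one top-down (FPN) and one bottom-up (PANet) fusion step.
import torch
import torch.nn as nn

up      = nn.Upsample(scale_factor=2, mode="nearest")    # replaced by CARAFE in this paper
down    = nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1)
fuse_td = nn.Conv2d(256, 128, kernel_size=3, padding=1)  # removes the aliasing from upsampling
fuse_bu = nn.Conv2d(256, 128, kernel_size=3, padding=1)

p5 = torch.randn(1, 128, 20, 20)   # deep map: strong semantic information
p4 = torch.randn(1, 128, 40, 40)   # shallow map: strong localization information

# FPN: upsample the small map 2-fold, splice with the lateral map, then convolve
fpn_p4 = fuse_td(torch.cat([up(p5), p4], dim=1))          # 1 x 128 x 40 x 40

# PANet: downsample the large map 2-fold and splice it back into the deep path
pan_p5 = fuse_bu(torch.cat([down(fpn_p4), p5], dim=1))    # 1 x 128 x 20 x 20
</preformat>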
<p>As the output end, the prediction network is used to predict and regress the targets. YOLOv5m-6.0 contains three prediction layers with dimensions of (80 &#x002A; 80 &#x002A; 255), (40 &#x002A; 40 &#x002A; 255), and (20 &#x002A; 20 &#x002A; 255), which are used to detect small, medium, and large-scale targets, respectively. CIOU_LOSS is used as the loss function of the prediction box during training, as shown in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>. It considers three important geometric factors: the overlapping area, the center-point distance, and the aspect ratio. Compared with the old version, the speed and precision of prediction box regression are effectively improved.
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mtext>CIOU</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>LOSS&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>IOU</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext>distance</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mrow><mml:mtext>distance</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:msup><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow></mml:math></disp-formula>where the <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mstyle></mml:math></inline-formula> term takes into account the centroid spacing ratio factor, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> represents the square of the Euclidean distance between the predicted frame and the centroid of the labeled frame, and is the diagonal length of the smallest closed frame covering both frames.</p>
<p>The term <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>V</mml:mi></mml:math></inline-formula> in the equation uses the bounding box aspect ratio scale information, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is a positive balance parameter, and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>V</mml:mi></mml:math></inline-formula> is a measure of the consistency of the aspect ratio. It is calculated as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>IOU</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mtext>V</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>4</mml:mn><mml:msup><mml:mrow><mml:mi mathvariant="normal">&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mfrac><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>tanh</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:msup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>gt</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>gt</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>tanh</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p>The equation uses the intersection over union (IOU) metric to account for the overlapping area factor, which is calculated as follows:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mtext>IOU&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow><mml:mo>&#x2229;</mml:mo><mml:msup><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>gt</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow><mml:mo>&#x222A;</mml:mo><mml:msup><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>gt</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>In the equation, <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>B</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msup><mml:mi>B</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>w</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>h</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represent the position parameters contained in the predicted frame and the marked frame, respectively.</p>
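<p>To make the role of each term concrete, the following is a hedged PyTorch sketch of the CIOU loss in <xref ref-type="disp-formula" rid="eqn-5">Eqs. (5)</xref>&#x2013;<xref ref-type="disp-formula" rid="eqn-8">(8)</xref>, assuming boxes are given as (cx, cy, w, h) tensors; it is an illustration of the formulas above, not the YOLOv5 source code.</p>
<preformat preformat-type="code">
# Sketch of CIOU_LOSS for a single predicted box b and ground-truth box b_gt.
import math
import torch

def ciou_loss(b, b_gt, eps=1e-7):
    # convert (cx, cy, w, h) to corner coordinates
    b_x1, b_y1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    b_x2, b_y2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    g_x1, g_y1 = b_gt[0] - b_gt[2] / 2, b_gt[1] - b_gt[3] / 2
    g_x2, g_y2 = b_gt[0] + b_gt[2] / 2, b_gt[1] + b_gt[3] / 2

    # IOU, the overlapping-area term of Eq. (8)
    inter = (torch.min(b_x2, g_x2) - torch.max(b_x1, g_x1)).clamp(0) * \
            (torch.min(b_y2, g_y2) - torch.max(b_y1, g_y1)).clamp(0)
    union = b[2] * b[3] + b_gt[2] * b_gt[3] - inter + eps
    iou = inter / union

    # squared centre distance over squared diagonal of the smallest enclosing box
    rho2 = (b[0] - b_gt[0]) ** 2 + (b[1] - b_gt[1]) ** 2
    c2 = (torch.max(b_x2, g_x2) - torch.min(b_x1, g_x1)) ** 2 + \
         (torch.max(b_y2, g_y2) - torch.min(b_y1, g_y1)) ** 2 + eps

    # aspect-ratio consistency term V and balance weight alpha, Eqs. (6) and (7)
    v = (4 / math.pi ** 2) * (torch.atan(b_gt[2] / b_gt[3]) - torch.atan(b[2] / b[3])) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(torch.tensor([10.0, 10.0, 4.0, 6.0]), torch.tensor([11.0, 9.0, 5.0, 5.0])))
</preformat>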
<p>To filter the predicted frames, the algorithm uses a weighted Non-Maximum Suppression (NMS) approach which enables the detection of overlapping targets without requiring additional computational resources.</p>
<p>Based on the YOLOv5m-6.0 network architecture, it can be observed that the neck feature fusion network layer, due to the use of nearest-neighbor upsampling, neglects the semantic information of the extracted vehicle object features, leading to a decrease in the effectiveness of vehicle object detection. Therefore, it is necessary to replace the upsampling module to enhance the utilization of vehicle features and redesign an enhanced feature fusion network structure to improve the fusion capability for small-sized vehicle features. Furthermore, since the output only includes prediction layers of three scales, the feature extraction performance is poor when detecting vehicle objects with smaller proportions. Hence, it is crucial to design corresponding anchor boxes for small-sized vehicle objects to enhance their attention. Additionally, the overall network structure contains a large number of parameters and computational complexity, necessitating lightweight model compression operations.</p>
</sec>
<sec id="s4">
<label>4</label>
<title>Methodologies</title>
<p>This section primarily introduces components used to enhance the detection performance of small target vehicles, including the lightweight upsampling operator CARAFE, small target prediction layers, and the channel pruning compression process proposed in this paper. Additionally, a detailed diagram of the improved network model structure is provided.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Lightweight Upsampling Operator CARAFE</title>
<p>The YOLOv5m network utilizes the nearest neighbor interpolation algorithm to upsample feature maps, as illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. This algorithm assigns each output pixel the value of the nearest input pixel. <xref ref-type="disp-formula" rid="eqn-9">Eqs. (9)</xref> and <xref ref-type="disp-formula" rid="eqn-10">(10)</xref> define the calculations involved, where (<italic>srcX</italic>, <italic>srcY</italic>) represents the original image&#x2019;s pixel coordinates and (<italic>dstX</italic>, <italic>dstY</italic>) represents the sampled image&#x2019;s pixel coordinates. The terms <italic>srcWidth</italic>, <italic>srcHeight</italic>, <italic>dstWidth</italic>, and <italic>dstHeight</italic> signify the dimensions of the original and sampled images, and the function <italic>round(x)</italic> rounds x to the nearest integer using the principle of rounding half up. Consequently, the sampled pixel at (<italic>dstX</italic>, <italic>dstY</italic>) takes the value of the original image&#x2019;s pixel at the computed coordinates (<italic>srcX</italic>, <italic>srcY</italic>).</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Neighbor interpolation algorithm</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-5.tif"/>
</fig>
<p><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mtext>srcX&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#xA0;round</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>dstX</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>srcWidth</mml:mtext></mml:mrow><mml:mrow><mml:mtext>dstWidth</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mtext>srcY&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>&#xA0;round</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>dstY</mml:mtext></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>srcHeight</mml:mtext></mml:mrow><mml:mrow><mml:mtext>dstHeight</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Therefore, the nearest neighbor interpolation algorithm relies solely on the spatial proximity of pixels to establish the upsampling kernel and fails to leverage the abundant semantic information within the feature map. Moreover, its receptive field is very small, only <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> in size, so it underutilizes the surrounding information.</p>
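<p>A minimal sketch of the mapping in <xref ref-type="disp-formula" rid="eqn-9">Eqs. (9)</xref> and <xref ref-type="disp-formula" rid="eqn-10">(10)</xref> is given below (a NumPy illustration; round-half-up is implemented explicitly, and the computed indices are clipped so that they stay inside the source image).</p>
<preformat preformat-type="code">
# Nearest-neighbour upsampling following Eqs. (9) and (10).
import numpy as np

def nearest_upsample(src, dst_h, dst_w):
    src_h, src_w = src.shape[:2]
    out = np.empty((dst_h, dst_w) + src.shape[2:], dtype=src.dtype)
    for dst_y in range(dst_h):
        for dst_x in range(dst_w):
            # round half up, as in the text, then clip to the valid range
            src_x = min(int(dst_x * src_w / dst_w + 0.5), src_w - 1)
            src_y = min(int(dst_y * src_h / dst_h + 0.5), src_h - 1)
            out[dst_y, dst_x] = src[src_y, src_x]
    return out

print(nearest_upsample(np.arange(4).reshape(2, 2), 4, 4))
</preformat>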
<p>To improve the utilization of feature semantic information while controlling both the number of parameters and the computational complexity, this paper introduces a lightweight upsampling operator called Content-Aware Reassembly of Features (CARAFE) to improve the network. Assuming the input size is <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula> and the upsampling rate is <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula>, CARAFE generates a new feature map of size <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>&#x03C3;</mml:mi><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula>. The structure of the CARAFE network is shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>The overall framework of CARAFE</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-6.tif"/>
</fig>
<p>CARAFE consists of two key modules [<xref ref-type="bibr" rid="ref-24">24</xref>&#x2013;<xref ref-type="bibr" rid="ref-28">28</xref>]: the kernel prediction module and the content-aware reassembly module. It can utilize rich contextual information from lower levels to predict the reassembled kernels and reorganize features within the predetermined neighborhood. Assuming that the upsampling kernel size is <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the procedure is as follows:</p>
<p>(1) Kernel prediction module</p>
<p>The module generates reassembly kernels in a content-aware manner via three sub-modules: a channel compressor, a content encoder, and a kernel normalizer.</p>
<p>Firstly, to reduce the number of parameters and computation, the input feature map <italic>F</italic> is passed through a channel compressor composed of <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> convolutional layers so that the number of input feature channels is compressed from <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>C</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Secondly, the compressed feature map is fed to the content encoder, where a <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mn>3</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:math></inline-formula> convolution generates reassembly kernels of size <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x00D7;</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which are then rearranged by a pixel shuffle operation into size <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>&#x03C3;</mml:mi><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-29">29</xref>]. Finally, to ensure that the distribution of input features remains unchanged, each <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reassembly kernel is normalized along the channel dimension using the softmax function, ensuring that the weights of each reassembly kernel sum up to 1.</p>
<p>(2) Content-aware reassembly module</p>
<p>The module utilizes the generated reassembly kernels to reassemble the features and outputs a new feature map <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula> that is rich in semantic information.</p>
<p>Firstly, the (<italic>x</italic>, <italic>y</italic>) coordinates on the output feature map <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula> are correspondingly mapped to (<inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msup><mml:mi>y</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>) on the input feature map <italic>F</italic>. The mapping relationship is shown in <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>. Secondly, the reshaping operation is performed on the recombined kernel at the corresponding position to generate a perceptual field with the size of <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Then, the inner product is performed with the neighborhood centered at (<inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msup><mml:mi>x</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msup><mml:mi>y</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula>) on the <italic>F</italic>. It is worth noting that the same reshaping kernel is shared at the same position. Finally, the <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:math></inline-formula> with the size <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>&#x03C3;</mml:mi><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:math></inline-formula> is output.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x2032;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mo>&#x2032;</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>&#x2032;</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mo>&#x2032;</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> in the equation represents the floor function, which rounds down to the nearest integer. The amount of computation introduced by the lightweight upsampling operator CARAFE is shown in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>.
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mn>2</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mrow><mml:mo>(</mml:mo><mml:mn>81</mml:mn><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>m</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>up</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>up</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Adding Tiny Target Prediction Layers</title>
<p>The unimproved YOLOv5m network model struggles to capture feature information from small-scale vehicles, which impedes learning and leads to serious missed and false detections of small target vehicles in practice. There are three main reasons for this:</p>
<p>(1) The downsampling factor of the network is large, so small target vehicles occupy very few pixels in the deep feature maps.</p>
<p>(2) The receptive field of the network is large, so the features perceived for small objects contain a large amount of worthless surrounding information.</p>
<p>(3) The deep and shallow feature maps of the network are not well balanced between semantic and spatial information.</p>
<p>Therefore, to increase the network&#x2019;s attention to small target vehicles and improve detection performance, a tiny target prediction layer with a receptive field of <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>4</mml:mn></mml:math></inline-formula> is added alongside the original network&#x2019;s three prediction layers for large, medium, and small targets [<xref ref-type="bibr" rid="ref-30">30</xref>&#x2013;<xref ref-type="bibr" rid="ref-33">33</xref>], and the corresponding feature fusion network is redesigned. The prediction anchor box settings for each feature layer of the improved network are presented in <xref ref-type="table" rid="table-2">Table 2</xref>. <xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows the structure of the improved FPN &#x002B; PANet.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Detect feature map information</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Feature map</th>
<th>Receptive field</th>
<th>Anchor box</th>
</tr>
</thead>
<tbody>
<tr>
<td>160 &#x00D7; 160 &#x00D7; 64</td>
<td>4 &#x00D7; 4</td>
<td>[5,6,8,14,15,11]</td>
</tr>
<tr>
<td>80 &#x00D7; 80 &#x00D7; 128</td>
<td>8 &#x00D7; 8</td>
<td>[10,13,16,30,33,23]</td>
</tr>
<tr>
<td>40 &#x00D7; 40 &#x00D7; 256</td>
<td>16 &#x00D7; 16</td>
<td>[30,61,62,45,59,119]</td>
</tr>
<tr>
<td>20 &#x00D7; 20 &#x00D7; 512</td>
<td>32 &#x00D7; 32</td>
<td>[116,90,156,198,373,326]</td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Improved FPN &#x002B; PANet structure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-7.tif"/>
</fig>
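<p>For reference, the four-scale anchor configuration of <xref ref-type="table" rid="table-2">Table 2</xref> can be written as it might appear in a YOLOv5-style model definition; this is an illustrative sketch, and the exact configuration syntax depends on the implementation.</p>
<preformat preformat-type="code">
# Anchor boxes (width, height pairs) for the four prediction layers of Table 2.
anchors = [
    [5, 6, 8, 14, 15, 11],          # 160 x 160 map, 4 x 4 receptive field (tiny targets)
    [10, 13, 16, 30, 33, 23],       #  80 x  80 map, 8 x 8 receptive field
    [30, 61, 62, 45, 59, 119],      #  40 x  40 map, 16 x 16 receptive field
    [116, 90, 156, 198, 373, 326],  #  20 x  20 map, 32 x 32 receptive field
]
</preformat>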
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Channel Pruning Compression</title>
<p>This paper uses the improved YOLOv5m network as the input model and applies channel pruning and compression operations to reduce computational complexity [<xref ref-type="bibr" rid="ref-34">34</xref>&#x2013;<xref ref-type="bibr" rid="ref-37">37</xref>], improve generalization performance, and enhance the network&#x2019;s accuracy on low-resource devices. Specifically, this article integrates sparse regularization training to identify and prune low-performance channels, followed by fine-tuning to further recover accuracy. <xref ref-type="fig" rid="fig-8">Fig. 8</xref> illustrates the implementation flow of the channel pruning and compression methodology.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Channel pruning and compression flow chart</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-8.tif"/>
</fig>
<p>Step 1: Sparse Regularization Training</p>
<p>In modern neural networks, Batch Normalization (BN) layers are commonly placed after convolutional layers. A BN layer shifts and scales the convolutional outputs so that they are distributed within a reasonable range, which speeds up network training and convergence and improves generalization. Let <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> be the input and output of the BN layer, respectively, and let <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the mean and standard deviation of the input samples within a batch. The BN layer is then calculated as shown in <xref ref-type="disp-formula" rid="eqn-13">Eqs. (13)</xref> and <xref ref-type="disp-formula" rid="eqn-14">(14)</xref>.
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mrow><mml:mover><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x0B5;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:msqrt><mml:msubsup><mml:mrow><mml:mi mathvariant="normal">&#x03C3;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B5;</mml:mi></mml:mrow></mml:msqrt></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mrow><mml:mover><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow></mml:math></disp-formula></p>
<p>The BN layer uses learnable scaling factors <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> and translation parameters <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> to normalize the input values. To prevent division by zero, a small constant <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>&#x03B5;</mml:mi></mml:math></inline-formula> is added to the denominator. When a scaling factor <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> tends towards zero, the output of the corresponding convolutional channel becomes independent of its input; such a channel contributes little to model performance, so the magnitude of <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> serves as an indicator of channel importance for pruning. Therefore, <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> can be used to identify low-performance channels effectively.</p>
<p>Therefore, by applying sparse regularization to the scale factors <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of the BN layers and training them jointly with the network weights, the <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> values in the BN layers converge towards 0, yielding a sparse, pruning-friendly network. The loss function for sparse training is shown in <xref ref-type="disp-formula" rid="eqn-15">Eq. (15)</xref>:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:mrow><mml:mtext>L</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mtext>l</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>f</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>The above equation is the sum of two terms. The first term is the loss function of the original YOLOv5m algorithm, where (<italic>x</italic>, <italic>y</italic>) denotes a training input and its label value and <italic>W</italic> denotes the trainable network weights. The second term is the loss for sparse training of the scaling factors <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, where <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is the sparsity rate used to balance the two terms: the larger its value, the sparser the network becomes, but also the greater the impact on network accuracy. In the second term, <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the L1 regularization term on the scaling factors, which drives the scaling factors <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> towards 0 and thereby sparsifies the network. This term is computed as shown in <xref ref-type="disp-formula" rid="eqn-16">Eq. (16)</xref>, where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:math></inline-formula> denotes the set of all scaling factors in the network.
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:math></disp-formula></p>
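<p>For illustration, a hedged PyTorch sketch of the sparsity term in Eqs. (15) and (16) is given below: the L1 penalty is accumulated over the scaling factors of all BN layers (stored in BatchNorm2d.weight) and added to the original detection loss. The function name and the way the detection loss is obtained are assumptions, not the paper&#x2019;s code:</p>
<preformat preformat-type="code">
import torch.nn as nn

def bn_l1_penalty(model, lam):
    """lam * ||gamma||_1 summed over all BN scaling factors, Eqs. (15)-(16)."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # gamma is stored in BatchNorm2d.weight
    return lam * penalty

# Hypothetical usage inside one training step (detection_loss assumed given):
#   loss = detection_loss(model(x), y) + bn_l1_penalty(model, lam=0.0002)
#   loss.backward()
</preformat>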
<p>Step2: Pruning Operation</p>
<p>After the network has been trained with sparse regularization, the scaling factors <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of most channels tend towards 0, meaning that these channels contribute very little to network performance. Therefore, this paper sorts all scaling factors <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>, sets the pruning rate, adopts a specific pruning strategy, and finally prunes the inputs and outputs of all channels below the threshold, yielding a more lightweight compressed model.</p>
<p>The channel pruning schematic is shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>. The left image represents a sparse network, while the right image represents a compressed network after pruning. By performing channel pruning operations, the channels <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> in the left image, where the scaling factors tend towards 0, are eliminated. Finally, the remaining channels are reorganized to obtain a pruned and compact network.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Schematic diagram of channel pruning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-9.tif"/>
</fig>
<p>According to the characteristics of the small target vehicle detection network, a global threshold strategy is adopted for pruning; that is, whether a channel is pruned is decided by a global threshold <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mrow><mml:mover><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>. First, the scaling factors <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> of all channels in the sparse network are sorted; then, according to the predetermined pruning rate, a low scaling factor <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is selected as the global threshold <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mrow><mml:mover><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, and the channels below this threshold are eliminated. However, pruning controlled by a single threshold may remove all channels of a layer and thereby destroy the regular structure of the original network. Therefore, a local safety threshold <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> is also introduced: layer by layer, only channels below both the global threshold <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mrow><mml:mover><mml:mi>&#x03B3;</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and the local threshold <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> are eliminated, which prevents excessive pruning.</p>
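<p>As an illustrative sketch only (not the paper&#x2019;s released code), the global-plus-local thresholding described above can be written in PyTorch as follows. The function name and the particular choice of the local safety threshold (here, keeping at least a fixed fraction of each layer&#x2019;s channels) are assumptions:</p>
<preformat preformat-type="code">
import torch
import torch.nn as nn

def channel_prune_masks(model, prune_rate=0.65, local_keep_ratio=0.1):
    """Keep/remove masks per BN layer from a global gamma threshold and a per-layer safety threshold."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    # global threshold gamma_hat: the sorted gamma value at the position set by the pruning rate
    global_thr = torch.sort(gammas).values[int(len(gammas) * prune_rate)]
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            g = m.weight.detach().abs()
            # local safety threshold tau: the k-th largest gamma of this layer, so the top
            # local_keep_ratio fraction of the layer's channels is always kept
            k = max(1, int(len(g) * local_keep_ratio))
            tau = torch.topk(g, k).values[-1]
            # keep a channel if its gamma clears either threshold; prune only if below both
            masks[name] = (g >= global_thr) | (g >= tau)
    return masks
</preformat>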
<p>After pruning, the number of parameters in the network model is significantly reduced, resulting in a more compact model. However, when targeting fine-grained target detection tasks, increasing the pruning rate may lead to a slight decline in precision. To address this issue, it is necessary to fine-tune the pruned model and use the fine-tuned model as the final compressed model.</p>
<p>The pruning algorithm can be classified into two categories based on the pruning operation process: iterative and one-shot.</p>
<p>Iterative: Pruning is carried out layer by layer, and the network must be retrained and fine-tuned after each pruning step. Because this method requires multiple iterations, and its computational cost grows with the complexity of the network structure, it is not used here.</p>
<p>One-shot: After the scaling factors <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are sorted, the BN layers of the whole network are pruned simultaneously to remove redundant parameters, and the network is then retrained. This not only reduces the consumption of computational resources but also significantly improves the detection accuracy after retraining. Therefore, this paper adopts the one-shot approach to prune the model.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Improved Network Structure Diagram</title>
<p>This paper presents the improved model structure of the YOLOv5m-6.0 algorithm, as shown in <xref ref-type="fig" rid="fig-10">Fig. 10</xref>. While retaining the original backbone feature extraction network, this paper replaces the original upsampling method with CARAFE and adds a small target prediction layer. This paper also redesigns the enhanced feature fusion network. Finally, channel pruning is applied to the model structure, and the compressed model is used as the final improved model.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Improved YOLOv5m-6.0 network model structure diagram</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-10.tif"/>
</fig>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Experiments</title>
<p>This section aims to verify the effectiveness and rationality of the improved YOLOv5m algorithm for small target vehicle detection. To this end, it introduces the dataset used for model training, the experimental platform and environment, the relevant hyperparameters, and the evaluation metrics, and then analyzes the results and presents a visual comparison of the detection effects.</p>
<sec id="s5_1">
<label>5.1</label>
<title>Build the Dataset</title>
<p>To test the method&#x2019;s feasibility and effectiveness, this study uses the VisDrone2019-DET large public dataset to train and evaluate the model. The dataset, developed by the AISKYEYE team at Tianjin University&#x2019;s machine learning and data mining laboratory, is an open-source dataset for UAV high-altitude scenes that includes 10 categories of interest, such as cars, people, vans, and others. This experiment focuses on car detection, and filtering techniques were applied to extract a representative and diverse sample set of 8178 images for training and detection. Of these, approximately 90% were used to create the training set of 7275 images, while the remainder were allocated to the test set of 903 images. The dataset covers various traffic scenarios, including streets, highways, and intersections, and diverse challenging environmental backgrounds, such as strong light, low light, rainy weather, and foggy conditions, reflecting typical and relevant real-world conditions. <xref ref-type="fig" rid="fig-11">Fig. 11</xref> shows the distribution of sample label scales in the training set, where the width and height of the label frame are represented by the horizontal and vertical axes, respectively. The distribution of data points indicates that the dataset contains a large number of small-scale target vehicles that meet the requirements of the experimental training.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Schematic diagram of the label scale of the training set sample</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-11.tif"/>
</fig>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Experimental Environment</title>
<p>The experiments were conducted with PyTorch on Windows 10. A virtual environment was created using Anaconda Navigator, with Python 3.9, PyTorch 1.10, TensorFlow 2.7.0, and CUDA 11.2 installed. The hardware configuration included an Intel Xeon E5-2680 v4 CPU and an NVIDIA RTX 4000 GPU. Iterative training used the modified YOLOv5m network structure with an initial learning rate of 0.01, a batch size of 16, a pruning rate of 65%, a sparsity rate of 0.0002, and 100 epochs.</p>
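<p>For reference, the training hyperparameters reported above can be collected into a single configuration; the following minimal Python sketch is illustrative only (the dictionary keys are assumptions, the values are those used in this experiment):</p>
<preformat preformat-type="code">
# Hyperparameters from this section; how they are passed to the training
# script depends on the specific implementation.
train_config = {
    "initial_learning_rate": 0.01,
    "batch_size": 16,
    "epochs": 100,
    "sparsity_rate": 0.0002,   # lambda in Eq. (15)
    "prune_rate": 0.65,        # fraction of channels removed after sparse training
}
</preformat>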
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Model Performance Evaluation Metrics</title>
<p>To accurately evaluate the improved YOLOv5m for detecting small target vehicles on the VisDrone2019-DET dataset, this paper analyzes its lightweight performance and detection performance using various metrics. These metrics include the number of parameters, recall, the mean average precision at an IOU threshold of 0.5 (mAP@0.5), the mean average precision averaged over IOU thresholds from 0.5 to 0.95 in increments of 0.05 (mAP@0.5:0.95) [<xref ref-type="bibr" rid="ref-38">38</xref>], Giga Floating-point Operations (GFLOPS), Frames Per Second (FPS), and model size. Together, these metrics measure the model&#x2019;s performance from multiple aspects and perspectives. The calculation formulas are as follows:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mrow><mml:mtext>Recall&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FN</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mrow><mml:mtext>Precision&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>TP</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FP</mml:mtext></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mrow><mml:mtext>AP&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>R</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mtext>dR</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mrow><mml:mtext>mAP&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>n</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mrow><mml:mtext>AP</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mrow><mml:mtext>GFLOPS&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>9</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mtext>HW</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>in</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>K</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>out</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:mrow><mml:mtext>FPS&#xA0;</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mtext>t</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>In the above formulas, three binary classification parameters are used, defined in <xref ref-type="table" rid="table-3">Table 3</xref>, where 1 and 0 indicate that the result is a vehicle or not a vehicle, respectively. True Positives (TP) is the number of correctly detected vehicle samples, i.e., samples for which both the ground truth and the model prediction indicate a vehicle. False Positives (FP) is the number of falsely detected vehicle samples, i.e., samples for which the ground truth indicates no vehicle but the model predicts one. False Negatives (FN) is the number of missed vehicle samples, i.e., samples for which the ground truth indicates a vehicle but the model predicts none. Average Precision (AP) is the area under the precision-recall curve; AP is calculated for each class, and n is the total number of classes. H and W denote the height and width of the output feature map, K denotes the convolutional kernel size, and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the numbers of input and output channels, respectively. FPS is the number of images detected per second, and t is the time consumed for detecting a single image.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Definition of the binary classification parameters</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Parameter categories</th>
<th>True values</th>
<th>Predicted values</th>
</tr>
</thead>
<tbody>
<tr>
<td>TP</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>FP</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>FN</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
</table-wrap>
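<p>To make the metric definitions concrete, the following hedged Python sketch evaluates Eqs. (17)&#x2013;(22) from raw counts; the function names are assumptions, and the AP integral is approximated numerically from sampled precision-recall points:</p>
<preformat preformat-type="code">
import numpy as np

def recall(tp, fn):
    """Eq. (17): fraction of ground-truth vehicles that are detected."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Eq. (18): fraction of predicted vehicles that are correct."""
    return tp / (tp + fp)

def average_precision(precisions, recalls):
    """Eq. (19): area under the precision-recall curve (numerical approximation)."""
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

def mean_average_precision(ap_per_class):
    """Eq. (20): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

def conv_gflops(h, w, c_in, c_out, k):
    """Eq. (21): giga floating-point operations of one convolution with an HxW output map."""
    return 2e-9 * h * w * (c_in * k * k + 1) * c_out

def fps(t_per_image):
    """Eq. (22): images processed per second, from the per-image inference time in seconds."""
    return 1.0 / t_per_image
</preformat>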
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Experimental Results and Analysis</title>
<p>This section provides a detailed introduction to the experimental training results of the improved algorithm from the perspectives of lightweight performance and testing performance, and analyzes and verifies them. Finally, a comparison of the detection performance with the current mainstream vehicle detection algorithm models is conducted under the same experimental environment and parameters.</p>
<sec id="s5_4_1">
<label>5.4.1</label>
<title>Sparse Regularization Training</title>
<p>As the selection of the sparsity rate directly influences the level of network sparsity, which in turn affects the effectiveness of subsequent model compression and detection performance, this paper assesses the network sparsity level across various sparsity rates. The optimal value of the sparsity rate &#x03BB; is then chosen based on the application requirements. <xref ref-type="fig" rid="fig-12">Fig. 12</xref> illustrates the distribution of the scaling factor <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> in the BN layer of the improved network model after 100 rounds of iterative training using different sparsity rates <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> during sparse regularization training.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Schematic diagram of the variation of the scaling factor for different sparsity rates</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-12.tif"/>
</fig>
<p>When the sparsity rate <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is set to 0, indicating the absence of sparsity training, the scaling factors <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> in each layer follow a normal distribution. As the number of training rounds increases, the <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> values remain mostly unchanged and centered around 1.0. Consequently, the model cannot be compressed at this stage.</p>
<p>When the sparsity rate <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0001</mml:mn></mml:math></inline-formula>, as the network training progresses, the <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> values of each layer gradually approach 0, and the network gradually becomes sparse. After training, the <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> values of each layer are concentrated around 0.73. Consequently, the model can be compressed to a certain extent.</p>
<p>When the sparsity rate <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.0002</mml:mn></mml:math></inline-formula>, the speed of network sparsification increases, and finally, it is concentrated around 0.49. Consequently, the model can be significantly compressed.</p>
<p>When the sparsity rate <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow></mml:math></inline-formula> increases to 0.0003, the degree of network sparsity increases significantly. After training, the <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow></mml:math></inline-formula> values of each layer are concentrated around 0.25. Consequently, the model can be compressed to an extremely high degree.</p>
<p><xref ref-type="table" rid="table-4">Table 4</xref> presents the average detection precision of the sparse-trained models at various sparsity rates. The results indicate that for sparsity rates of 0, 0.0001, and 0.0002, the average detection precision remains relatively stable as the network sparsity increases. However, a noticeable decline in the average detection precision occurs when the sparsity rate reaches 0.0003.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>The average detection precision of the sparse-trained models under different sparsity rates</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>&#x03BB;</th>
<th>mAP@0.5 (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>76.88</td>
</tr>
<tr>
<td>0.0001</td>
<td>76.15</td>
</tr>
<tr>
<td>0.0002</td>
<td>76.22</td>
</tr>
<tr>
<td>0.0003</td>
<td>68.58</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Based on the aforementioned experiments, it was observed that a sparsity rate of <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.0001</mml:mn></mml:math></inline-formula> results in relatively low model sparsity and inadequate compression effect. Conversely, a sparsity rate of <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.0003</mml:mn></mml:math></inline-formula> leads to excessive model sparsity and significant precision loss. Thus, to ensure a balance between compression effect and detection performance, the optimal sparsity rate of <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mrow><mml:mi mathvariant="normal">&#x03BB;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>0.0002</mml:mn></mml:math></inline-formula> is selected.</p>
</sec>
<sec id="s5_4_2">
<label>5.4.2</label>
<title>Channel Pruning</title>
<p>After completing sparse regularization training, the selection of the channel pruning rate becomes crucial. If the pruning rate is too small, the lightweighting effect is limited; if it is too large, it can be destructive to the model. Hence, this paper pruned the model using various channel pruning rates and analyzed the average detection precision of the pruned models, as depicted in <xref ref-type="fig" rid="fig-13">Fig. 13</xref>. The horizontal axis represents the pruning rate, while the vertical axis represents the corresponding average detection precision. From the figure, it is evident that the model&#x2019;s average detection precision remains stable when the pruning rate is below 65%, but declines rapidly once the pruning rate exceeds 65%. To strike a balance between detection precision and the lightweight requirement, the experiment ultimately set the channel pruning rate to 65%.</p>
<fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>The impact of different pruning rates on model accuracy</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-13.tif"/>
</fig>
<p>The statistics of the number of output channels for each layer of the network before and after pruning are shown in <xref ref-type="fig" rid="fig-14">Fig. 14</xref>. The red bar chart represents the number of output channels in each layer of the network before channel pruning, totaling 19,600. The blue bar chart represents the number of output channels in each layer of the network after channel pruning, totaling 10,560. From the figure, it can be observed that a total of 9,040 redundant channels were pruned in all layers of the model, indicating a significant reduction in redundant channels and effective compression of the model.</p>
<fig id="fig-14">
<label>Figure 14</label>
<caption>
<title>Comparison of channel number before and after pruning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-14.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-15">Fig. 15</xref> compares, in a bar chart, the number of parameters, the computational cost, and the model size before and after pruning. The red bars represent the values before channel pruning, while the green bars represent the values after channel pruning. It can be observed that after pruning, the number of parameters in the model decreased to 35.26% of the original value, a reduction of 64.74% in redundant parameters. The GFLOPS also decreased by 72.18%, and the size of the pruned model was reduced by 63.25% compared to the model before pruning. This further confirms that the channel pruning method used in this study effectively compresses the YOLOv5m network model and saves network resources.</p>
<fig id="fig-15">
<label>Figure 15</label>
<caption>
<title>Lightweight comparison of models before and after pruning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-15.tif"/>
</fig>
</sec>
<sec id="s5_4_3">
<label>5.4.3</label>
<title>Comparative Analysis of Experiments</title>
<p>To validate the feasibility and robustness of the proposed improvement scheme, three groups of controlled experiments were conducted for analysis, as shown below:
<list list-type="bullet">
<list-item>
<p>This paper conducted ablation experiments comparing the original YOLOv5m network model (Model A) with Model B, Model C, and Model D. Model B replaces the upsampling operator with the lightweight CARAFE operator, Model C adds a small target prediction head on top of the original model, and Model D combines the improvements of both Model B and Model C.</p></list-item>
<list-item>
<p>The original network Model A is compared and analyzed with the final improved Model E.</p></list-item>
<list-item>
<p>This paper compared and analyzed the experimental results of Model E with those of other classic mainstream vehicle detection algorithms, in order to assess the effectiveness of Model E.</p></list-item>
</list></p>
<p><xref ref-type="table" rid="table-5">Table 5</xref> presents a comparison of lightweight metrics parameters for each model. Meanwhile, <xref ref-type="table" rid="table-6">Table 6</xref> includes the experimental results for each model&#x2019;s training, and a comparison of parameters for detecting performance indexes.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Lightweight data comparison</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Model</th>
<th>Parameters</th>
<th>GFLOPS</th>
<th>Model size/MB</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>YOLOv5m</td>
<td>20 871 318</td>
<td>48.2</td>
<td>40.2</td>
</tr>
<tr>
<td>B</td>
<td>YOLOv5m_CARAFE</td>
<td>21 112 302</td>
<td>49.5</td>
<td>40.7</td>
</tr>
<tr>
<td>C</td>
<td>YOLOv5m-Small</td>
<td>21 140 264</td>
<td>57.4</td>
<td>41.3</td>
</tr>
<tr>
<td>D</td>
<td>YOLOv5m_CARAFE-Small</td>
<td>21 437 564</td>
<td>59.3</td>
<td>41.9</td>
</tr>
<tr>
<td><bold>E</bold></td>
<td><bold>Ours</bold></td>
<td><bold>7 559 170</bold></td>
<td><bold>16.5</bold></td>
<td><bold>15.4</bold></td>
</tr>
<tr>
<td>F</td>
<td>YOLOv3-tiny</td>
<td>8 669 876</td>
<td>13.0</td>
<td>16.6</td>
</tr>
<tr>
<td>G</td>
<td>YOLOv5s</td>
<td>7 022 326</td>
<td>15.9</td>
<td>13.7</td>
</tr>
<tr>
<td>H</td>
<td>YOLOX_s</td>
<td>8 937 682</td>
<td>26.8</td>
<td>34.3</td>
</tr>
<tr>
<td>I</td>
<td>YOLOv8s</td>
<td>11 135 987</td>
<td>28.6</td>
<td>21.4</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Testing performance data comparison</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Model</th>
<th>mAP@0.5 (%)</th>
<th>Recall (%)</th>
<th>mAP@0.5:0.95 (%)</th>
<th>FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>YOLOv5m</td>
<td>75.29</td>
<td>69.15</td>
<td>47.22</td>
<td>28</td>
</tr>
<tr>
<td>B</td>
<td>YOLOv5m_CARAFE</td>
<td>76.28</td>
<td>69.82</td>
<td>47.5</td>
<td>27</td>
</tr>
<tr>
<td>C</td>
<td>YOLOv5m-Small</td>
<td>78.62</td>
<td>71.03</td>
<td>48.89</td>
<td>25</td>
</tr>
<tr>
<td>D</td>
<td>YOLOv5m_CARAFE-Small</td>
<td>79.43</td>
<td>71.32</td>
<td>49.90</td>
<td>24</td>
</tr>
<tr>
<td><bold>E</bold></td>
<td><bold>Ours</bold></td>
<td><bold>80.44</bold></td>
<td><bold>73.73</bold></td>
<td><bold>51.27</bold></td>
<td><bold>47</bold></td>
</tr>
<tr>
<td>F</td>
<td>YOLOv3-tiny</td>
<td>54.13</td>
<td>49.99</td>
<td>28.43</td>
<td>38</td>
</tr>
<tr>
<td>G</td>
<td>YOLOv5s</td>
<td>72.92</td>
<td>65.87</td>
<td>44.10</td>
<td>41</td>
</tr>
<tr>
<td>H</td>
<td>YOLOX_s</td>
<td>70.32</td>
<td>58.77</td>
<td>41.76</td>
<td>30</td>
</tr>
<tr>
<td>I</td>
<td>YOLOv8s</td>
<td>73.71</td>
<td>67.72</td>
<td>47.76</td>
<td>33</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-16">Fig. 16</xref> displays the curves of the network&#x2019;s average precision, loss function, and recall during training. The horizontal axis shows the number of iterations, while the vertical axis shows the corresponding values of mAP@0.5, the loss function, and the network recall rate. The green line represents the experimental result curve of the final improved model. From the figure, it can be observed that the improved algorithm achieves higher detection accuracy and recall than the other detection models and also converges faster, further validating the effectiveness of the improvement method.</p>
<fig id="fig-16">
<label>Figure 16</label>
<caption>
<title>Network average precision, loss function and recall variation curves</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-16.tif"/>
</fig>
<p>The experimental results of Model A and Model B in <xref ref-type="table" rid="table-5">Tables 5</xref> and <xref ref-type="table" rid="table-6">6</xref> show that replacing the original upsampling module with the lightweight CARAFE operator slightly increases the algorithm&#x2019;s parameter count and computational complexity. However, the average detection accuracy and recall rate improve by 0.99% and 0.67%, respectively. These results indicate that CARAFE improves the algorithm&#x2019;s detection performance without adding excessive parameters or computation.</p>

<p>According to the experimental results of Model A and Model C in <xref ref-type="table" rid="table-5">Tables 5</xref> and <xref ref-type="table" rid="table-6">6</xref>, after adding the small object detection head proposed in this paper, the computational complexity of the algorithm increases to some extent, but the average detection accuracy and recall rate improve by 3.33% and 1.88%, respectively. This indicates that the small object detection head plays a crucial role in improving small target vehicle detection, although it also leads to a more complex model.</p>

<p>The experimental results of Models A and D in <xref ref-type="table" rid="table-6">Table 6</xref> and <xref ref-type="fig" rid="fig-16">Fig. 16</xref> indicate that the lightweight upsampling operator CARAFE introduced in this paper, together with the additional small target prediction layer, accelerates the convergence of the network model and improves the average detection accuracy for small vehicle targets. Compared to the original model, the proposed approach increases mAP@0.5 by 4.14% and Recall by 2.17%. Furthermore, it alleviates the missed detection of small and occluded target vehicles and confirms the effectiveness of the proposed improvements. According to the lightweight comparison presented in <xref ref-type="table" rid="table-5">Table 5</xref>, although the proposed approach slightly increases the network&#x2019;s parameters and computational complexity, the resulting improvements in accuracy and recall justify this increase.</p>

<p>To achieve a lightweight model and reduce the number of parameters and the complexity, this article performed channel pruning on network Model D to obtain the final improved Model E. The comparative analysis of the lightweight data of Model A and Model E in <xref ref-type="table" rid="table-5">Table 5</xref> shows that the number of parameters is reduced from 20,871,318 in the original network Model A to 7,559,170 in the final improved Model E, a decrease of 63.8%. Additionally, the model size is reduced by 24.8 MB, and the computational cost drops by 31.7 GFLOPS, a decrease of 65.8%.</p>

<p>Based on the experimental results presented in <xref ref-type="table" rid="table-6">Table 6</xref> and <xref ref-type="fig" rid="fig-16">Fig. 16</xref>, it is evident that the improved Model E significantly enhanced the mAP@0.5 from 75.29% to 80.44%, with an increase of 5.15%, thereby improving the detection accuracy of small target vehicles. Moreover, the Recall increased by 4.58%, and the FPS increased from 28 to 47, thereby meeting the requirement of real-time detection speed. Overall, the analysis indicates that the enhanced Model E demonstrates superior detection precision when compared to the original model. The model has removed numerous redundant channels to reduce model complexity and improve performance in small vehicle detection. Additionally, the model has been optimized for real-time detection needs without compromising its lightweight design.</p>
<p>To conduct an in-depth analysis of the enhanced Model E&#x2019;s detection performance on compact vehicles, other mainstream detection algorithms in the YOLO series were selected for experimental comparison under the same parameters and experimental environment. Based on the experimental findings presented in <xref ref-type="table" rid="table-6">Table 6</xref> and <xref ref-type="fig" rid="fig-16">Fig. 16</xref>, it can be observed that the average detection accuracy of the improved algorithm increased by 7.52%, 10.12%, 26.31%, and 6.73% compared to YOLOv5s, YOLOX_s, YOLOv3-tiny, and YOLOv8s, respectively. Furthermore, the network recall rate improved by 7.86%, 14.96%, 23.74%, and 6.01%, respectively.</p>
<p><xref ref-type="fig" rid="fig-17">Fig. 17</xref> provides a comparative analysis of the detection performance of various algorithms on the VisDrone2019-DET dataset, including the improved algorithm before pruning and the current mainstream vehicle detection algorithms YOLOv5m, YOLOv3-tiny, YOLOv5s, YOLOX_s, and YOLOv8s. The x-axis represents the number of images detected per second, with higher values indicating faster detection, and the y-axis represents the average detection accuracy, with higher values indicating better accuracy. The ideal position is therefore the top right corner of the graph, where a model achieves high accuracy while processing images quickly. The red dot represents the final improved algorithm. Based on this comprehensive analysis, the improved algorithm outperforms the other mainstream detection algorithms in terms of both speed and accuracy, meeting the requirements of real-time and lightweight detection without compromising accuracy, and it is better suited to scenes with numerous small vehicle targets and substantial differences in scale. The analysis was conducted under the same parameters and experimental environment, making the results reliable and informative.</p>
<fig id="fig-17">
<label>Figure 17</label>
<caption>
<title>Comparative analysis of model detection performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-17.tif"/>
</fig>
</sec>
</sec>
<sec id="s5_5">
<label>5.5</label>
<title>Comparative Analysis of Detection Effects</title>
<p>Based on the comparative analysis of various evaluation indicators, model E was ultimately determined as the improved model in this paper. <xref ref-type="fig" rid="fig-18">Fig. 18</xref> displays the original YOLOv5m model&#x2019;s actual detection effect before improvement, while <xref ref-type="fig" rid="fig-19">Fig. 19</xref> shows the final improved model&#x2019;s actual detection effect. The improved model achieves higher detection confidence than the original YOLOv5m model. It successfully detects small and occluded target vehicles that were missed by the original model, further confirming the effectiveness of the proposed improvement solution.</p>
<fig id="fig-18">
<label>Figure 18</label>
<caption>
<title>The original YOLOv5m model detection effect</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-18.tif"/>
</fig><fig id="fig-19">
<label>Figure 19</label>
<caption>
<title>The effect of the improved model detection is shown</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_46068-fig-19.tif"/>
</fig>
<p>After analyzing the detection results, it was found that the improved model proposed in this paper can more accurately detect small-sized target vehicles at longer distances compared to the baseline model, while also improving the detection speed. However, there are still certain limitations. For example, there are still some instances of missed detections for smaller vehicle targets and dense vehicle targets.</p>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>To address the issues of deep learning technology in the field of vehicle target detection, this paper proposes a lightweight vehicle target detection algorithm based on YOLOv5m. By improving its structure and compressing the model, the average detection accuracy and recall rate were increased by 5.15% and 4.58%, respectively, compared to the original model, and the FPS reached 47. The experiments showed that this enhanced algorithm can accurately detect small and occluded objects in real-time, meeting the requirements of small vehicle detection. Additionally, by cutting redundant channels and parameters, the algorithm greatly compressed the network model, reducing the parameter and computational volume by 63.8% and 65.8%, respectively, and the model size by 24.8 MB. Overall, this algorithm can effectively detect small and dense vehicle targets, providing valuable insights for intelligent city construction. However, the algorithm also has certain limitations and room for improvement. For more complex scenarios, such as small target vehicles with low background contrast, the detection effect is not ideal, which will be a key research topic in the future.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to thank the anonymous reviewers and the editor for their valuable suggestions, which greatly contributed to the improved quality of this article.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This research was funded by the General Project of Key Research and Development Plan of Shaanxi Province (No. 2022NY-087).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: Yaoyao Du and Xiangkui Jiang; data collection: Yaoyao Du; analysis and interpretation of results: Yaoyao Du and Xiangkui Jiang; draft manuscript preparation: Yaoyao Du. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are openly available at <ext-link ext-link-type="uri" xlink:href="https://github.com/VisDrone/VisDrone-Dataset">https://github.com/VisDrone/VisDrone-Dataset</ext-link>.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Alshraideh</surname></string-name>, <string-name><given-names>B. A.</given-names> <surname>Mahafzah</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Al-Sharaeh</surname></string-name> and <string-name><given-names>Z. M.</given-names> <surname>Hawamdeh</surname></string-name></person-group>, &#x201C;<article-title>A robotic intelligent wheelchair system based on obstacle avoidance and navigation functions</article-title>,&#x201D; <source>Journal of Experimental &#x0026; Theoretical Artificial Intelligence</source>, vol. <volume>27</volume>, no. <issue>4</issue>, pp. <fpage>471</fpage>&#x2013;<lpage>482</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Real-time multi-class disturbance detection for &#x03A6;-OTDR based on YOLO algorithm</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>22</volume>, no. <issue>5</issue>, pp. <fpage>1994</fpage>, <year>2022</year>; <pub-id pub-id-type="pmid">35271143</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Ju</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Niu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Jin</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>SuperDet: An efficient single-shot network for vehicle detection in remote sensing images</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>12</volume>, no. <issue>6</issue>, pp. <fpage>1312</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Bouguettaya</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zarzour</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kechida</surname></string-name> and <string-name><given-names>A. M.</given-names> <surname>Taberkit</surname></string-name></person-group>, &#x201C;<article-title>Vehicle detection from UAV imagery with deep learning: A review</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks and Learning Systems</source>, vol. <volume>33</volume>, no. <issue>11</issue>, pp. <fpage>6047</fpage>&#x2013;<lpage>6067</lpage>, <year>2022</year>; <pub-id pub-id-type="pmid">34029200</pub-id></mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Tong</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wu</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Recent advances in small object detection based on deep learning: A review</article-title>,&#x201D; <source>Image and Vision Computing</source>, vol. <volume>97</volume>, pp. <fpage>103910</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Bo</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Fast vehicle logo detection in complex scenes</article-title>,&#x201D; <source>Optics &#x0026; Laser Technology</source>, vol. <volume>110</volume>, pp. <fpage>196</fpage>&#x2013;<lpage>201</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K. R.</given-names> <surname>Akshatha</surname></string-name>, <string-name><given-names>A. K.</given-names> <surname>Karunakar</surname></string-name>, <string-name><given-names>S. B.</given-names> <surname>Shenoy</surname></string-name>, <string-name><given-names>A. K.</given-names> <surname>Pai</surname></string-name>, <string-name><given-names>N. H.</given-names> <surname>Nagaraj</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Human detection in aerial thermal images using faster R-CNN and SSD algorithms</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>11</volume>, no. <issue>7</issue>, pp. <fpage>1151</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Xuan</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Feng</surname></string-name></person-group>, &#x201C;<article-title>Automated defect analysis system for industrial computerized tomography images of solid rocket motor grains based on YOLO-V4 model</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>11</volume>, no. <issue>19</issue>, pp. <fpage>3215</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H. C.</given-names> <surname>Nguyen</surname></string-name>, <string-name><given-names>T. H.</given-names> <surname>Nguyen</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Scherer</surname></string-name> and <string-name><given-names>V. H.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>YOLO series for human hand action detection and classification from egocentric videos</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>23</volume>, no. <issue>6</issue>, pp. <fpage>3255</fpage>, <year>2023</year>; <pub-id pub-id-type="pmid">36991971</pub-id></mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Viswanatha</surname></string-name>, <string-name><given-names>R. K.</given-names> <surname>Chandana</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Ramachandra</surname></string-name></person-group>, &#x201C;<article-title>Real time object detection system with YOLO and CNN models: A review</article-title>,&#x201D; [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2208.00773">https://arxiv.org/abs/2208.00773</ext-link> <comment>(accessed on 25/07/2023)</comment></mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. J.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>Q. H.</given-names> <surname>Ling</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Han</surname></string-name></person-group>, &#x201C;<article-title>An improved algorithm for small object detection based on YOLO v4 and multi-scale contextual information</article-title>,&#x201D; <source>Computers and Electrical Engineering</source>, vol. <volume>105</volume>, pp. <fpage>108490</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>W. J.</given-names> <surname>Liow</surname></string-name>, <string-name><given-names>S. F.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>J. F.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>P. C.</given-names> <surname>Chung</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Improved vehicle detection systems with double-layer LSTM modules</article-title>,&#x201D; <source>EURASIP Journal on Advances in Signal Processing</source>, vol. <volume>2022</volume>, no. <issue>1</issue>, pp. <fpage>7</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. G.</given-names> <surname>Stuparu</surname></string-name>, <string-name><given-names>R. I.</given-names> <surname>Ciobanu</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Dobre</surname></string-name></person-group>, &#x201C;<article-title>Vehicle detection in overhead satellite images using a one-stage object detection model</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>20</volume>, no. <issue>22</issue>, pp. <fpage>6485</fpage>, <year>2020</year>; <pub-id pub-id-type="pmid">33202875</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name></person-group>, &#x201C;<article-title>Self-attention guidance and multiscale feature fusion-based UAV image object detection</article-title>,&#x201D; <source>IEEE Geoscience and Remote Sensing Letters</source>, vol. <volume>20</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jin</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>SEDG-Yolov5: A lightweight traffic sign detection model based on knowledge distillation</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>12</volume>, no. <issue>2</issue>, pp. <fpage>305</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q. C.</given-names> <surname>Mao</surname></string-name>, <string-name><given-names>H. M.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>L. Q.</given-names> <surname>Zuo</surname></string-name> and <string-name><given-names>R. S.</given-names> <surname>Jia</surname></string-name></person-group>, &#x201C;<article-title>Finding every car: A traffic surveillance multi-scale vehicle object detection method</article-title>,&#x201D; <source>Applied Intelligence</source>, vol. <volume>50</volume>, no. <issue>10</issue>, pp. <fpage>3125</fpage>&#x2013;<lpage>3136</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Shi</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Xie</surname></string-name></person-group>, &#x201C;<article-title>A real-time UAV target detection algorithm based on edge computing</article-title>,&#x201D; <source>Drones</source>, vol. <volume>7</volume>, no. <issue>2</issue>, pp. <fpage>95</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L. C.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X. Y.</given-names> <surname>Jia</surname></string-name>, <string-name><given-names>D. N.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>Z. D.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>H. M.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Lightweight vehicle object detection network for unmanned aerial vehicles aerial images</article-title>,&#x201D; <source>Journal of Electronic Imaging</source>, vol. <volume>32</volume>, no. <issue>1</issue>, pp. <fpage>013014</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Tian</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Nie</surname></string-name></person-group>, &#x201C;<article-title>KCFS-YOLOv5: A high-precision detection method for object detection in aerial remote sensing images</article-title>,&#x201D; <source>Applied Sciences</source>, vol. <volume>13</volume>, no. <issue>1</issue>, pp. <fpage>649</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wan</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Lyu</surname></string-name></person-group>, &#x201C;<article-title>An improved YOLOv5 method for small object detection in UAV capture scenes</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>11</volume>, pp. <fpage>14365</fpage>&#x2013;<lpage>14374</lpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Cardellicchio</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Solimani</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Dimauro</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Petrozza</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Summerer</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Detection of tomato plant phenotyping traits using YOLOv5-based single stage detectors</article-title>,&#x201D; <source>Computers and Electronics in Agriculture</source>, vol. <volume>207</volume>, pp. <fpage>107757</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Multi-scale residual aggregation feature pyramid network for object detection</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>12</volume>, no. <issue>1</issue>, pp. <fpage>93</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wu</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Liao</surname></string-name></person-group>, &#x201C;<article-title>Traffic sign detection based on SSD combined with receptive field module and path aggregation network</article-title>,&#x201D; <source>Computational Intelligence and Neuroscience</source>, vol. <volume>2022</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>13</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>C. C.</given-names> <surname>Loy</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>CARAFE: Content-aware reassembly of features</article-title>,&#x201D; in <conf-name>IEEE/CVF Int. Conf. on Computer Vision</conf-name>, <publisher-loc>Seoul, Korea</publisher-loc>, pp. <fpage>3007</fpage>&#x2013;<lpage>3016</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hou</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Deng</surname></string-name></person-group>, &#x201C;<article-title>U-YOLOv7: A network for underwater organism detection</article-title>,&#x201D; <source>Ecological Informatics</source>, vol. <volume>75</volume>, pp. <fpage>102108</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Mou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lei</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>YOLO-FR: A YOLOv5 infrared small target detection algorithm based on feature reassembly sampling method</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>23</volume>, no. <issue>5</issue>, pp. <fpage>2710</fpage>, <year>2023</year>; <pub-id pub-id-type="pmid">36904912</pub-id></mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Mi</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>A lightweight object detection algorithm for remote sensing images based on attention mechanism and YOLOv5s</article-title>,&#x201D; <source>Remote Sensing</source>, vol. <volume>15</volume>, no. <issue>9</issue>, pp. <fpage>2429</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Fast detection of cannibalism behavior of juvenile fish based on deep learning</article-title>,&#x201D; <source>Computers and Electronics in Agriculture</source>, vol. <volume>198</volume>, pp. <fpage>107033</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Caballero</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Husz&#x00E1;r</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Totz</surname></string-name>, <string-name><given-names>A. P.</given-names> <surname>Aitken</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network</article-title>,&#x201D; [Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1609.05158">http://arxiv.org/abs/1609.05158</ext-link> <comment>(accessed on 25/07/2023)</comment></mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Tian</surname></string-name></person-group>, &#x201C;<article-title>Long-distance person detection based on YOLOv7</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>12</volume>, no. <issue>6</issue>, pp. <fpage>1502</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Shang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhu</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Man</surname></string-name></person-group>, &#x201C;<article-title>KPE-YOLOv5: An improved small target detection algorithm based on YOLOv5</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>12</volume>, no. <issue>4</issue>, pp. <fpage>817</fpage>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Wen</surname></string-name></person-group>, &#x201C;<article-title>SOD-YOLO: A small target defect detection algorithm for wind turbine blades based on improved YOLOv5</article-title>,&#x201D; <source>Advanced Theory and Simulations</source>, vol. <volume>5</volume>, no. <issue>7</issue>, pp. <fpage>2100631</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Xiao</surname></string-name></person-group>, &#x201C;<article-title>A novel algorithm for small object detection based on YOLOv4</article-title>,&#x201D; <source>PeerJ Computer Science</source>, vol. <volume>9</volume>, pp. <fpage>e1314</fpage>, <year>2023</year>; <pub-id pub-id-type="pmid">37346537</pub-id></mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Cheng</surname></string-name></person-group>, &#x201C;<article-title>A tiny model for fast and precise ship detection via feature channel pruning</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>22</volume>, no. <issue>23</issue>, pp. <fpage>9331</fpage>, <year>2022</year>; <pub-id pub-id-type="pmid">36502044</pub-id></mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Channel pruning based on convolutional neural network sensitivity</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>507</volume>, pp. <fpage>97</fpage>&#x2013;<lpage>106</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Ouyang</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Real-time small drones detection based on pruned YOLOv4</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>21</volume>, no. <issue>10</issue>, pp. <fpage>3374</fpage>, <year>2021</year>; <pub-id pub-id-type="pmid">34066267</pub-id></mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>D.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning</article-title>,&#x201D; <source>Biosystems Engineering</source>, vol. <volume>210</volume>, pp. <fpage>271</fpage>&#x2013;<lpage>281</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Padilla Carrasco</surname></string-name>, <string-name><given-names>H. A.</given-names> <surname>Rashwan</surname></string-name>, <string-name><given-names>M. &#x00C1;.</given-names> <surname>Garc&#x00ED;a</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Puig</surname></string-name></person-group>, &#x201C;<article-title>T-YOLO: Tiny vehicle detection based on YOLO and multi-scale convolutional neural networks</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>11</volume>, pp. <fpage>22430</fpage>&#x2013;<lpage>22440</lpage>, <year>2023</year>.</mixed-citation></ref>
</ref-list>
</back></article>