<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">78756</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2026.078756</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Large-Scale Dataset for Real-Time Vehicle Detection in Vietnamese Urban Traffic Scenes</article-title>
<alt-title alt-title-type="left-running-head">A Large-Scale Dataset for Real-Time Vehicle Detection in Vietnamese Urban Traffic Scenes</alt-title>
<alt-title alt-title-type="right-running-head">A Large-Scale Dataset for Real-Time Vehicle Detection in Vietnamese Urban Traffic Scenes</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Vo</surname><given-names>Quang Dong Nguyen</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Nguyen</surname><given-names>Gia Nhu</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Tran</surname><given-names>Hoang Vu</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>thvu@ute.udn.vn</email></contrib>
<aff id="aff-1"><label>1</label><institution>School of Computer Science and Artificial Intelligence, Duy Tan University</institution>, <addr-line>Danang</addr-line>, <country>Vietnam</country></aff>
<aff id="aff-2"><label>2</label><institution>The University of Danang-University of Technology and Education</institution>, <addr-line>Danang</addr-line>, <country>Vietnam</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Hoang Vu Tran. Email: <email>thvu@ute.udn.vn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>8</day><month>5</month><year>2026</year>
</pub-date>
<volume>88</volume>
<issue>1</issue>
<elocation-id>23</elocation-id>
<history>
<date date-type="received">
<day>07</day>
<month>01</month>
<year>2026</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>02</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors. Published by Tech Science Press.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>The Authors</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_78756.pdf"></self-uri>
<abstract>
<p>Reliable vehicle detection in urban traffic environments remains challenging, particularly for fixed-view CCTV systems deployed in Southeast Asian cities, where heterogeneous traffic composition, high traffic density, frequent occlusions, and complex visual conditions are prevalent. The absence of large-scale datasets tailored to such mixed-traffic environments poses a significant limitation to the performance and generalization capability of existing object detection models. To address this gap, this paper presents a large-scale traffic image dataset for real-time vehicle detection in Vietnamese urban environments. The proposed dataset comprises 23,364 images collected from fixed-view CCTV traffic cameras deployed across Da Nang City, a representative urban area exhibiting mixed-traffic patterns commonly observed in Southeast Asian cities. The data cover diverse temporal periods, weather conditions, and traffic density levels encountered in real-world traffic monitoring scenarios. To comprehensively characterize these conditions, over 1.1 million instances are annotated across multiple traffic-related categories, including pedestrians, bicycles, motorbikes, cars, buses, trucks, and traffic lights with explicit signal-state labels. Such fine-grained, multi-class annotations support not only object-level detection but also higher-level traffic scene analysis relevant to intelligent transportation system (ITS) applications, such as traffic flow analysis and signal control. To balance annotation accuracy and scalability, a semi-automatic labeling pipeline is employed. Initial object annotations are generated using a pretrained YOLOv11m model and subsequently refined through systematic manual verification using the CVAT platform. 
Two configurations of the same YOLOv11m architecture are compared under an identical experimental protocol: a pretrained baseline and a version fine-tuned on the proposed dataset with domain-specific data augmentation and optimized hyperparameter settings tailored to fixed-view CCTV conditions. Under this evaluation setting, the pretrained YOLOv11m achieves a mean Average Precision (mAP) of 0.409, whereas fine-tuning on the proposed dataset improves the mAP to 0.788. These results underscore the necessity of localized, context-aware datasets such as the one presented in this work for robust real-time traffic perception in Vietnam and similar Southeast Asian urban contexts.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Deep learning</kwd>
<kwd>ITS</kwd>
<kwd>real-time vehicle detection</kwd>
<kwd>Vietnamese traffic dataset</kwd>
<kwd>traffic detection</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>The University of Danang&#x2014;University of Technology and Education, and the School of Computer Science, Duy Tan University, Da Nang City, Vietnam</funding-source>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>In recent decades, rapid urbanization has significantly increased the complexity of urban traffic systems. This growth has led to dynamic interactions among diverse vehicles and pedestrians within mixed-traffic environments commonly observed in many developing countries. Such complexity arises from the coexistence of cars, buses, trucks, motorcycles, bicycles, and pedestrians sharing limited road infrastructure. These environments are typically associated with weak lane discipline and heterogeneous driving behaviors, which further complicate traffic dynamics.</p>
<p>Traffic complexity is further exacerbated by external factors such as time of day, weather conditions, and varying traffic densities, making real-time traffic monitoring and management a persistent challenge. In Vietnam, motorbikes dominate urban transportation and contribute to highly unstructured traffic scenes, highlighting the need for accurate and scalable traffic perception systems to enhance road safety, traffic efficiency, and smart mobility initiatives. However, conventional surveillance-based approaches often lack the adaptability and contextual awareness required to cope with such complex traffic dynamics [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>]. These characteristics pose significant challenges for vision-based traffic perception systems, particularly those relying on fixed-view CCTV surveillance.</p>
<p>Recent advances in artificial intelligence and computer vision have enabled reliable real-time object detection in complex traffic environments. Deep learning-based detection frameworks, such as YOLO and Faster R-CNN, have demonstrated strong performance in recognizing and classifying traffic participants with high accuracy and computational efficiency. Nevertheless, their effectiveness strongly depends on the availability of high-quality, region-specific datasets that reflect local traffic characteristics [<xref ref-type="bibr" rid="ref-6">6</xref>&#x2013;<xref ref-type="bibr" rid="ref-9">9</xref>]. Most publicly available traffic datasets are collected in Western cities or structured traffic environments. As a result, they do not adequately capture the heterogeneous traffic composition, dense motorbike flows, weak lane discipline, and persistent occlusions typical of fixed-view CCTV surveillance in Southeast Asian cities.</p>
<p>To address this limitation, this paper introduces a large-scale dataset designed for real-time vehicle detection in Vietnamese urban traffic environments. The proposed dataset consists of 23,364 images captured from fixed-view CCTV traffic cameras installed at major intersections in Da Nang City, Vietnam. This urban setting is representative of mixed traffic flow with high motorbike density. The dataset spans diverse temporal periods, weather conditions, and traffic density levels commonly encountered in real-world deployments. It provides detailed annotations for multiple traffic participants, as well as traffic lights with explicit state labels. To balance annotation quality and scalability, a semi-automatic labeling pipeline is adopted. This pipeline combines initial detections generated by a pretrained YOLOv11m model with systematic manual verification and refinement.</p>
<p>This work contributes a large-scale, fine-grained, and publicly available dataset tailored to Vietnamese urban traffic scenes. The dataset serves as a valuable resource for advancing research in intelligent traffic perception and smart city applications within mixed-traffic environments. Experimental evaluations show that models fine-tuned on the proposed dataset achieve improved detection performance in dense and visually complex traffic scenes. These findings highlight the limitations of general-purpose datasets and emphasize the importance of localized data for domain-specific adaptation. Although the dataset is collected in Vietnam, the traffic characteristics it captures, such as heterogeneous vehicle composition, high motorbike density, mixed traffic flow, and fixed-view CCTV surveillance, are widely observed across many Southeast Asian cities. Similar characteristics have been reported in studies conducted in Thailand, Indonesia, and Malaysia. Recent vision-based traffic research further supports this regional similarity; for example, Xu and Liu [<xref ref-type="bibr" rid="ref-10">10</xref>] demonstrate the effectiveness of fixed-camera, vision-based deep learning frameworks under heterogeneous vehicle structures and complex urban traffic conditions. While cross-country generalization is not explicitly evaluated in this work, the shared traffic characteristics suggest that the proposed dataset provides a solid foundation for intelligent transportation system (ITS) applications, including traffic surveillance, congestion analysis, and smart city development in comparable urban contexts. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> illustrates the overall technical roadmap of the proposed framework, covering the complete workflow from real-world CCTV data acquisition to dataset construction, model evaluation, and smart traffic applications. 
This framework highlights the practical and system-oriented nature of the proposed dataset for intelligent transportation research.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overall technical roadmap of the proposed framework, from CCTV data acquisition and semi-automatic annotation to dataset construction, model evaluation, and smart traffic applications.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-1.tif"/>
</fig>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Over the past decade, the object detection task has witnessed significant advancements, with numerous studies [<xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;<xref ref-type="bibr" rid="ref-13">13</xref>] leveraging deep learning-based approaches on widely adopted benchmark datasets, including UA-DETRAC and Microsoft COCO. These datasets have played a pivotal role in advancing object recognition and traffic scene understanding. However, their applicability to Southeast Asian urban environments, particularly Vietnam, remains limited due to substantial differences in traffic composition, infrastructure heterogeneity, and environmental conditions. This mismatch underscores the need for large-scale, context-specific datasets to support reliable real-time vehicle detection and ITS in the region.</p>
<p>The Common Objects in Context (COCO) dataset, introduced by Lin et al. [<xref ref-type="bibr" rid="ref-14">14</xref>], provides extensive annotations across diverse object categories, including vehicles and pedestrians, making it a foundational resource for training generic urban perception models. Nevertheless, COCO predominantly reflects Western traffic scenarios and does not adequately capture the characteristics of Vietnamese mixed traffic. Moreover, although traffic signals are included, the absence of explicit traffic light state annotations limits its utility for intersection-level traffic analysis and control in Vietnam.</p>
<p>UA-DETRAC, proposed by Wen et al. [<xref ref-type="bibr" rid="ref-15">15</xref>], offers high-quality annotations with over 1.2 million manually labeled bounding boxes and covers diverse environmental and illumination conditions, including rainy and nighttime scenes. It supports both dense and sparse traffic scenarios and provides multi-object tracking benchmarks, making it partially relevant for urban traffic analysis. However, UA-DETRAC exhibits critical limitations in Vietnamese contexts, most notably the absence of motorbike, pedestrian, and bicycle annotations, as well as the lack of traffic light state labeling. In addition, its data are collected under more structured traffic regulations, reducing its representativeness of Vietnam&#x2019;s mixed and informal traffic flow.</p>
<p>VisDrone, developed by Zhu et al. [<xref ref-type="bibr" rid="ref-16">16</xref>], is a large-scale aerial dataset designed for traffic monitoring and surveillance, featuring detailed annotations under varied weather and lighting conditions. While effective for high-level traffic analysis, its top-down drone perspective differs substantially from fixed, street-level CCTV viewpoints commonly deployed in urban traffic monitoring. Furthermore, the lack of motorbike representation and traffic light annotations limits its applicability to Vietnamese urban traffic scenarios.</p>
<p>Trinh et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] introduced the UIT-VinaDeveS22 dataset, which comprises 1364 CCTV images collected in Vietnam and captures diverse traffic conditions across different times of day and weather scenarios. Its local origin makes it more representative of Vietnamese traffic patterns than existing international benchmarks. However, its relatively small scale, limited number of video sources, potential data leakage due to video-level overlap, and the absence of traffic light annotations constrain its generalization capability and suitability for intersection-level analysis.</p>
<p>Recent object detection frameworks, including YOLOv7, YOLOv11, and Faster R-CNN, have demonstrated strong performance on standard benchmarks [<xref ref-type="bibr" rid="ref-18">18</xref>&#x2013;<xref ref-type="bibr" rid="ref-20">20</xref>]. Nevertheless, their effectiveness in Vietnamese urban environments remains underexplored, largely due to the lack of large-scale, domain-representative datasets that reflect local traffic characteristics.</p>
<p>Beyond dataset construction, prior studies have explored methodological strategies to improve robustness in challenging traffic conditions, such as image-to-image translation for nighttime enhancement and occlusion-aware modeling based on keypoints and spatio-temporal reasoning. Notably, Xu et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] proposed a monocular framework integrating object detection and keypoint estimation to handle severe occlusion. Despite their effectiveness, such approaches remain highly dependent on representative, domain-specific datasets for training and validation, particularly in heterogeneous traffic environments.</p>
<p>In contrast to existing resources, this work introduces a large-scale CCTV-based dataset specifically tailored to Vietnamese urban traffic scenes. The proposed dataset provides fine-grained annotations for multiple traffic participants and explicit traffic light states, capturing mixed traffic flow, high motorbike density, and diverse environmental conditions. As such, it offers a solid foundation for robust real-time vehicle detection and the development of ITS applications in complex urban settings.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Dataset Construction and Annotation</title>
<sec id="s3_1">
<label>3.1</label>
<title>Dataset Collection and Camera Configuration</title>
<p>The proposed dataset was collected from fixed-view CCTV traffic cameras deployed at major intersections in Da Nang City, Vietnam. Each intersection is monitored by four cameras corresponding to the incoming traffic directions. The cameras are installed at heights ranging from 6 to 10 m, with downward viewing angles between 30&#x00B0; and 45&#x00B0;, enabling comprehensive coverage of vehicle movements and interactions within the intersection area. In total, 23,364 images were extracted from continuous video streams, capturing diverse temporal, environmental, and traffic conditions, as illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Representative sample frames from the Vietnamese urban traffic dataset.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-2.tif"/>
</fig>
<p>Video frames were extracted from the continuous CCTV streams using a density-dependent temporal sampling strategy to reduce redundancy while preserving traffic dynamics. Specifically, shorter sampling intervals were applied during peak traffic periods, whereas longer intervals were used under low-density conditions. All extracted frames were resized to a standardized resolution of 640 &#x00D7; 640 pixels to ensure compatibility with the input requirements of the YOLOv11m detection framework.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Spatial and Temporal Coverage</title>
<p>To capture realistic variations in urban traffic conditions, the dataset was categorized along multiple spatial&#x2013;temporal dimensions.</p>
<p>Time Variability: The dataset is divided into morning, afternoon, and evening periods to capture temporal variations in traffic flow and illumination conditions in urban environments.</p>
<p>Weather Scenarios: Traffic scenes are further categorized into fine, sunny, and rainy conditions, reflecting common weather variations that affect visibility, reflections, and motion blur.</p>
<p>Traffic Density: Scenes are classified as sparse or dense based on vehicle counts per scene, enabling evaluation of detection performance under low- and high-congestion conditions with varying degrees of occlusion.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Annotation Categories and Dataset Partitioning</title>
<p>All samples were annotated with bounding boxes and class labels covering seven traffic categories: pedestrian, bicycle, motorbike, car, bus, truck, and traffic light. Traffic lights were additionally annotated by signal state to support traffic signal analysis within ITS. As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, traffic-light annotations include active states (green, yellow, and red) and the corresponding permitted movement directions, which are critical for intersection-level traffic analysis. For multi-head traffic signals, each clearly identifiable signal head was treated as an independent instance, while arrow-based signals were labeled according to the illuminated direction and inactive arrows were ignored.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Representative examples of traffic-light annotations in the proposed dataset. (<bold>a</bold>) Green signal with the remaining movement time and the corresponding permitted movement direction. (<bold>b</bold>) Yellow signal. (<bold>c</bold>) Red signal with a countdown timer indicating the remaining waiting time.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-3.tif"/>
</fig>
<p>In scenes containing multiple traffic signal groups, annotations were limited to signal heads relevant to the primary traffic flow, as determined by lane orientation and intersection geometry. To ensure annotation reliability, traffic-light instances were labeled only when both signal state and movement direction were unambiguous under fixed-camera viewpoints; instances affected by severe occlusion, motion blur, glare, or adverse weather conditions were excluded. Although explicit visibility-level annotations are not included in the current release, this design choice prioritizes label accuracy and consistency, with potential extensions considered in future work.</p>
<p>The dataset was partitioned into training (70%), validation (10%), and test (20%) subsets using a video-level split strategy to prevent temporal leakage. Dataset statistics are summarized in <xref ref-type="table" rid="table-1">Table 1</xref> and illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Distribution of dataset samples across training, validation, and test sets.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Subset</th>
<th>Sample Count</th>
</tr>
</thead>
<tbody>
<tr>
<td>Training Set (70%)</td>
<td>16,353</td>
</tr>
<tr>
<td>Validation Set (10%)</td>
<td>2341</td>
</tr>
<tr>
<td>Test Set (20%)</td>
<td>4670</td>
</tr>
<tr>
<td><bold>Total</bold></td>
<td><bold>23,364</bold></td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Pie charts illustrating dataset distribution. (<bold>a</bold>) Proportional breakdown of samples across training, validation, and test subsets. (<bold>b</bold>) Distribution of samples across temporal conditions, weather scenarios, and traffic-density levels.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-4.tif"/>
</fig>
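<p>The video-level split described above can be sketched as follows. This is a minimal illustration rather than the script used to build the dataset: the frame naming scheme (a video identifier, an underscore, and a frame index), the function name <monospace>split_by_video</monospace>, and the fixed random seed are all assumptions. Because whole videos are assigned to a subset, the resulting image-level proportions only approximate the 70/10/20 target.</p>
<code language="python"><![CDATA[
```python
import random
from collections import defaultdict


def split_by_video(frame_names, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole videos to train/val/test so no video spans two subsets.

    Assumes each frame name starts with a video identifier followed by an
    underscore, e.g. "cam03_000120.jpg" (hypothetical naming scheme).
    """
    frames_by_video = defaultdict(list)
    for name in frame_names:
        video_id = name.rsplit("_", 1)[0]
        frames_by_video[video_id].append(name)

    # Shuffle video identifiers (not frames) with a fixed seed.
    video_ids = sorted(frames_by_video)
    random.Random(seed).shuffle(video_ids)

    n_train = round(ratios[0] * len(video_ids))
    n_val = round(ratios[1] * len(video_ids))
    groups = {
        "train": video_ids[:n_train],
        "val": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    # Expand each group of videos back into its list of frames.
    return {
        subset: [f for vid in vids for f in frames_by_video[vid]]
        for subset, vids in groups.items()
    }
```
]]></code>
<p>Splitting on the video identifier rather than on individual frames is what keeps near-duplicate consecutive frames from appearing in both the training and test subsets.</p>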
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Annotation Workflow</title>
<p>To construct a high-quality dataset for real-time vehicle detection in Vietnamese urban environments, we designed a semi-automatic annotation pipeline that integrates modern deep learning-based models with human-in-the-loop refinement. This hybrid strategy balances annotation efficiency with the accuracy required for downstream computer vision research. The overall annotation workflow is illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Overview of the proposed semi-automatic annotation pipeline.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-5.tif"/>
</fig>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Semi-Automatic Pre-Annotation Using YOLOv11m</title>
<p>The YOLOv11m model [<xref ref-type="bibr" rid="ref-22">22</xref>] was adopted for automatic pre-annotation. As a recent advancement in the Ultralytics YOLO family, YOLOv11m incorporates improvements in backbone architecture, feature aggregation, and training optimization (<xref ref-type="fig" rid="fig-6">Fig. 6</xref>). These enhancements provide a favorable accuracy&#x2013;speed trade-off, making the model suitable for dense urban traffic scenes with small and heavily occluded objects.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparison of representative YOLO model variants (adapted from [<xref ref-type="bibr" rid="ref-23">23</xref>]).</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-6.tif"/>
</fig>
<p>A total of 23,364 CCTV images were processed in a GPU-accelerated Google Colab environment and resized to 640 &#x00D7; 640 pixels. Initial annotations were generated using a YOLOv11m model pretrained on the COCO dataset and exported in YOLO format with associated model metadata to support reproducibility. This automated pre-labeling step reduced manual annotation effort and provided a consistent baseline for subsequent refinement.</p>
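<p>For reference, the YOLO label format mentioned above stores one object per line as a class index followed by the box center, width, and height, each normalized by the image size. The helper below is a small sketch of that conversion for a 640 &#x00D7; 640 frame; the function name and the example box are illustrative assumptions and are not part of the released tooling.</p>
<code language="python"><![CDATA[
```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w=640, img_h=640):
    """Convert a pixel-space corner box to one YOLO-format label line."""
    cx = (x1 + x2) / 2 / img_w   # normalized box center (x)
    cy = (y1 + y2) / 2 / img_h   # normalized box center (y)
    w = (x2 - x1) / img_w        # normalized width
    h = (y2 - y1) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical detection: class index 3 with corners (160, 320) and (320, 480).
print(to_yolo_line(3, 160, 320, 320, 480))
# prints "3 0.375000 0.625000 0.250000 0.250000"
```
]]></code>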
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Human-in-the-Loop Annotation Refinement</title>
<p>Automated annotations were further refined using a human-in-the-loop procedure to mitigate errors arising from domain discrepancies between Vietnamese urban traffic scenes and COCO-style datasets. Manual refinement was conducted using the CVAT platform, following standardized annotation guidelines that define class taxonomies, occlusion handling, and minimum bounding-box requirements (see <xref ref-type="fig" rid="fig-7">Fig. 7</xref>). Annotators were trained to correct localization inaccuracies, supplement missed detections (particularly for small or occluded objects), resolve class ambiguities (e.g., bus vs. truck), and remove false positives caused by background clutter.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>CVAT annotation interface used for manual refinement of object annotations.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-7.tif"/>
</fig>
</sec>
<sec id="s3_4_3">
<label>3.4.3</label>
<title>Bounding-Box Verification for Small and Occluded Objects</title>
<p>To better capture heavily occluded, partially visible, and small-scale objects, an additional verification pass was performed. This review focused on sub-20-pixel instances, vehicles in highly congested intersection areas, and objects partially occluded by other traffic participants or scene elements. Such targeted refinement enhances the representation of challenging cases, thereby improving the dataset&#x2019;s suitability for training robust real-world detection models.</p>
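<p>A filter of the following kind could be used to select instances for such a verification pass. It is only a sketch: it assumes YOLO-format labels on 640 &#x00D7; 640 frames and interprets "sub-20-pixel" as a box whose shorter side is below 20 pixels, which is one possible reading of the criterion. The function name <monospace>small_box_indices</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
def small_box_indices(label_lines, img_size=640, min_side_px=20):
    """Return indices of YOLO-format labels whose shorter side is under min_side_px."""
    flagged = []
    for idx, line in enumerate(label_lines):
        # Each line is "class cx cy w h" with normalized coordinates.
        _cls, _cx, _cy, w, h = line.split()
        shorter_side = min(float(w), float(h)) * img_size
        if shorter_side < min_side_px:
            flagged.append(idx)
    return flagged
```
]]></code>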
</sec>
<sec id="s3_4_4">
<label>3.4.4</label>
<title>Quality Assurance and Consistency Checks</title>
<p>To ensure dataset reliability, a multi-stage quality control procedure was employed (<xref ref-type="fig" rid="fig-8">Fig. 8</xref> presents examples after refinement). A subset of 10% of the annotations was cross-checked by an independent annotator. Inter-annotator agreement was evaluated using IoU, with cases below 0.5 subject to additional review. Further consistency checks enforced uniform bounding-box tightness, object boundaries, and class definitions, while random sampling was conducted to identify potential systematic annotation biases.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Sample frames after applying the proposed semi-automatic labeling approach.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-8.tif"/>
</fig>
<p>In addition, all annotators completed a structured training process prior to large-scale annotation. This training covered the annotation guidelines (class taxonomies, occlusion handling, and minimum bounding-box requirements) and was followed by pilot annotation sessions with expert feedback. To further reduce systematic bias, challenging samples, such as small objects, heavily occluded vehicles, and dense traffic scenes, were subject to targeted re-inspection. Annotation reliability was evaluated using both class-level agreement, measured by Cohen&#x2019;s Kappa, and box-level consistency based on IoU statistics. These quantitative checks complemented the qualitative review process and helped maintain consistent labeling across different annotators.</p>
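<p>The two agreement measures mentioned above can be illustrated as follows. This sketch assumes boxes given as (x1, y1, x2, y2) corners and assumes that the boxes of the two annotators have already been matched one to one; the function names are illustrative, and the matching procedure actually used for the dataset is not specified here.</p>
<code language="python"><![CDATA[
```python
from collections import Counter


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of class labels."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1.0 - expected)


def needs_review(box_a, box_b, threshold=0.5):
    """Flag a matched pair whose IoU falls below the review threshold."""
    return box_iou(box_a, box_b) < threshold
```
]]></code>
<p>The default threshold of 0.5 mirrors the IoU cut-off stated for the cross-check; pairs flagged by <monospace>needs_review</monospace> would correspond to the cases sent for additional review.</p>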
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Setup and Model Training</title>
<sec id="s4_1">
<label>4.1</label>
<title>Benchmark Protocol and Dataset Usage</title>
<p>To evaluate the practical utility of the proposed dataset, benchmark experiments were conducted using the YOLOv11m object detection model. A video-level data partitioning strategy was adopted to prevent temporal overlap and ensure an unbiased evaluation.</p>
<p>The distribution of annotated instances is reported in <xref ref-type="table" rid="table-2">Table 2</xref>, while <xref ref-type="fig" rid="fig-9">Fig. 9</xref> illustrates the dataset composition across different contextual factors, including time of day, weather conditions, and traffic density. This experimental setting facilitates a systematic assessment of detection performance under diverse urban traffic scenarios, supporting applications such as multi-class vehicle detection and traffic light state recognition.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Distribution of annotated instances across object classes.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Class</th>
<th>Train</th>
<th>Validation</th>
<th>Test</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Person</td>
<td>299,929</td>
<td>42,394</td>
<td>84,883</td>
<td>427,206</td>
</tr>
<tr>
<td>Bicycle</td>
<td>651</td>
<td>104</td>
<td>151</td>
<td>906</td>
</tr>
<tr>
<td>Motorbike</td>
<td>104,669</td>
<td>14,922</td>
<td>29,697</td>
<td>149,288</td>
</tr>
<tr>
<td>Car</td>
<td>223,380</td>
<td>31,658</td>
<td>63,784</td>
<td>318,822</td>
</tr>
<tr>
<td>Bus</td>
<td>4975</td>
<td>716</td>
<td>1441</td>
<td>7132</td>
</tr>
<tr>
<td>Truck</td>
<td>40,648</td>
<td>5961</td>
<td>11,684</td>
<td>58,293</td>
</tr>
<tr>
<td>Traffic Light</td>
<td>109,564</td>
<td>15,780</td>
<td>31,479</td>
<td>156,823</td>
</tr>
<tr>
<td><bold>Total</bold></td>
<td><bold>783,816</bold></td>
<td><bold>111,535</bold></td>
<td><bold>223,119</bold></td>
<td><bold>1,118,470</bold></td>
</tr>
</tbody>
</table>
</table-wrap><fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Dataset statistics categorized by time of day, weather conditions, and traffic-density levels.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-9.tif"/>
</fig>
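<p>As a quick consistency check on Table 2, the split proportions and per-class shares can be recomputed from the reported counts. The following Python sketch is illustrative only: the names <monospace>COUNTS</monospace>, <monospace>split_shares</monospace>, and <monospace>class_shares</monospace> are hypothetical and are not part of any released dataset tooling.</p>
<code language="python"><![CDATA[
```python
# Instance counts copied from Table 2: class -> (train, validation, test).
COUNTS = {
    "Person": (299_929, 42_394, 84_883),
    "Bicycle": (651, 104, 151),
    "Motorbike": (104_669, 14_922, 29_697),
    "Car": (223_380, 31_658, 63_784),
    "Bus": (4_975, 716, 1_441),
    "Truck": (40_648, 5_961, 11_684),
    "Traffic Light": (109_564, 15_780, 31_479),
}


def split_shares(counts):
    """Return the fraction of all instances in the train/validation/test splits."""
    totals = [sum(row[i] for row in counts.values()) for i in range(3)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def class_shares(counts):
    """Return each class's fraction of all annotated instances."""
    grand_total = sum(sum(row) for row in counts.values())
    return {name: sum(row) / grand_total for name, row in counts.items()}


if __name__ == "__main__":
    train, val, test = split_shares(COUNTS)
    print(f"train={train:.1%} val={val:.1%} test={test:.1%}")
    # List classes from rarest to most frequent.
    for name, share in sorted(class_shares(COUNTS).items(), key=lambda kv: kv[1]):
        print(f"{name:14s} {share:.2%}")
```
]]></code>
<p>With the counts in Table 2, the instance-level split is close to 70/10/20, and the Bicycle class accounts for less than 0.1% of all annotated instances, which makes the class imbalance explicit.</p>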
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Model Configuration and Training Strategy</title>
<p>The YOLOv11m model was selected as the benchmark detector due to its favorable accuracy&#x2013;speed trade-off and suitability for real-time urban traffic surveillance. Two configurations were evaluated: (i) a pretrained model initialized with COCO weights, and (ii) a fine-tuned model trained on the proposed dataset. All images were uniformly resized to 640 &#x00D7; 640 pixels throughout pre-annotation, training, and evaluation to ensure pipeline consistency.</p>
<p>To improve robustness and generalization under complex traffic conditions, a data augmentation strategy and corresponding hyperparameter configuration were incorporated into the YOLO training framework. The augmentation design was tailored specifically for fixed-view CCTV traffic cameras, ensuring that all transformations remained physically plausible while introducing sufficient variability for effective learning.</p>
<p>The augmentation pipeline integrates three categories of transformations: photometric, geometric, and composition-based augmentations. Photometric augmentations (<xref ref-type="fig" rid="fig-10">Fig. 10</xref>) were applied to simulate real-world illumination variations commonly observed in urban traffic scenes, including shadows, nighttime lighting, and headlight glare. These include controlled adjustments of hue (0.02&#x2013;0.04), saturation (0.5&#x2013;0.8), and brightness (0.3&#x2013;0.6).</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Examples of the photometric and geometric augmentations applied during training. (<bold>a</bold>) Original. (<bold>b</bold>) Horizontal flip. (<bold>c</bold>) Brightness (hsv_v). (<bold>d</bold>) Saturation (hsv_s).</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-10.tif"/>
</fig>
<p>Geometric augmentations were intentionally restricted to preserve realistic CCTV perspectives. Only minor translation (0.05&#x2013;0.1 of the image size), scaling (gain of 0&#x2013;0.5), and limited rotation (up to 45&#x00B0;) were applied, reflecting the fixed-camera setup and avoiding unrealistic distortions. In addition, mosaic and copy-paste augmentations (<xref ref-type="fig" rid="fig-11">Fig. 11</xref>) were employed to increase scene complexity and enhance robustness in dense and occlusion-heavy traffic scenarios, using four-image mosaic blending (applied with probability 0.5&#x2013;1) and limited vehicle instance copy-paste (probability 0&#x2013;0.2).</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Mosaic augmentation samples used in the training process. (<bold>a</bold>) Sample 1. (<bold>b</bold>) Sample 2. (<bold>c</bold>) Sample 3.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-11.tif"/>
</fig>
<p>All augmentation operations were integrated directly into the YOLO training pipeline using a predefined hyperparameter search space. The model was trained for 50 epochs with a batch size of 32, using the AdamW optimizer, an initial learning rate of 0.01, and standard YOLO weight decay parameter grouping. The complete augmentation configuration and hyperparameter settings are summarized in <xref ref-type="table" rid="table-3">Table 3</xref>. By constraining augmentation intensities to match CCTV imaging characteristics, the training process achieves stable optimization while improving robustness across diverse environmental conditions and traffic congestion levels.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Summary of the key augmentation techniques and hyperparameter settings.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Argument</th>
<th>Type</th>
<th>Default</th>
<th>Proposed Values</th>
<th>Hyperparameter Tuning (Best Parameter)</th>
<th>Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>hsv_h</td>
<td>Float</td>
<td>0.015</td>
<td>0.02&#x2013;0.04</td>
<td>0.021</td>
<td>0.0&#x2013;1.0</td>
</tr>
<tr>
<td>hsv_s</td>
<td>Float</td>
<td>0.7</td>
<td>0.5&#x2013;0.8</td>
<td>0.6</td>
<td>0.0&#x2013;1.0</td>
</tr>
<tr>
<td>hsv_v</td>
<td>Float</td>
<td>0.4</td>
<td>0.3&#x2013;0.6</td>
<td>0.38</td>
<td>0.0&#x2013;1.0</td>
</tr>
<tr>
<td>degrees</td>
<td>Float</td>
<td>0</td>
<td>0&#x2013;45</td>
<td>45</td>
<td>0.0&#x2013;180</td>
</tr>
<tr>
<td>translate</td>
<td>Float</td>
<td>0.1</td>
<td>0.05&#x2013;0.1</td>
<td>0.08</td>
<td>0.0&#x2013;1.0</td>
</tr>
<tr>
<td>scale</td>
<td>Float</td>
<td>0.5</td>
<td>0&#x2013;0.5</td>
<td>0.1</td>
<td>&#x2265; 0.0</td>
</tr>
<tr>
<td>mosaic</td>
<td>Float</td>
<td>1</td>
<td>0.5&#x2013;1</td>
<td>0.7</td>
<td>0.0&#x2013;1.0</td>
</tr>
<tr>
<td>copy_paste</td>
<td>Float</td>
<td>0</td>
<td>0&#x2013;0.2</td>
<td>0.1</td>
<td>0.0&#x2013;1.0</td>
</tr>
</tbody>
</table>
</table-wrap>
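<p>The configuration in Table 3 can be written down as a short Python sketch. It assumes that training is run with the Ultralytics package, whose training arguments share the names listed in the table; this is an assumption, since the text only refers to a YOLO training framework. The weight file name <monospace>yolo11m.pt</monospace>, the <monospace>data_yaml</monospace> path, and the helper names are likewise illustrative. The range check runs without any third-party dependency, while <monospace>train</monospace> is only a template.</p>
<code language="python"><![CDATA[
```python
# Best values reported in Table 3 (argument names follow the table).
TUNED = {
    "hsv_h": 0.021, "hsv_s": 0.6, "hsv_v": 0.38,
    "degrees": 45.0, "translate": 0.08, "scale": 0.1,
    "mosaic": 0.7, "copy_paste": 0.1,
}

# Proposed search intervals from Table 3, as (low, high).
PROPOSED = {
    "hsv_h": (0.02, 0.04), "hsv_s": (0.5, 0.8), "hsv_v": (0.3, 0.6),
    "degrees": (0.0, 45.0), "translate": (0.05, 0.1), "scale": (0.0, 0.5),
    "mosaic": (0.5, 1.0), "copy_paste": (0.0, 0.2),
}


def out_of_range(values, intervals):
    """Return the names of arguments whose value lies outside its interval."""
    return [k for k, v in values.items()
            if not intervals[k][0] <= v <= intervals[k][1]]


def train(data_yaml):
    """Fine-tune from COCO weights with the settings described in this section.

    Assumes the Ultralytics package is installed; `data_yaml` is a
    hypothetical path to a dataset description file.
    """
    from ultralytics import YOLO  # imported lazily so the check above runs without it

    model = YOLO("yolo11m.pt")  # COCO-pretrained medium model (assumed weight file name)
    return model.train(data=data_yaml, imgsz=640, epochs=50, batch=32,
                       optimizer="AdamW", lr0=0.01, **TUNED)


if __name__ == "__main__":
    print("out-of-range arguments:", out_of_range(TUNED, PROPOSED))
```
]]></code>
<p>Keeping the tuned values and the proposed intervals in separate dictionaries makes it straightforward to confirm that every tuned value stays inside the CCTV-motivated bounds before a training run is launched.</p>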
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Benchmark Evaluation</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Assessment Metric</title>
<p>To assess the effectiveness of the proposed dataset, benchmark experiments were conducted using the YOLOv11m object detection model. Detection accuracy is measured with the mean Average Precision (mAP), the mean of the per-class Average Precision (AP) values, where AP is evaluated at specific Intersection over Union (IoU) thresholds such as 0.50 (AP50) and 0.75 (AP75). The underlying quantities are defined as follows:
<list list-type="bullet">
<list-item>
<p>IoU quantifies the overlap between a predicted bounding box and the corresponding ground-truth annotation, as defined in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>A</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>O</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula></p></list-item>
<list-item>
<p>Precision (P) reflects the proportion of predicted bounding boxes that correspond to correct detections, as defined in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula>where <italic>TP</italic> (true positives) is the number of correctly detected objects and <italic>FP</italic> (false positives) is the number of incorrect detections.</p></list-item>
<list-item>
<p>Recall (R) measures the proportion of ground-truth objects that are successfully detected, as defined in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></disp-formula>where <italic>FN</italic> (false negatives) is the number of missed ground-truth objects.</p></list-item>
<list-item>
<p><italic>AP</italic> summarizes the precision&#x2013;recall curve into a single scalar value, as defined in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mo>&#x222B;</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>d</mml:mi><mml:mi>R</mml:mi></mml:math></disp-formula>where <italic>P</italic>(<italic>R</italic>) denotes precision as a function of recall, i.e., the precision&#x2013;recall curve. <italic>AP</italic> is usually reported at a fixed IoU threshold, such as:</p>
<p><inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mn>50</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0.50</mml:mn></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mn>75</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mi>U</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0.75</mml:mn></mml:math></inline-formula></p></list-item>
<list-item>
<p>mAP is computed as the mean of AP scores across all object classes, as defined in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>.
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>m</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mi>A</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <italic>N</italic> is the total number of object classes, <italic>AP</italic><sub><italic>i</italic></sub> is the Average Precision for class <italic>i</italic>.</p></list-item>
</list></p>
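<p>The definitions in Eqs. (1)&#x2013;(5) can be sketched in plain Python as follows. This is a simplified illustration rather than the evaluation code used for the reported results: it handles a single class on a single set of boxes, matches each detection greedily to the best unmatched ground-truth box, and integrates the precision&#x2013;recall curve after making precision non-increasing in recall. Standard evaluators pool detections over all test images and may use a different interpolation scheme. All function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
def iou(a, b):
    """Eq. (1): IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr=0.5):
    """Eq. (4) for one class: area under the precision-recall curve.

    detections: list of (confidence, box); gt_boxes: list of boxes.
    """
    if not gt_boxes:
        return 0.0
    matched = set()
    tp = fp = 0
    curve = []  # (recall, precision) after each detection, by descending confidence
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Greedily match to the best still-unmatched ground-truth box.
        candidates = [(iou(box, g), j) for j, g in enumerate(gt_boxes) if j not in matched]
        best_iou, best_j = max(candidates, default=(0.0, -1))
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
        curve.append((tp / len(gt_boxes), tp / (tp + fp)))  # Eqs. (3) and (2)
    # Make precision non-increasing in recall (scan from high to low recall).
    envelope, best_precision = [], 0.0
    for recall, precision in reversed(curve):
        best_precision = max(best_precision, precision)
        envelope.append((recall, best_precision))
    # Integrate the resulting step function over recall.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in reversed(envelope):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Eq. (5): unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 26))]
    print("AP50:", average_precision(dets, gt, 0.50))
    print("AP75:", average_precision(dets, gt, 0.75))
```
]]></code>
<p>In the toy example, the third detection overlaps its ground-truth box with an IoU of 0.6, so it counts as a true positive at the 0.50 threshold but as a false positive at 0.75, which is why AP75 is never higher than AP50 for the same detections.</p>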
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Experimental Environment Setup</title>
<p>The experiments were conducted on a Vast.ai cloud instance equipped with an NVIDIA GeForce RTX 3090 GPU, as detailed in <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Hardware configuration of the experimental environment.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Vast.ai Cloud</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU Name</td>
<td>1&#x00D7; NVIDIA GeForce RTX 3090</td>
</tr>
<tr>
<td>CPU</td>
<td>Intel Xeon E5-2683 v3, 56 CPUs/64 GB RAM</td>
</tr>
<tr>
<td>GPU (vRAM)</td>
<td>24 GB</td>
</tr>
<tr>
<td>GPU Architecture</td>
<td>Ampere architecture</td>
</tr>
<tr>
<td>CUDA Compute Capability</td>
<td>8.6 (Max CUDA 13.0)</td>
</tr>
<tr>
<td>CUDA Version</td>
<td>12.x</td>
</tr>
<tr>
<td>Shared Memory</td>
<td>Up to 100 KB/block</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>This environment was used to train and evaluate both the pretrained and fine-tuned YOLOv11m models under a unified hardware and software configuration, ensuring fair and reproducible comparisons. The setup supported data augmentation and domain-specific hyperparameter tuning required for effective training. Inference speed was evaluated at an input resolution of 640 &#x00D7; 640 pixels. The fine-tuned YOLOv11m model achieved an average inference speed of approximately 35 frames per second (FPS) for single-stream CCTV input, including preprocessing and postprocessing overhead, satisfying the real-time requirements of urban traffic monitoring and ITS.</p>
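<p>An end-to-end throughput figure of this kind can be measured with a simple timing loop. The sketch below is a generic illustration, not the measurement script used in this study; <monospace>measure_fps</monospace> and <monospace>process_frame</monospace> are hypothetical names, and the callable is expected to wrap preprocessing, inference, and postprocessing for one frame.</p>
<code language="python"><![CDATA[
```python
import time


def measure_fps(process_frame, frames, warmup=10):
    """Return average end-to-end frames per second over `frames`.

    `process_frame` is a hypothetical callable that should wrap preprocessing,
    model inference, and postprocessing for a single frame.
    """
    frames = list(frames)
    for frame in frames[:warmup]:  # warm-up runs are excluded from timing
        process_frame(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = time.perf_counter()
    for frame in timed:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


if __name__ == "__main__":
    # Stand-in workload: sleeping about 5 ms per frame instead of running a detector.
    fps = measure_fps(lambda frame: time.sleep(0.005), range(60))
    print(f"{fps:.1f} FPS ({1000.0 / fps:.1f} ms per frame)")
```
]]></code>
<p>At 35 FPS, the per-frame budget is roughly 28.6 ms. When timing a GPU model, the callable should block until the results are available on the host; otherwise asynchronous execution can make the measured rate look higher than it is.</p>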
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Experimental Results</title>
<p>The objective of the experimental evaluation in this work is to validate the quality, diversity, and practical relevance of the proposed dataset rather than to compare detection architectures or achieve state-of-the-art performance. Accordingly, a single strong and widely adopted real-time detector, YOLOv11m, is employed as a representative baseline to evaluate object detection performance across all annotated classes. To examine robustness under different conditions, the dataset was stratified by time of day, weather, and traffic density. Both pretrained and fine-tuned models were evaluated on a dedicated test set of 4670 images to ensure a fair comparison. As shown in <xref ref-type="fig" rid="fig-12">Figs. 12</xref> and <xref ref-type="fig" rid="fig-13">13</xref>, domain-specific training leads to consistent performance improvements across diverse urban traffic scenarios.</p>
<fig id="fig-12">
<label>Figure 12</label>
<caption>
<title>Comparison of Average Precision at IoU thresholds of 0.50 (AP50) and 0.75 (AP75) between the pretrained and fine-tuned models.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-12.tif"/>
</fig><fig id="fig-13">
<label>Figure 13</label>
<caption>
<title>Comparison of mean Average Precision (mAP) between the pretrained and fine-tuned models. (<bold>a</bold>) mAP comparison illustrated using a clustered bar chart. (<bold>b</bold>) mAP trends of the pretrained and fine-tuned models shown using a line chart.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_78756-fig-13.tif"/>
</fig>
</sec>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Results Analysis and Discussion</title>
<sec id="s4_4_1">
<label>4.4.1</label>
<title>Overall Performance Analysis</title>
<p>Across all object classes, the fine-tuned model consistently outperforms the pretrained YOLOv11m baseline. The baseline model, trained on the general-purpose COCO dataset, shows clear limitations in Vietnamese urban traffic scenes, especially under high traffic density, small object scales, and frequent occlusions. Fine-tuning on the proposed domain-specific dataset leads to improved localization accuracy and more reliable discrimination among visually similar vehicle categories. The quantitative comparisons in <xref ref-type="fig" rid="fig-12">Figs. 12</xref> and <xref ref-type="fig" rid="fig-13">13</xref> further illustrate the performance gap between the pretrained and fine-tuned detectors.</p>

</sec>
<sec id="s4_4_2">
<label>4.4.2</label>
<title>Per-Class Detection Performance Analysis</title>
<p>To complement the overall mAP evaluation, we analyze detection performance at the per-class level to reveal class-specific strengths and limitations, particularly for small and less frequent traffic participants that are critical in urban environments. <xref ref-type="table" rid="table-5">Table 5</xref> reports Average Precision (AP) at IoU thresholds of 0.50 and 0.75 for each class, comparing the pretrained and fine-tuned YOLOv11m models.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Per-class AP comparison between pretrained and fine-tuned models.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Class</th>
<th colspan="2">AP50</th>
<th colspan="2">AP75</th>
</tr>
<tr>
<th>Pretrained</th>
<th>Fine-Tuned</th>
<th>Pretrained</th>
<th>Fine-Tuned</th>
</tr>
</thead>
<tbody>
<tr>
<td>Person</td>
<td>0.559</td>
<td>0.922</td>
<td>0.377</td>
<td>0.745</td>
</tr>
<tr>
<td>Bicycle</td>
<td>0.132</td>
<td>0.681</td>
<td>0.09</td>
<td>0.597</td>
</tr>
<tr>
<td>Motorbike</td>
<td>0.461</td>
<td>0.929</td>
<td>0.245</td>
<td>0.806</td>
</tr>
<tr>
<td>Car</td>
<td>0.823</td>
<td>0.975</td>
<td>0.731</td>
<td>0.958</td>
</tr>
<tr>
<td>Bus</td>
<td>0.305</td>
<td>0.863</td>
<td>0.283</td>
<td>0.835</td>
</tr>
<tr>
<td>Truck</td>
<td>0.61</td>
<td>0.953</td>
<td>0.499</td>
<td>0.917</td>
</tr>
<tr>
<td>Traffic light</td>
<td>0.321</td>
<td>0.992</td>
<td>0.27</td>
<td>0.989</td>
</tr>
<tr>
<td><bold>Total/Average</bold></td>
<td><bold>0.459</bold></td>
<td><bold>0.902</bold></td>
<td><bold>0.356</bold></td>
<td><bold>0.835</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For the pretrained model, performance varies substantially across classes. Large and visually distinctive objects such as cars achieve relatively high AP (0.823 at AP50), whereas smaller or infrequent classes, including bicycles (0.132 AP50), buses (0.305 AP50), and traffic lights (0.321 AP50), exhibit severe performance degradation. This highlights the limited transferability of COCO-pretrained detectors to Vietnamese mixed-traffic scenes characterized by small object scales and frequent occlusion.</p>
<p>After fine-tuning on the proposed dataset, substantial improvements are observed across all categories. Notably, AP50 increases to 0.681 for bicycles, 0.922 for persons, and 0.992 for traffic lights. Despite these gains, the stricter AP75 criterion remains more challenging for small or occluded objects, with bicycles (0.597) and persons (0.745) scoring lowest, reflecting the combined effects of class imbalance, object scale, and dense traffic interactions.</p>
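<p>The averages in the last row of Table 5 are consistent with an unweighted (macro) mean over the seven classes, which can be checked with the short Python sketch below. The values are copied from Table 5; the names <monospace>TABLE5</monospace>, <monospace>column_means</monospace>, and <monospace>ap50_gains</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
# Per-class AP from Table 5: class -> (AP50 pretrained, AP50 fine-tuned,
#                                      AP75 pretrained, AP75 fine-tuned).
TABLE5 = {
    "Person": (0.559, 0.922, 0.377, 0.745),
    "Bicycle": (0.132, 0.681, 0.090, 0.597),
    "Motorbike": (0.461, 0.929, 0.245, 0.806),
    "Car": (0.823, 0.975, 0.731, 0.958),
    "Bus": (0.305, 0.863, 0.283, 0.835),
    "Truck": (0.610, 0.953, 0.499, 0.917),
    "Traffic light": (0.321, 0.992, 0.270, 0.989),
}


def column_means(table):
    """Unweighted (macro) mean of each column, rounded to three decimals."""
    n = len(table)
    return tuple(round(sum(row[i] for row in table.values()) / n, 3) for i in range(4))


def ap50_gains(table):
    """Absolute AP50 gain from fine-tuning, per class."""
    return {name: round(row[1] - row[0], 3) for name, row in table.items()}


if __name__ == "__main__":
    print(column_means(TABLE5))
    print(ap50_gains(TABLE5))
```
]]></code>
<p>Because the mean is unweighted, the rare Bicycle class contributes as much to the reported average as the frequent Person and Car classes. In terms of absolute AP50 gain, traffic lights benefit most from fine-tuning and cars least.</p>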
</sec>
<sec id="s4_4_3">
<label>4.4.3</label>
<title>Condition-Based Performance Analysis</title>
<p>To further understand how these class-wise behaviors manifest under realistic surveillance conditions, we analyze detection performance stratified by time of day, weather conditions, and traffic density.</p>
<p>Time of Day: <xref ref-type="table" rid="table-6">Table 6</xref> shows that the pretrained model degrades notably in evening scenes due to illumination changes, particularly for small and infrequent objects. After fine-tuning, performance becomes more stable across all periods, with substantial gains for bicycles and traffic lights, indicating improved robustness under low-light conditions. The main exception is traffic-light detection in morning scenes, which remains comparatively low after fine-tuning (0.419).</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Class-wise mAP@[50&#x2013;95] performance by time of day.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Class</th>
<th colspan="3">Pretrained</th>
<th colspan="3">Fine-Tuned</th>
</tr>
<tr>
<th>Morning</th>
<th>Afternoon</th>
<th>Evening</th>
<th>Morning</th>
<th>Afternoon</th>
<th>Evening</th>
</tr>
</thead>
<tbody>
<tr>
<td>Person</td>
<td>0.3205</td>
<td>0.439667</td>
<td>0.294333</td>
<td>0.7065</td>
<td>0.768</td>
<td>0.667667</td>
</tr>
<tr>
<td>Bicycle</td>
<td>0.115</td>
<td>0.121667</td>
<td>0.0813333</td>
<td>0.474</td>
<td>0.563667</td>
<td>0.435333</td>
</tr>
<tr>
<td>Motorbike</td>
<td>0.25375</td>
<td>0.271667</td>
<td>0.154333</td>
<td>0.74425</td>
<td>0.750333</td>
<td>0.66</td>
</tr>
<tr>
<td>Car</td>
<td>0.673</td>
<td>0.688667</td>
<td>0.550667</td>
<td>0.91625</td>
<td>0.935667</td>
<td>0.888667</td>
</tr>
<tr>
<td>Bus</td>
<td>0.35925</td>
<td>0.252333</td>
<td>0.117333</td>
<td>0.85025</td>
<td>0.758</td>
<td>0.764</td>
</tr>
<tr>
<td>Truck</td>
<td>0.25</td>
<td>0.523333</td>
<td>0.217333</td>
<td>0.86475</td>
<td>0.883667</td>
<td>0.854333</td>
</tr>
<tr>
<td>Traffic light</td>
<td>0.12525</td>
<td>0.357</td>
<td>0.250667</td>
<td>0.41925</td>
<td>0.984333</td>
<td>0.959</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Weather Conditions: <xref ref-type="table" rid="table-7">Table 7</xref> indicates that the pretrained model degrades in rainy conditions for most classes, especially for small or low-contrast objects affected by blur and reflections. Fine-tuning yields substantially higher performance across all weather types, including rain, demonstrating improved robustness to weather-related visual challenges. The weakest fine-tuned results are bicycles in rain (0.333) and traffic lights in sunshine (0.496).</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Class-wise mAP@[50&#x2013;95] performance by weather conditions.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Class</th>
<th colspan="3">Pretrained</th>
<th colspan="3">Fine-Tuned</th>
</tr>
<tr>
<th>Fine</th>
<th>Rain</th>
<th>Sunshine</th>
<th>Fine</th>
<th>Rain</th>
<th>Sunshine</th>
</tr>
</thead>
<tbody>
<tr>
<td>Person</td>
<td>0.386</td>
<td>0.338333</td>
<td>0.382</td>
<td>0.768667</td>
<td>0.689667</td>
<td>0.7375</td>
</tr>
<tr>
<td>Bicycle</td>
<td>0.0753333</td>
<td>0.123</td>
<td>0.074</td>
<td>0.586667</td>
<td>0.333333</td>
<td>0.4695</td>
</tr>
<tr>
<td>Motorbike</td>
<td>0.250333</td>
<td>0.178667</td>
<td>0.239</td>
<td>0.739667</td>
<td>0.664</td>
<td>0.7475</td>
</tr>
<tr>
<td>Car</td>
<td>0.633</td>
<td>0.574333</td>
<td>0.6895</td>
<td>0.909667</td>
<td>0.89</td>
<td>0.933</td>
</tr>
<tr>
<td>Bus</td>
<td>0.204</td>
<td>0.278667</td>
<td>0.2695</td>
<td>0.74</td>
<td>0.835</td>
<td>0.7905</td>
</tr>
<tr>
<td>Truck</td>
<td>0.274333</td>
<td>0.189667</td>
<td>0.289</td>
<td>0.873333</td>
<td>0.868</td>
<td>0.8515</td>
</tr>
<tr>
<td>Traffic light</td>
<td>0.347667</td>
<td>0.209</td>
<td>0.012</td>
<td>0.984</td>
<td>0.889333</td>
<td>0.496</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Traffic Density: <xref ref-type="table" rid="table-8">Table 8</xref> shows that the pretrained model has low accuracy for most classes at both density levels, with person, truck, and traffic-light detection degrading further in dense traffic. After fine-tuning, performance improves substantially for every class in both sparse and dense scenes, indicating enhanced robustness to occlusion and object overlap in dense urban traffic, although traffic-light detection in dense scenes remains comparatively low (0.4475). Overall, the condition-based analysis demonstrates the robustness of the proposed dataset while revealing its remaining limitations in challenging urban scenarios.</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Class-wise mAP@[50&#x2013;95] performance by traffic density.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Class</th>
<th colspan="2">Pretrained</th>
<th colspan="2">Fine-Tuned</th>
</tr>
<tr>
<th>Sparse</th>
<th>Dense</th>
<th>Sparse</th>
<th>Dense</th>
</tr>
</thead>
<tbody>
<tr>
<td>Person</td>
<td>0.386</td>
<td>0.2735</td>
<td>0.768667</td>
<td>0.6415</td>
</tr>
<tr>
<td>Bicycle</td>
<td>0.0753333</td>
<td>0.163</td>
<td>0.586667</td>
<td>0.597</td>
</tr>
<tr>
<td>Motorbike</td>
<td>0.250333</td>
<td>0.264</td>
<td>0.739667</td>
<td>0.751</td>
</tr>
<tr>
<td>Car</td>
<td>0.633</td>
<td>0.7045</td>
<td>0.909667</td>
<td>0.9365</td>
</tr>
<tr>
<td>Bus</td>
<td>0.204</td>
<td>0.2795</td>
<td>0.74</td>
<td>0.8305</td>
</tr>
<tr>
<td>Truck</td>
<td>0.452333</td>
<td>0.3475</td>
<td>0.873333</td>
<td>0.873</td>
</tr>
<tr>
<td>Traffic light</td>
<td>0.347667</td>
<td>0.327</td>
<td>0.984</td>
<td>0.4475</td>
</tr>
</tbody>
</table>
</table-wrap>
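<p>The condition-based analysis in this subsection amounts to partitioning the test images by one contextual factor and running the same evaluation on each partition. A minimal Python sketch of that bookkeeping is given below. The record keys (<monospace>image</monospace>, <monospace>time_of_day</monospace>, <monospace>weather</monospace>, <monospace>density</monospace>) and the function names are hypothetical, and the <monospace>evaluate</monospace> argument stands in for whichever metric routine is actually used.</p>
<code language="python"><![CDATA[
```python
from collections import defaultdict


def stratify(records, factor):
    """Group image identifiers by the value of one contextual factor.

    `records` is a list of dicts with hypothetical keys such as
    "image", "time_of_day", "weather", and "density".
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record["image"])
    return dict(groups)


def evaluate_by_condition(records, factor, evaluate):
    """Run `evaluate` (a caller-supplied function: image list -> metric) per stratum."""
    return {value: evaluate(images) for value, images in stratify(records, factor).items()}


if __name__ == "__main__":
    records = [
        {"image": "a.jpg", "time_of_day": "morning", "weather": "rain", "density": "dense"},
        {"image": "b.jpg", "time_of_day": "evening", "weather": "fine", "density": "sparse"},
        {"image": "c.jpg", "time_of_day": "evening", "weather": "rain", "density": "dense"},
    ]
    # Stand-in metric: the number of images in each stratum.
    print(evaluate_by_condition(records, "weather", len))
```
]]></code>
<p>Reporting the number of images in each stratum alongside the metric helps readers judge how much weight a per-condition score can bear, particularly for rare classes in rare conditions.</p>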
</sec>
<sec id="s4_4_4">
<label>4.4.4</label>
<title>Summary of Experimental Findings</title>
<p>The experimental results indicate that the proposed large-scale, domain-specific dataset plays a central role in the observed performance improvements over the pretrained YOLOv11m model. Fine-tuning on this dataset yields consistent gains across object classes and evaluation scenarios, highlighting the limitations of general-purpose datasets when applied to mixed and highly congested urban traffic environments. The results also show improved robustness under challenging conditions, including nighttime scenes, rainy weather, and dense traffic. This robustness is primarily attributed to the dataset&#x2019;s comprehensive coverage of real-world scenarios and the careful refinement of annotations for small and occluded objects. Standard data augmentation techniques and commonly adopted hyperparameter settings were applied to ensure stable training. Isolated ablation studies were not conducted, as the primary focus of this work is dataset construction rather than training strategy optimization. Similarly, aspects related to deployment efficiency, hardware-specific optimization, and large-scale multi-stream scalability are beyond the scope of this experimental study, as they depend strongly on application-specific system configurations. Overall, the proposed dataset provides a reliable benchmark for real-time vehicle detection and establishes a solid foundation for future ITS research in Southeast Asian urban contexts.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we addressed the limitations of existing object detection datasets in representing the complexity of Vietnamese urban traffic. This complexity is characterized by mixed vehicle types, dense traffic flows, and challenging environmental conditions. We introduced a new large-scale dataset comprising 23,364 images collected from fixed-view CCTV traffic cameras at major intersections in Da Nang City, Vietnam. The dataset is designed to support real-time vehicle detection in urban environments and covers diverse temporal conditions, weather scenarios, and traffic density levels. It is extensively annotated to include key traffic participants, particularly motorbikes, as well as fine-grained traffic light states essential for intelligent traffic monitoring systems.</p>
<p>To ensure annotation accuracy and scalability, we employed a semi-automatic annotation pipeline combining YOLOv11m-based pre-annotation with manual verification and refinement using CVAT. This hybrid workflow significantly reduced labeling effort while preserving high-quality annotations, especially for small, densely packed, and partially occluded objects. The final dataset contains over 1.1 million labeled instances with consistent and reliable annotations.</p>
<p>Experimental evaluations using the pretrained YOLOv11m baseline and a model fine-tuned on the proposed dataset show that general-purpose pretrained models perform poorly when directly applied to Vietnamese traffic scenes. In contrast, fine-tuning on the proposed dataset yields substantial improvements across all object classes and evaluation conditions, including nighttime, rainy weather, and dense traffic. These results highlight the importance of localized, context-aware datasets for robust traffic perception.</p>
<p>Although standard training enhancements such as data augmentation and hyperparameter tuning were applied, the primary contribution of this work lies in the dataset rather than training strategy optimization. Overall, the proposed dataset and annotation framework provide a strong benchmark for AI-driven traffic perception in Vietnam. We acknowledge that all data were collected from Da Nang City, which may introduce a degree of geographical bias. However, Da Nang represents a typical Vietnamese urban environment characterized by mixed traffic flow, high motorbike density, and diverse weather conditions. As such, the dataset captures fundamental traffic characteristics common across many Vietnamese and Southeast Asian cities. Future work will focus on expanding the dataset to additional locations, exploring advanced detection architectures, and investigating techniques such as domain adaptation, self-supervised learning, and multi-camera fusion.</p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported in part by the Ministry of Science and Technology (MOST), The University of Danang&#x2014;University of Technology and Education, and the School of Computer Science, Duy Tan University, Da Nang City, Vietnam.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Quang Dong Nguyen Vo: Conceptualization, methodology, software, investigation, writing&#x2013;original draft. Gia Nhu Nguyen: Conceptualization, investigation, writing&#x2013;review and editing. Hoang Vu Tran: Conceptualization, investigation, writing&#x2013;review and editing, supervision. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The dataset is available for non-commercial research use upon reasonable request to the corresponding author and is provided under controlled access to ensure ethical and privacy compliance. Commercial use requires explicit permission from the authors.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>The dataset was collected from publicly operated urban traffic CCTV systems and used solely for research purposes. Personally identifiable information, including faces and license plates, was anonymized when visible. The dataset contains only bounding-box annotations without identity tracking or personal metadata and complies with applicable Vietnamese data protection regulations.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Majstorovi&#x0107;</surname> <given-names>&#x017D;</given-names></string-name>, <string-name><surname>Ti&#x0161;ljari&#x0107;</surname> <given-names>L</given-names></string-name>, <string-name><surname>Ivanjko</surname> <given-names>E</given-names></string-name>, <string-name><surname>Cari&#x0107;</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Urban traffic signal control under mixed traffic flows: literature review</article-title>. <source>Appl Sci</source>. <year>2023</year>;<volume>13</volume>(<issue>7</issue>):<fpage>4484</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app13074484</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Huu</surname> <given-names>DN</given-names></string-name>, <string-name><surname>Ngoc</surname> <given-names>VN</given-names></string-name></person-group>. <article-title>Analysis study of current transportation status in Vietnam&#x2019;s urban traffic and the transition to electric two-wheelers mobility</article-title>. <source>Sustainability</source>. <year>2021</year>;<volume>13</volume>(<issue>10</issue>):<fpage>5577</fpage>. doi:<pub-id pub-id-type="doi">10.3390/su13105577</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Nguyen</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Crowd-AI sensing based traffic analysis for Ho Chi Minh City planning simulation</article-title>. <source>Dir Comput Inf Sci Eng</source>. <year>2020</year>;<volume>20</volume>(<issue>2025234</issue>):<fpage>25234</fpage>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Buch</surname> <given-names>N</given-names></string-name>, <string-name><surname>Velastin</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Orwell</surname> <given-names>J</given-names></string-name></person-group>. <article-title>A review of computer vision techniques for the analysis of urban traffic</article-title>. <source>IEEE Trans Intell Transport Syst</source>. <year>2011</year>;<volume>12</volume>(<issue>3</issue>):<fpage>920</fpage>&#x2013;<lpage>39</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2011.2119372</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tamizh Selvi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Domilin Shyni</surname> <given-names>I</given-names></string-name>, <string-name><surname>Rexiline Sheeba</surname> <given-names>I</given-names></string-name>, <string-name><surname>Jayasudha</surname> <given-names>FV</given-names></string-name>, <string-name><surname>Sanju</surname> <given-names>IMS</given-names></string-name></person-group>. <article-title>Real-time traffic monitoring and analysis using YOLO-based object detection</article-title>. In: <conf-name>Proceedings of the 2025 International Conference on Next Generation Computing Systems (ICNGCS); 2025 Aug 21&#x2013;22</conf-name>; <publisher-loc>Coimbatore, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICNGCS64900.2025.11183064</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Aswini</surname> <given-names>N</given-names></string-name>, <string-name><surname>Hegde</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Real time traffic monitoring using YOLO V5</article-title>. In: <conf-name>Proceedings of the 2023 5th International Conference on Inventive Research in Computing Applications (ICIRCA); 2023 Aug 3&#x2013;5</conf-name>; <publisher-loc>Coimbatore, India</publisher-loc>. p. <fpage>572</fpage>&#x2013;<lpage>6</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICIRCA57980.2023.10220896</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Maity</surname> <given-names>M</given-names></string-name>, <string-name><surname>Banerjee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sinha Chaudhuri</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Faster R-CNN and YOLO based vehicle detection: a survey</article-title>. In: <conf-name>Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC); 2021 Apr 8&#x2013;10</conf-name>; <publisher-loc>Erode, India</publisher-loc>. p. <fpage>1442</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCMC51019.2021.9418274</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abbasi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Shahraki</surname> <given-names>A</given-names></string-name>, <string-name><surname>Taherkordi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Deep learning for network traffic monitoring and analysis (NTMA): a survey</article-title>. <source>Comput Commun</source>. <year>2021</year>;<volume>170</volume>(<issue>3</issue>):<fpage>19</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.comcom.2021.01.021</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Almukhalfi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Noor</surname> <given-names>A</given-names></string-name>, <string-name><surname>Noor</surname> <given-names>TH</given-names></string-name></person-group>. <article-title>Traffic management approaches using machine learning and deep learning techniques: a survey</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>133</volume>(<issue>3</issue>):<fpage>108147</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2024.108147</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Keypoint detection-based and multi-deep learning model integrated method for identifying vehicle axle load spatial-temporal distribution</article-title>. <source>Adv Eng Inform</source>. <year>2024</year>;<volume>62</volume>:<fpage>102688</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.aei.2024.102688</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Gkioxari</surname> <given-names>G</given-names></string-name>, <string-name><surname>Doll&#x00E1;r</surname> <given-names>P</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Mask R-CNN</article-title>. In: <conf-name>Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22&#x2013;29</conf-name>; <publisher-loc>Venice, Italy</publisher-loc>. p. <fpage>2980</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Ge</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>YOLOX: exceeding YOLO series in 2021</article-title>. <comment>arXiv:2107.08430. 2021</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2107.08430</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Lyu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>RTMDet: an empirical study of designing real-time object detectors</article-title>. <comment>arXiv:2212.07784. 2022</comment>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>TY</given-names></string-name>, <string-name><surname>Maire</surname> <given-names>M</given-names></string-name>, <string-name><surname>Belongie</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hays</surname> <given-names>J</given-names></string-name>, <string-name><surname>Perona</surname> <given-names>P</given-names></string-name>, <string-name><surname>Ramanan</surname> <given-names>D</given-names></string-name>, <etal>et al</etal></person-group>. <chapter-title>Microsoft COCO: common objects in context</chapter-title>. In: <person-group person-group-type="editor"><string-name><surname>Fleet</surname> <given-names>D</given-names></string-name>, <string-name><surname>Pajdla</surname> <given-names>T</given-names></string-name>, <string-name><surname>Schiele</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tuytelaars</surname> <given-names>T</given-names></string-name></person-group>, editors. <source>Computer Vision&#x2014;ECCV 2014. ECCV 2014. Lecture notes in computer science</source>. Vol. <volume>8693</volume>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2014</year>. p. <fpage>740</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Du</surname> <given-names>D</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lei</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>MC</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking</article-title>. <source>Comput Vis Image Underst</source>. <year>2020</year>;<volume>193</volume>(<issue>9</issue>):<fpage>102907</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cviu.2020.102907</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Du</surname> <given-names>D</given-names></string-name>, <string-name><surname>Bian</surname> <given-names>X</given-names></string-name>, <string-name><surname>Fan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Q</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Detection and tracking meet drones challenge</article-title>. <comment>arXiv:2001.06303. 2020</comment>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Trinh</surname> <given-names>T</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>K</given-names></string-name></person-group>. <article-title>A Vietnamese benchmark for vehicle detection and real-time empirical evaluation</article-title>. <source>Can Tho Univ J Sci</source>. <year>2022</year>;<volume>14</volume>(<issue>3</issue>):<fpage>45</fpage>&#x2013;<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.22144/ctu.jen.2022.042</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sapkota</surname> <given-names>R</given-names></string-name>, <string-name><surname>Calero</surname> <given-names>MF</given-names></string-name>, <string-name><surname>Qureshi</surname> <given-names>R</given-names></string-name>, <string-name><surname>Badgujar</surname> <given-names>C</given-names></string-name>, <string-name><surname>Nepal</surname> <given-names>U</given-names></string-name>, <string-name><surname>Poulose</surname> <given-names>A</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>YOLO advances to its genesis: a decadal and comprehensive review of the You Only Look Once (YOLO) series</article-title>. <comment>arXiv:2406.19407. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2406.19407</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ning</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>PBA-YOLOv7: an object detection method based on an improved YOLOv7 network</article-title>. <source>Appl Sci</source>. <year>2023</year>;<volume>13</volume>(<issue>18</issue>):<fpage>10436</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app131810436</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Faster R-CNN: towards real-time object detection with region proposal networks</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2017</year>;<volume>39</volume>(<issue>6</issue>):<fpage>1137</fpage>&#x2013;<lpage>49</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2016.2577031</pub-id>; <pub-id pub-id-type="pmid">27295650</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>A monocular-based framework for accurate identification of spatial-temporal distribution of vehicle wheel loads under occlusion scenarios</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>133</volume>(<issue>1</issue>):<fpage>107972</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2024.107972</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Jegham</surname> <given-names>N</given-names></string-name>, <string-name><surname>Koh</surname> <given-names>CY</given-names></string-name>, <string-name><surname>Abdelatti</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hendawi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Evaluating the evolution of YOLO (You Only Look Once) models: a comprehensive benchmark study of YOLO11 and its predecessors</article-title>. <comment>arXiv:2411.00201. 2024</comment>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Ultralytics</collab></person-group>. <article-title>Ultralytics YOLOv11</article-title>; <year>2025</year> [cited 2026 Feb 19]. Available from: <ext-link ext-link-type="uri" xlink:href="https://www.ultralytics.com/blog/comparing-ultralytics-yolo11-vs-previous-yolo-models">https://www.ultralytics.com/blog/comparing-ultralytics-yolo11-vs-previous-yolo-models</ext-link>.</mixed-citation></ref>
</ref-list>
</back></article>