<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">72626</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.072626</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Fine-Grained Recognition Model based on Discriminative Region Localization and Efficient Second-Order Feature Encoding</article-title>
<alt-title alt-title-type="left-running-head">A Fine-Grained Recognition Model based on Discriminative Region Localization and Efficient Second-Order Feature Encoding</alt-title>
<alt-title alt-title-type="right-running-head">A Fine-Grained Recognition Model based on Discriminative Region Localization and Efficient Second-Order Feature Encoding</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhang</surname><given-names>Xiaorui</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><email>zxr365@126.com</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Yingying</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Sun</surname><given-names>Wei</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhou</surname><given-names>Shiyu</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Haoming</given-names></name><xref ref-type="aff" rid="aff-4">4</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Pengpai</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>College of Computer and Information Engineering, Nanjing Tech University</institution>, <addr-line>Nanjing, 211816</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Software, Nanjing University of Information Science and Technology</institution>, <addr-line>Nanjing, 210044</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Automation, Nanjing University of Information Science and Technology</institution>, <addr-line>Nanjing, 210044</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>School of Computer Science, Nanjing University of Information Science and Technology</institution>, <addr-line>Nanjing, 210044</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Xiaorui Zhang. Email: <email>zxr365@126.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>37</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>08</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>11</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_72626.pdf"></self-uri>
<abstract>
<p>Discriminative region localization and efficient feature encoding are crucial for fine-grained object recognition. However, existing data augmentation methods struggle to accurately locate discriminative regions in complex backgrounds, small target objects, and limited training data, leading to poor recognition. Fine-grained images exhibit &#x201C;small inter-class differences,&#x201D; and while second-order feature encoding enhances discrimination, it often requires dual Convolutional Neural Networks (CNN), increasing training time and complexity. This study proposes a model integrating discriminative region localization and efficient second-order feature encoding. By ranking feature map channels via a fully connected layer, it selects high-importance channels to generate an enhanced map, accurately locating discriminative regions. Cropping and erasing augmentations further refine recognition. To improve efficiency, a novel second-order feature encoding module generates an attention map from the fourth convolutional group of Residual Network 50 layers (ResNet-50) and multiplies it with features from the fifth group, producing second-order features while reducing dimensionality and training time. Experiments on Caltech-University of California, San Diego Birds-200-2011 (CUB-200-2011), Stanford Car, and Fine-Grained Visual Classification of Aircraft (FGVC Aircraft) datasets show state-of-the-art accuracy of 88.9%, 94.7%, and 93.3%, respectively.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Fine-grained recognition</kwd>
<kwd>feature encoding</kwd>
<kwd>data augmentation</kwd>
<kwd>second-order feature</kwd>
<kwd>discriminative regions</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62272236</award-id>
<award-id>62376128</award-id>
<award-id>62306139</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Natural Science Foundation of Jiangsu Province</funding-source>
<award-id>BK20201136</award-id>
<award-id>BK20191401</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Fine-grained recognition aims to provide a more detailed and specific classification of similar objects or categories. Its applications span various fields such as everyday life, commerce, security, and nature conservation [<xref ref-type="bibr" rid="ref-1">1</xref>]. For example, in bird conservation, the precise identification of bird species helps scientists monitor populations and develop conservation strategies. In commerce, accurately identifying different product types and specifications allows customers to receive tailored recommendations, increasing sales. However, fine-grained images often have complex backgrounds, and the target objects to be identified occupy a relatively small proportion of the image, making existing methods time-consuming to train and limited in the recognition accuracy they can achieve. Therefore, there is a need to develop faster and more accurate fine-grained recognition methods.</p>
<p>In the field of computer vision, fine-grained recognition has long been a difficult topic, leading researchers to propose a variety of approaches to this challenging problem. For instance, there are frequently only a few high-quality image samples available due to the high cost of manual annotation. Moreover, fine-grained samples follow a long-tail distribution [<xref ref-type="bibr" rid="ref-2">2</xref>], where there are many samples in the &#x201C;head classes&#x201D; with few categories, and fewer samples in the &#x201C;tail classes&#x201D; with many categories [<xref ref-type="bibr" rid="ref-3">3</xref>]. This distribution frequently results in models that identify head classes well but tail classes poorly. Researchers have proposed data augmentation techniques [<xref ref-type="bibr" rid="ref-4">4</xref>] to expand datasets. However, these methods have limitations: some cropped samples contain only background regions, and erased samples may lose discriminative regions. Consequently, these methods generate many ineffective augmented samples, reducing augmentation effectiveness. Additionally, fine-grained images [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>] from different categories often exhibit subtle differences only in some local regions, i.e., they are characterized by small inter-class differences. Traditional first-order features [<xref ref-type="bibr" rid="ref-4">4</xref>] frequently fail to capture these subtle differences, as they are extracted from isolated local regions of the image and do not account for spatial relationships or contextual information. To distinguish subtle differences, researchers have introduced second-order feature encoding methods, which compute the covariance or correlation matrix between first-order features. By capturing the joint distribution of first-order features, second-order features reveal the interdependencies between features within local regions, thereby highlighting subtle variations that may be overlooked when considering the image as a whole. Nevertheless, traditional second-order feature encoding methods are computationally expensive due to their high dimensionality and the need for two rounds of feature extraction, resulting in time-consuming training.</p>
<p>To address the above challenges, this study proposes a fine-grained recognition model based on discriminative region localization and efficient second-order feature encoding. First, considering the complexity of backgrounds and the small proportion of objects in images, this study designs a data augmentation method that ensures accurate localization of discriminative regions. This method addresses the inaccurate localization caused by complex backgrounds and small object proportions by using the weight matrix of the fully connected layer to rank the importance of each channel, with each channel of the feature map representing a portion of the image. By selecting channels with higher importance, the method localizes the discriminative regions more accurately. Second, an efficient second-order feature encoding module is designed. To reduce computational complexity and minimize training time, this module performs one round of feature extraction and feature downsampling based on a single CNN for second-order feature encoding. This alleviates the long training time caused by the high-dimensional second-order features present in conventional techniques. In brief, our contributions are summarized as follows.
<list list-type="bullet">
<list-item>
<p>We propose an importance ranking algorithm for localizing discriminative regions. This algorithm utilizes the weight matrix of the fully connected layer, where higher values correspond to channels of greater importance for fine-grained image recognition. Channels with higher importance represent image regions that are more advantageous for recognition. By selecting and combining these channels according to their importance, we generate an enhanced image that accurately localizes the discriminative regions essential for fine-grained recognition. This approach effectively addresses the challenge of localizing discriminative regions in images with complex backgrounds and small target object proportions.</p></list-item>
<list-item>
<p>We design a new, efficient second-order feature encoding module. This module reduces the dimensionality of the features extracted by the fourth convolutional group of ResNet-50 and calculates the covariance matrix between these reduced features and those extracted by the fifth convolutional group to generate second-order features. The fourth convolutional group, with its shallower layers and smaller receptive field, captures features rich in edges, textures, and color, directly reflecting the local structure of the image. In contrast, the fifth convolutional group, with its larger receptive field and deeper layers, aggregates information to produce features with richer semantic content. By employing a single CNN for both feature extraction and dimensionality reduction in one pass, this method significantly shortens training time and reduces computational complexity.</p></list-item>
</list></p>
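As a minimal sketch of the first contribution, the channel-importance idea can be illustrated as follows. This is a hypothetical reconstruction, not the authors' code: the scoring rule (mean absolute fully connected weight per channel), the function name `enhanced_map`, and all toy shapes are our assumptions.

```python
import numpy as np

# Hypothetical sketch: score each feature-map channel by the mean absolute
# weight assigned to it in the fully connected layer, then average the top-k
# channels into a single enhanced localization map.
def enhanced_map(features, fc_weights, k=4):
    # features: (C, H, W); fc_weights: (num_classes, C)
    importance = np.abs(fc_weights).mean(axis=0)   # (C,) per-channel importance
    topk = np.argsort(importance)[::-1][:k]        # indices of the k top channels
    return features[topk].mean(axis=0)             # (H, W) enhanced map

rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 14, 14))         # toy conv feature map
weights = rng.standard_normal((200, 512))          # e.g., 200 bird classes
m = enhanced_map(feats, weights, k=8)
print(m.shape)                                     # (14, 14)
```

Because the weight matrix is learned for classification, channels with large weights are those the classifier already relies on, which is what makes them good candidates for localization.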
<p>The rest of this paper is structured as follows: <xref ref-type="sec" rid="s2">Section 2</xref> reviews the related work; <xref ref-type="sec" rid="s3">Section 3</xref> presents the proposed method; <xref ref-type="sec" rid="s4">Section 4</xref> provides the quantitative and visual results, along with comparisons to related studies; and <xref ref-type="sec" rid="s5">Section 5</xref> summarizes the main contributions of this research and offers suggestions for future work.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Fine-grained recognition tasks are critical in various fields, including daily life, commerce, security, and nature conservation. Due to the inherent challenges of these tasks, they have been the focus of extensive research over the past few decades. This section provides a detailed review of related work, organized into three key aspects: network structure, data augmentation, and second-order feature encoding.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Network Structure</title>
<p>Single-channel and dual-channel CNNs are the two most common network models used in fine-grained recognition. A variation of the single-channel CNN, the dual-channel CNN consists of two CNN networks, or feature extractors. Using bilinear pooling operations [<xref ref-type="bibr" rid="ref-7">7</xref>], it extracts interaction information between various regions from the feature maps produced by the two networks. In 2015, Lin et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] proposed a novel dual-channel CNN model for extracting high-order features from images. This model multiplied the features outputted by two parallel networks, M-Net and D-Net, and pooled them to obtain an image descriptor, thereby enhancing feature representation capability. In 2017, Lin et al. used the pooling outputs of the dual-channel CNN for feature extraction and combined feature interactions under certain rules to obtain second-order features. However, experiments showed that the bilinear CNN required two rounds of feature extraction, resulting in lengthy training. Furthermore, the second-order features produced after feature interaction contained a significant amount of redundancy. The authors attempted to reduce the superfluous features without compromising recognition accuracy, but the results were unsatisfactory. Although the dual-channel CNN can capture interaction information between different regions in feature maps, it requires two rounds of feature extraction, resulting in prolonged training time. In this study, we apply dimensionality reduction and downsampling to the features outputted by the fourth convolutional group of ResNet-50, ensuring that the feature dimensions from the fourth and fifth convolutional groups are consistent. Then, we encode the features from the fourth and fifth convolutional groups using the efficient second-order feature encoding method proposed in this study. By employing a single CNN for one round of feature extraction, we effectively shorten training time. For further details, please refer to <xref ref-type="sec" rid="s3_2">Section 3.2</xref> in this paper.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Data Augmentation</title>
<p>Data augmentation is a method that takes original samples and applies a few adjustments to create new ones. Its goal is to increase the dataset&#x2019;s size to mitigate the overfitting issue brought on by a lack of training data. Tran et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] analyzed data augmentation strategies from the perspective of Bayesian methods, learning missing data from the distribution of the training set to perform data augmentation. However, this method does not specifically focus on discriminative regions. Hu et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] employed a single channel to guide the generation of augmented samples through cropping and erasing, treating discriminative and background regions equally. Cropping and erasing may unintentionally eliminate discriminative regions when the fraction of objects to be identified is low, making it impossible to increase the number of samples containing discriminative regions in a targeted manner. Chen et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] randomly sampled a batch and applied two types of data augmentation to each image in the batch, producing two views. They aimed to bring different views of the same image closer together in latent space and keep views of different images far apart. This method could not guarantee that augmented samples must contain discriminative regions since it did not distinguish between discriminative and background regions, even though restricting augmented samples with loss functions made them more rational and realistic. Current data augmentation methods in fine-grained recognition research increase sample numbers but primarily rely on simple geometric transformations. These methods treat both target object regions and background regions equally, without focusing on discriminative regions. 
As a result, some augmented samples may consist solely of background regions, generating ineffective samples and weakening the model&#x2019;s generalization ability. To address this issue, this study proposes an importance-ranking algorithm for localizing discriminative regions. This algorithm leverages the weight matrix of the fully connected layer to rank the importance of each channel in the feature map. Since each channel corresponds to a specific part of the image, selecting and combining channels with higher importance allows for precise localization of discriminative regions. This ensures that the augmented samples generated through data augmentation contain meaningful discriminative regions, thereby enhancing the effectiveness of the new samples.</p>
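The attention-guided cropping and erasing discussed above can be sketched as follows. This is an illustrative reconstruction under assumptions: the thresholding rule, the function name `attention_crop_erase`, and the toy attention map are not from the paper.

```python
import numpy as np

# Hypothetical sketch: given an enhanced map that highlights the discriminative
# region, crop the bounding box of high-activation pixels (attention cropping)
# or zero it out (attention erasing), so augmented samples are guaranteed to
# contain -- or deliberately hide -- the discriminative region.
def attention_crop_erase(image, att_map, thresh=0.5):
    # image: (H, W, 3); att_map: (H, W), larger values = more discriminative
    mask = att_map >= thresh * att_map.max()
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    cropped = image[y0:y1, x0:x1]                  # keep only the discriminative box
    erased = image.copy()
    erased[y0:y1, x0:x1] = 0                       # hide it for the erased view
    return cropped, erased

img = np.ones((64, 64, 3))
att = np.zeros((64, 64)); att[16:32, 20:40] = 1.0  # toy attention peak
crop, erase = attention_crop_erase(img, att)
print(crop.shape)                                  # (16, 20, 3)
```

Unlike random cropping or erasing, both outputs are tied to the localized region, so neither augmentation can produce a background-only sample.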
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Second-Order Feature Encoding</title>
<p>In fine-grained recognition tasks, the distinctions between object categories are often so slight that even convolutional features cannot reliably capture them. Several techniques for computing second-order features have emerged to capture the minor distinctions between features of different categories. Second-order feature encoding methods combine first-order features, for example by taking the outer product of features [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>]. Second-order features can capture interactions between features and express nonlinear relationships. Lin et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] extracted features from images using two separate CNNs. To acquire second-order features, they multiplied the feature values in distinct channels at the same spatial location in the two feature maps. These second-order features, which combine the features from the two CNN paths, improve the capacity to represent features. Gao et al. proposed a compact bilinear pooling method, which effectively reduces feature dimensions compared to conventional bilinear pooling methods, albeit with a slight decrease in performance. Kong and Fowlkes [<xref ref-type="bibr" rid="ref-14">14</xref>] introduced the LRBP model, which addresses the issue of excessively high dimensionality of fused second-order features and the need for large parameter quantities in linear classifiers through two successive approximation operations: low-rank approximation of parameter matrices and shared projection. Cui et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed the Kernel Pooling framework, which captures second-order information between features and is versatile. Paschali et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed the bilinear attention pooling method, which randomly selects a channel from the features outputted by the fourth convolutional group of ResNet-50 and multiplies it channel-wise with features from the fifth convolutional group to obtain second-order features [<xref ref-type="bibr" rid="ref-17">17</xref>]. Nevertheless, these second-order features could not always enhance discriminative region features because of the randomness of the selection process. Although second-order features can capture local structures and texture information, offering stronger feature representation and improved classification performance, traditional second-order feature encoding methods require two CNNs to extract first-order features from images and combine them. This leads to longer training time and high-dimensional fused features. Therefore, this study applies dimensionality reduction, downsampling, and filtering to the features outputted by the fourth convolutional group of ResNet-50 to generate an attention map. This attention map is then multiplied channel-wise and pixel-wise with the features from the fifth convolutional group to produce second-order features. By performing feature extraction only once and reducing the dimensionality and complexity of the features, we shorten training time and alleviate the issue of excessively high dimensionality in fused second-order features.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Method</title>
<p>In this section, we present a novel fine-grained recognition model, termed sfeModel, which is based on discriminative region localization and efficient second-order feature encoding. To localize discriminative regions, we propose an importance-ranking algorithm. Additionally, we design an efficient second-order feature encoding module to reduce the dimensionality of second-order features and shorten training time. The following subsections provide a detailed overview of the model from four aspects: Architecture, Efficient Second-Order Feature Encoding, Discriminative Region Localization, and Data Augmentation.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Architecture</title>
<p>To reduce training time and mitigate the difficulties caused by small object proportions and complex backgrounds in fine-grained images, sfeModel uses ResNet-50 as the backbone network and integrates a data augmentation module and an efficient second-order feature encoding module, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. The efficient second-order feature encoding module utilizes a single CNN for one round of feature extraction. It reduces the dimensionality of and downsamples the convolutional features outputted by the fourth convolutional group to generate an attention map, then performs element-wise multiplication between the attention map and the convolutional features of the fifth convolutional group at corresponding positions to obtain second-order features. Rather than treating background and discriminative regions equally, the data augmentation module uses the importance-ranking algorithm to localize discriminative regions, ensuring that augmented samples contain discriminative regions.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overall network architecture diagram</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-1.tif"/>
</fig>
<p>By combining the efficient second-order feature encoding module and the data augmentation module, sfeModel reduces training time and increases recognition accuracy, efficiently handling the obstacles presented by complex backgrounds and small object proportions in images.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Efficient Second-Order Feature Encoding</title>
<p>Previous studies have shown that second-order feature encoding can effectively capture local structural and textural information, thereby enhancing feature representation and classification performance. In contrast, traditional aggregation methods such as summation or averaging rely solely on first-order statistics. As illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, let the convolutional features outputted by the fourth convolutional group and the fifth convolutional group be denoted as <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, respectively. We perform global average pooling on <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, resulting in a <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi></mml:math></inline-formula> vector. After sorting this vector in descending order, we select the top <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>m</mml:mi></mml:math></inline-formula> values to form an attention map <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where each value corresponds to a channel in <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. 
Next, max pooling with a <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> kernel and stride 2 is applied to <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to downsample it, resulting in <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. At this point, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> have the same height and width. Finally, element-wise multiplication between <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the corresponding elements of <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is performed at each position to obtain the second-order features <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>F</mml:mi></mml:math></inline-formula>, as shown in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Diagram of second-order feature encoding process</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-2.tif"/>
</fig>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p>In this context, <italic>F</italic> represents second-order features, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the downsampled attention map after pooling, and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> represents the output features from the fifth convolutional group of ResNet-50. The fourth convolutional group&#x2019;s output features are used to create attention maps. The attention map&#x2019;s channels each represent a distinct feature of the object, such as the head of a bird, an airplane&#x2019;s wings, or an automobile&#x2019;s registration plate. The element-wise multiplication between the attention map <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the feature map <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is designed to efficiently capture second-order feature interactions. Unlike the full outer product in classical bilinear pooling, which computes interactions between all channel pairs, our method computes a selective interaction. 
The attention map <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, derived from the most salient channels of <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, acts as a condensed representation of discriminative shallow features (e.g., edges, textures). The operation <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>	 calculates, for each spatial location, the Hadamard product between this shallow feature representation and the deep semantic features of <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mn>5</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. This captures the local co-activation or covariance between these two distinct sets of features. The resulting feature map <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>F</mml:mi></mml:math></inline-formula> thus encodes how discriminative low-level patterns and high-level semantic concepts co-vary across the image, which is a fundamental characteristic of second-order statistics. This approach provides a powerful yet computationally efficient alternative to full bilinear pooling. The sfeModel only needs one round of feature extraction, saving computational costs and complexity as compared to bilinear networks. 
To mitigate the excessively high dimensionality of the fused second-order features, the attention maps generated from the fourth convolutional group&#x2019;s output are downsampled by pooling before fusion.</p>
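<p>As a minimal illustration (a sketch only; the shapes, the choice of 64 attention channels, and the broadcasting of a single attention channel are assumptions, not the authors&#x2019; exact configuration), the selective second-order interaction can be written as an average-pooled attention map multiplied element-wise with the stage-five features:</p>

```python
import numpy as np

# Assumed ResNet-50-like shapes: X4 (stage 4) is (1024, 28, 28),
# X5 (stage 5) is (2048, 14, 14).
rng = np.random.default_rng(0)
X4 = rng.standard_normal((1024, 28, 28))
X5 = rng.standard_normal((2048, 14, 14))

# Attention map from salient X4 channels (here: the first 64, for illustration),
# downsampled 28x28 -> 14x14 by 2x2 average pooling to match X5 spatially.
A = X4[:64]
A_pooled = A.reshape(64, 14, 2, 14, 2).mean(axis=(2, 4))  # (64, 14, 14)

# Selective interaction: Hadamard product of an attention channel with every
# X5 channel at each spatial location, instead of the full outer product of
# classical bilinear pooling.
F = A_pooled[0][None, :, :] * X5  # (2048, 14, 14)
print(F.shape)
```

Unlike the full outer product, which would pair every attention channel with every X5 channel, this per-location product needs only one round of feature extraction.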
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Discriminative Region Localization</title>
<p>Through the second-order feature encoding module described in <xref ref-type="sec" rid="s3_1">Section 3.1</xref>, we obtain a second-order feature map <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. Here, <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>P</mml:mi></mml:math></inline-formula> represents the number of channels. We then reduce the dimensionality of the second-order feature map <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>F</mml:mi></mml:math></inline-formula> by applying convolutional layers, aligning the number of channels in <italic>F</italic> with that of the output feature map of the fifth convolutional group of ResNet-50. 
Suppose the dimension-reduced feature map is denoted as <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2026;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, and C represents the number of channels in the dimension-reduced feature map. The dimension-reduced feature map <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is flattened into a one-dimensional vector <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>C</mml:mi><mml:mi>H</mml:mi><mml:mi>W</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. 
Here, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mi>H</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mi>W</mml:mi></mml:math></inline-formula> respectively represent height and width of the second-order feature map. However, due to the high dimensionality of those features, which leads to excessive parameter calculations and a waste of computational resources and memory, global average pooling is employed to further reduce the dimensionality of each channel of the dimension-reduced feature map. After dimension reduction, a feature vector <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, is obtained which is then normalized in terms of its magnitude. Finally, the normalized feature vector <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is fed into a classifier for classification.</p>
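<p>The pooling and normalization steps above admit a short sketch (the channel count is assumed, and the small constant guarding against division by zero is a hypothetical numerical detail):</p>

```python
import numpy as np

# Dimension-reduced second-order map F' with C channels (C assumed 2048 here).
rng = np.random.default_rng(1)
C, H, W = 2048, 14, 14
F_prime = rng.standard_normal((C, H, W))

# Global average pooling collapses each channel to one value: f in R^(C x 1 x 1).
f = F_prime.mean(axis=(1, 2))

# Magnitude (L2) normalization before feeding f to the classifier.
f = f / (np.linalg.norm(f) + 1e-12)
print(f.shape)
```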
<p>The classifier is implemented by fully connected layers and a softmax layer. The fully connected layers use a weight matrix to map the dimension-reduced feature vector <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> to probability scores for each class. More precisely, the probability score for each class is obtained as a weighted sum of the elements of the feature vector, so the weight matrix reflects the significance of each channel in the feature map for a particular class. Let <italic>N</italic> be the number of classes, and let the weight matrix of the fully connected layers be denoted as <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mi>w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. 
Each element <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the weight matrix represents the connection weight between the <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>i</mml:mi></mml:math></inline-formula>th element of the feature vector and the score for class <italic>j</italic>, where <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. Since each element of the feature vector represents the global information of the corresponding channel in the feature map <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>F</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> can be interpreted as the contribution of the <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>i</mml:mi></mml:math></inline-formula>th channel in the feature map <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>F</mml:mi></mml:math></inline-formula> to class <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>j</mml:mi></mml:math></inline-formula>.</p>
<p>As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, based on the weight matrix <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>w</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, the importance of each channel in the feature map can be determined for each class. During training, importance sorting is carried out using the category corresponding to the ground-truth label; during testing, it uses the category with the greatest probability score. The elements <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are sorted in descending order, where <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and <italic>j</italic> represents the selected class. The importance sorting algorithm selects the top <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>k</mml:mi></mml:math></inline-formula> values, where these <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>k</mml:mi></mml:math></inline-formula> values correspond to the most important channels and provide an approximate localization of the discriminative regions of the object. The remaining <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>C</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> channels are considered relatively less important. 
To ensure diversity in the cropped regions, <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>m</mml:mi></mml:math></inline-formula> (<inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>m</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula>) channels are randomly selected from these <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>C</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi></mml:math></inline-formula> channels. The weighted sum of these selected k channels and m channels is computed to generate the cropped enhancement image, as shown in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>. Furthermore, <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>t</mml:mi></mml:math></inline-formula> channels are randomly selected from the aforementioned <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>k</mml:mi></mml:math></inline-formula> channels, and the weighted sum of these <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>t</mml:mi></mml:math></inline-formula> channels is computed to generate the erased enhancement image, as shown in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>. Finally, normalization is performed on the cropped enhancement image and erased enhancement image according to <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>.</p>
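<p>Eqs. (2)&#x2013;(4) can be sketched as follows (the channel counts, class index, and spatial sizes are illustrative assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 32, 14, 14       # channels and spatial size (assumed small for clarity)
N, k, m, t = 10, 4, 2, 2   # classes; top-k, random-m, random-t channel counts

F = rng.standard_normal((C, H, W))  # dimension-reduced second-order feature map
Wfc = rng.standard_normal((C, N))   # fully connected weight matrix w in R^(C x N)

j = 3                                # ground-truth class (training) / argmax class (testing)
order = np.argsort(Wfc[:, j])[::-1]  # channels sorted by importance for class j
top_k = order[:k]
rand_m = rng.choice(order[k:], size=m, replace=False)  # diversity channels
rand_t = rng.choice(top_k, size=t, replace=False)      # channels to erase

# Eq. (2): cropped enhancement image = weighted sum over the k + m channels.
sel = np.concatenate([top_k, rand_m])
A_c = np.tensordot(Wfc[sel, j], F[sel], axes=1)        # (H, W)

# Eq. (3): erased enhancement image = weighted sum over the t channels.
A_d = np.tensordot(Wfc[rand_t, j], F[rand_t], axes=1)  # (H, W)

# Eq. (4): min-max normalization of either enhancement image.
def normalize(A):
    return (A - A.min()) / (A.max() - A.min())

A_c_norm, A_d_norm = normalize(A_c), normalize(A_d)
print(A_c_norm.shape, A_d_norm.shape)
```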
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Discriminative region localization maps</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-3.tif"/>
</fig>
<p><disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the cropped enhancement image, <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the weight value at position <inline-formula id="ieqn-62"><mml:math 
id="mml-ieqn-62"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in the weight matrix, <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>i</mml:mi></mml:math></inline-formula>-th feature channel, and <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the erased enhancement image. <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> denotes both the cropped enhancement image <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the erased enhancement image <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, while <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> refers to the normalized cropped enhancement image and erased enhancement image. Each feature channel can locate certain specific regions in the original image. Note that this method prevents the problem of overly small, cropped regions that arises from randomly choosing only one channel each time, while also ensuring sample diversity. 
The network cannot acquire useful information when the cropped region is too small, because such a region contains too few features. In this study, the number of selected feature channels is bounded by parameters, which prevents the cropped region from becoming excessively large; an overly large cropped region contains too many non-discriminative areas and defeats the purpose of excluding interference. The <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>k</mml:mi></mml:math></inline-formula> identified channels locate the discriminative regions of the object, so the subsequent selection of <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>t</mml:mi></mml:math></inline-formula> channels can pinpoint relatively important discriminative sub-regions. Erasing these sub-regions encourages the network to look for other discriminative features rather than depending on only one particular kind of discriminative feature. When a discriminative region in the original sample image is obscured, the network must rely on other discriminative features for recognition, which strengthens the model&#x2019;s robustness.</p>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Data Augmentation</title>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Region Cropping</title>
<p>As illustrated in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, region cropping selects discriminative regions from the original sample image, enlarges them to match the original image&#x2019;s size, and feeds the resulting regions into a CNN for training. The procedure described in <xref ref-type="sec" rid="s3_3">Section 3.3</xref> yields the cropped enhancement image, whose height and width are both less than those of the original image. To match the size of the original image, the cropped enhancement image is first upsampled. Then, the cropping mask is generated from <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>: pixels with values greater than the manually set cropping threshold <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> are set to 1, and all other pixels are set to 0, as shown in the following <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>:</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Flowchart of region cropping process</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-4.tif"/>
</fig>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mi>c</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the cropping mask at position <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the normalized value at the corresponding position of the cropped enhancement image. By determining a bounding box <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>B</mml:mi></mml:math></inline-formula> that covers the region where <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is positive, the area covered by <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>B</mml:mi></mml:math></inline-formula> in the original image is enlarged to serve as the input for the cropped enhancement, allowing clearer details and the extraction of finer-grained features. By reducing interference from other regions, region cropping enables the network to concentrate on obtaining features from discriminative regions. Following region cropping, the augmented samples retain the same class labels as the original samples.</p>
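<p>Under assumed sizes (a 448&#x00D7;448 image, a 14&#x00D7;14 enhancement map, and nearest-neighbor upsampling, all illustrative choices), Eq. (5) and the bounding box <italic>B</italic> can be sketched as:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.random((3, 448, 448))  # original sample, channels-first (assumed size)

# Normalized cropped enhancement image, upsampled 14x14 -> 448x448
# by nearest-neighbor repetition (an illustrative choice of upsampling).
A_c_norm = rng.random((14, 14))
A_c_up = np.repeat(np.repeat(A_c_norm, 32, axis=0), 32, axis=1)

# Eq. (5): cropping mask C(i, j) via the threshold theta_c.
theta_c = rng.uniform(0.2, 0.4)  # sampled as described in Section 4.1
mask = (A_c_up > theta_c)

# Bounding box B covering all positive mask entries; the enclosed area is
# cropped (and would then be enlarged to the input size for training).
ys, xs = np.nonzero(mask)
y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
crop = image[:, y0:y1, x0:x1]
print(mask.shape, crop.shape[0])
```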
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Region Erasure</title>
<p>Region erasure refers to randomly erasing parts of discriminative regions from the original image and using the erased image as an augmented sample for training, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. The dimensions of the erased enhancement image are less than those of the original image, and it is obtained by the procedure described in <xref ref-type="sec" rid="s3_3">Section 3.3</xref>. First, the erased enhancement image is upsampled to the same size as the original image. Then, an erasure mask is generated: pixels of the erased enhancement image <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> with values greater than the manually set erasure threshold <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> are set to 1, while all other pixels are set to 0, as shown in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>:</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Flowchart of region erasure process</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-5.tif"/>
</fig>
<p><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> represents the erasure mask at position <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msubsup><mml:mi>A</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the normalized value at the corresponding position of the erased enhancement image. By determining a bounding box <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mi>B</mml:mi></mml:math></inline-formula> covering the region where <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>D</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is positive, the pixels in the area enclosed by <italic>B</italic> in the original sample are set to 0, yielding an erased augmented sample. The erased augmented sample retains the same class labels as the original sample. Since images may contain multiple discriminative visual features, some features may be invisible in certain images due to occlusion or other reasons. A CNN that depends too heavily on a single visual feature for classification may generalize poorly. Consequently, in this study the CNN is encouraged to extract different discriminative visual features from the images via random partial erasure of discriminative regions, which improves the model&#x2019;s capacity for generalization. Furthermore, when the network misclassifies a sample, it may fail to identify the appropriate discriminative region; it can, however, identify the area most pertinent to the class the network outputs. Erasing these areas can therefore direct the network to look for regions that are pertinent to the correct class.</p>
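<p>A corresponding sketch of Eq. (6) and the erasure step (the image size and nearest-neighbor upsampling are assumptions mirroring the cropping sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
image = rng.random((3, 448, 448))  # original sample (assumed size)

# Normalized erased enhancement image, upsampled 14x14 -> 448x448.
A_d_norm = rng.random((14, 14))
A_d_up = np.repeat(np.repeat(A_d_norm, 32, axis=0), 32, axis=1)

# Eq. (6): erasure mask D(i, j) via the threshold theta_d.
theta_d = rng.uniform(0.2, 0.4)
D = (A_d_up > theta_d)

# Bounding box B over positive entries; pixels inside B are set to 0,
# forcing the network to rely on other discriminative regions.
ys, xs = np.nonzero(D)
y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
erased = image.copy()
erased[:, y0:y1, x0:x1] = 0.0
print(erased.shape)
```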
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<p>This section provides an analysis of the experimental details and results of the proposed solution. Through ablation studies and comparative experiments with state-of-the-art methods, we validate the effectiveness of our approach. Additionally, to visually illustrate the impact of different image regions on the outcome, we conducted visualization experiments. The following subsections will provide a detailed explanation of the dataset, evaluation metrics, experimental procedures, and results.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>This study conducts experiments on three datasets: CUB-200-2011, Stanford Cars, and FGVC Aircraft. For region cropping and region erasure, the threshold &#x03B8; was sampled uniformly at random from [0.2, 0.4].</p>
<p>Each of the three datasets provides part annotation points, bounding box annotations, and category labels for the objects to be classified. It is important to emphasize that in all our experiments, only the image data and category labels were used for both training and testing; no bounding boxes, part annotations, or other localization signals were utilized at any stage. This ensures a fair comparison with other weakly supervised methods and demonstrates our model&#x2019;s capability to autonomously locate discriminative regions.</p>
<p><xref ref-type="table" rid="table-1">Table 1</xref> displays the specifics of the three datasets. The CUB-200-2011 dataset comprises 11,788 images of 200 bird species, with a nearly 1:1 ratio between the training and test sets. Because birds usually have small bodies, the bird object occupies only a small portion of each image. Birds also frequently rest on elaborate tree branches or structures, producing varied and complicated backgrounds, and their appearance differs significantly across poses such as flying, standing, and spreading their wings. As a result, this dataset is generally acknowledged to be challenging for recognition tasks. The Stanford Cars dataset comprises 16,185 images of 196 car models, with a nearly 1:1 ratio between the training and test sets. The images are captured from multiple angles, leading to significant variations in the proportions of the objects across different images.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Basic information of the datasets</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Object</th>
<th>Number of classes</th>
<th>Number of training samples</th>
<th>Number of testing samples</th>
</tr>
</thead>
<tbody>
<tr>
<td>CUB-200-2011</td>
<td>Birds</td>
<td>200</td>
<td>5994</td>
<td>5794</td>
</tr>
<tr>
<td>Stanford Cars</td>
<td>Vehicle</td>
<td>196</td>
<td>8144</td>
<td>8041</td>
</tr>
<tr>
<td>FGVC Aircraft</td>
<td>Aircraft</td>
<td>100</td>
<td>6667</td>
<td>3333</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The FGVC Aircraft dataset comprises 10,000 aircraft images, organized into four levels of granularity: Manufacturer, Family, Variant, and Model. This study uses the Variant level, at which the 10,000 images are divided into 100 categories; the ratio of training to test samples is approximately 2:1.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluating Indicators</title>
<p>This study uses ACC (Accuracy) as the evaluation metric. ACC is calculated as the proportion of correctly predicted samples to the total number of testing samples, as shown in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>A</mml:mi><mml:mi>C</mml:mi><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mi>n</mml:mi></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the number of correctly predicted samples, and <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mi>n</mml:mi></mml:math></inline-formula> represents the total number of samples. If the predicted label of a sample matches its ground-truth label, the sample is counted as correctly predicted; otherwise, it is counted as incorrectly predicted.</p>
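<p>As a minimal sketch, Eq. (7) can be computed directly from arrays of predicted and ground-truth labels; the function name and NumPy usage below are illustrative, not part of the original implementation.</p>

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """ACC = n_correct / n, as in Eq. (7)."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    n_correct = int((pred_labels == true_labels).sum())
    return n_correct / len(true_labels)

# Example: 3 of 4 predictions match the ground truth
print(accuracy([1, 2, 2, 0], [1, 2, 3, 0]))  # 0.75
```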
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Implementation Details</title>
<p>All models were implemented and trained in PyTorch 1.0; the experiments ran on an i7-10700 CPU with an NVIDIA GeForce RTX 2070 GPU. ResNet-50, initialized with ImageNet pre-trained weights, served as the backbone network. During training, the momentum coefficient was set to 0.9, the weight decay to 1e&#x2212;5, and the initial learning rate to 1e&#x2212;3; the learning rate was multiplied by 0.9 every two epochs. The batch size was set to 10. For region cropping and region erasure, the thresholds <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> were sampled uniformly at random from the range <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mo stretchy="false">[</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> for each training sample. 
The other key hyperparameters for discriminative region localization were set as follows: the number of top channels <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>k</mml:mi></mml:math></inline-formula> was set to <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mn>0.3</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>C</mml:mi></mml:math></inline-formula>, the number of random channels <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>m</mml:mi></mml:math></inline-formula> was set to <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mn>0.2</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>C</mml:mi></mml:math></inline-formula>, and the number of channels to erase <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>t</mml:mi></mml:math></inline-formula> was set to <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mn>0.5</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>k</mml:mi></mml:math></inline-formula>, where <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>C</mml:mi></mml:math></inline-formula> is the number of channels in the dimension-reduced second-order feature map <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msup><mml:mi>F</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. These settings were consistent across all datasets.</p>
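<p>To make these hyperparameters concrete, the following sketch illustrates one plausible way to pick the top-k channels, m extra random channels, and t channels to erase. The mean-activation saliency proxy, the function name, and the array shapes are assumptions for illustration, not the paper&#x2019;s exact procedure.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def select_channels(feature_map, k_ratio=0.3, m_ratio=0.2, t_ratio=0.5,
                    theta_range=(0.2, 0.4)):
    """Sketch of the channel-selection step for region cropping/erasure.

    feature_map: array of shape (C, H, W), standing in for the
    dimension-reduced second-order map F'.
    Returns (crop_channels, erase_channels, theta_c, theta_d).
    """
    C = feature_map.shape[0]
    k = max(1, int(k_ratio * C))      # top-k most responsive channels
    m = max(1, int(m_ratio * C))      # m random extra channels for diversity
    t = max(1, int(t_ratio * k))      # t of the top-k channels to erase

    # Rank channels by mean activation (a simple saliency proxy; the
    # paper's actual ranking criterion may differ).
    scores = feature_map.reshape(C, -1).mean(axis=1)
    top_k = np.argsort(scores)[::-1][:k]
    rest = np.setdiff1d(np.arange(C), top_k)

    crop_channels = np.concatenate([top_k, rng.choice(rest, size=m, replace=False)])
    erase_channels = rng.choice(top_k, size=t, replace=False)

    # Mask-binarization thresholds, drawn per sample from [0.2, 0.4].
    theta_c, theta_d = rng.uniform(*theta_range, size=2)
    return crop_channels, erase_channels, theta_c, theta_d
```

With C = 2048 channels in the reduced map, this yields k &#x2248; 614, m &#x2248; 409, and t &#x2248; 307 under the settings above.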
<p>The loss function is the cross-entropy loss, as shown in <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>. Although the proposed augmentation introduces little noise, the augmented samples are still assigned small weights, which were chosen through a series of comparative experiments. In <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>, setting both <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to 0.15 yields the highest validation accuracy.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>here, <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represent the original sample, cropped augmented sample, and erased augmented sample, respectively, while <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>y</mml:mi></mml:math></inline-formula> represents the label of the sample. 
<inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denote the loss values of the original sample, the cropped augmented sample, and the erased augmented sample, respectively, and <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the total loss of the network. <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi>f</mml:mi></mml:math></inline-formula> denotes the feature vector whose modulus is normalized in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> below.</p>
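<p>A minimal sketch of the weighted loss in Eq. (8), written with a plain NumPy cross-entropy so the example is self-contained (the actual model is trained in PyTorch):</p>

```python
import numpy as np

def cross_entropy(logits, y):
    """Mean cross-entropy over a batch of logits of shape (N, num_classes)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def total_loss(logits_raw, logits_crop, logits_erase, y, beta1=0.15, beta2=0.15):
    """Eq. (8): L = L(x_raw, y) + beta1 * L(x_c, y) + beta2 * L(x_d, y)."""
    return (cross_entropy(logits_raw, y)
            + beta1 * cross_entropy(logits_crop, y)
            + beta2 * cross_entropy(logits_erase, y))
```

The small weights beta1 = beta2 = 0.15 keep the augmented samples from dominating the gradient of the original samples.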
<p>Since normalizing the feature to unit modulus reduces the value of the cross-entropy loss and slows down the convergence of the model, this study rescales the normalized feature by a hyperparameter s, so that its modulus equals s. Here, s is set to 100, as shown in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>.
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mover><mml:mi>f</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>s</mml:mi><mml:mi>f</mml:mi></mml:mrow><mml:msub><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mi>f</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mfrac></mml:math></disp-formula></p>
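<p>Eq. (9) can be sketched in a few lines; the example vector is illustrative:</p>

```python
import numpy as np

def scaled_l2_normalize(f, s=100.0):
    """Eq. (9): f_tilde = s * f / ||f||_2, rescaling the feature modulus to s."""
    return s * f / np.linalg.norm(f)

f = np.array([3.0, 4.0])            # ||f||_2 = 5
f_tilde = scaled_l2_normalize(f)
print(np.linalg.norm(f_tilde))      # 100.0
```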
<p>To reduce random error, the model is trained for 80 epochs on each dataset and tested five times; the average of the five test results is reported.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Experimental Results</title>
<sec id="s4_4_1">
<label>4.4.1</label>
<title>Ablation Study</title>
<p>We first conducted a qualitative analysis on the Stanford Cars dataset. <xref ref-type="fig" rid="fig-6">Fig. 6a</xref>,<xref ref-type="fig" rid="fig-6">b</xref> depict the training and validation accuracy and loss curves as training proceeds. In <xref ref-type="fig" rid="fig-6">Fig. 6a</xref>, the vertical axis denotes accuracy; in <xref ref-type="fig" rid="fig-6">Fig. 6b</xref>, it denotes the loss value; the horizontal axis in both cases indicates the number of epochs. The validation curves closely track the training curves, indicating stable convergence and good generalization without severe overfitting.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Diagram of the proposed model&#x2019;s training process. (<bold>a</bold>) Training and validation accuracy. (<bold>b</bold>) Training and validation loss</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-6.tif"/>
</fig>
<p>Over the course of training, the loss progressively approaches 0 and the accuracy gradually increases toward 1. As the figures show, the model converges rapidly in the first twenty epochs, flattens in the following epochs, and finally stabilizes. This indicates that the proposed model trains stably.</p>
<p>To demonstrate the effectiveness of the proposed method, we retrained the model on the same training set after removing the second-order feature encoding module and the data augmentation module, again for 80 epochs. <xref ref-type="fig" rid="fig-7">Fig. 7a</xref>,<xref ref-type="fig" rid="fig-7">b</xref> show the resulting training and validation accuracy and loss curves as epochs change. During training, the model&#x2019;s accuracy gradually approaches 0.94 and its loss gradually approaches 0.15. Comparing <xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <xref ref-type="fig" rid="fig-7">7</xref>, although the ablated model also stabilizes after about 20 epochs, its accuracy drops by roughly 5 percentage points and its final loss rises by about 0.15 in the absence of second-order feature encoding and data augmentation. This demonstrates the effectiveness of the proposed second-order feature encoding and data augmentation modules.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Diagram of the model&#x2019;s training process without the second-order feature encoding and data augmentation modules. (<bold>a</bold>) Training and validation accuracy. (<bold>b</bold>) Training and validation loss</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-7.tif"/>
</fig>
<p>Next, quantitative experiments were conducted on the CUB-200-2011 dataset. The first and second rows of <xref ref-type="table" rid="table-2">Table 2</xref> report the accuracy of ResNet-50 alone and of ResNet-50 with the second-order feature encoding module. Adding the module raises accuracy from 83.5% to 86.1%, a 2.6-percentage-point improvement that demonstrates the efficacy of the second-order encoding module. The third row of <xref ref-type="table" rid="table-2">Table 2</xref>, 1.7 percentage points above the second, illustrates the effectiveness of the region cropping strategy; the fourth row, 1.2 percentage points above the second, demonstrates the efficacy of the region erasure method. Comparatively, region cropping outperforms region erasure for fine-grained recognition. The third and sixth rows of <xref ref-type="table" rid="table-2">Table 2</xref> report the accuracy of the region cropping approach proposed in this study and of random region cropping, respectively. Random region cropping improves accuracy only slightly, by 0.6 percentage points, whereas the proposed region cropping strategy outperforms it by 1.1 percentage points, further demonstrating its efficacy. Similarly, comparing the fifth and seventh rows shows that the proposed region erasure method outperforms random region erasure.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Accuracy of proposed methods and their combinations on the CUB-200-2011 dataset</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Second-order feature encoding</th>
<th>Region cropping</th>
<th>Region erasure</th>
<th>Random region cropping</th>
<th>Random region erasure</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>83.5</td>
</tr>
<tr>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>86.1</td>
</tr>
<tr>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td>87.8</td>
</tr>
<tr>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>87.3</td>
</tr>
<tr>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>88.9</td>
</tr>
<tr>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>86.7</td>
</tr>
<tr>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>&#x221A;</td>
<td>86.5</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_4_2">
<label>4.4.2</label>
<title>Comparative Experiment</title>
<p>This section presents comparative experiments on the three datasets described in Section 4.1, with the results shown in <xref ref-type="table" rid="table-3">Table 3</xref>. The last row of <xref ref-type="table" rid="table-3">Table 3</xref> reports the accuracy of the proposed model on the three datasets. The accuracy of the other models is taken from their original papers; bold numbers indicate the highest accuracy achieved on each dataset, and &#x201C;&#x2013;&#x201D; denotes that the original paper did not report experiments on the corresponding dataset. CUB-200-2011 is a fine-grained bird classification dataset jointly created by the California Institute of Technology and UCSD, containing 11,788 images and widely used in computer vision tasks such as image classification and object detection. The Stanford Cars dataset is a fine-grained classification dataset containing images of 196 car models, primarily used for image classification. FGVC Aircraft is a fine-grained classification dataset focused on aircraft model recognition, containing 10,000 high-definition aircraft images.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comparison of accuracy of different models on three datasets (%)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model/Dataset</th>
<th>CUB-200-2011</th>
<th>Stanford cars</th>
<th>FGVC aircraft</th>
</tr>
</thead>
<tbody>
<tr>
<td>ST-CNN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>84.1</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SDAN [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>84.7</td>
<td>90.2</td>
<td>87.1</td>
</tr>
<tr>
<td>RA-CNN [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>85.4</td>
<td>92.5</td>
<td>88.4</td>
</tr>
<tr>
<td>Bilinear-CNN [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>84.1</td>
<td>91.3</td>
<td>84.1</td>
</tr>
<tr>
<td>FASNet [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>86.9</td>
<td>92.6</td>
<td>90.7</td>
</tr>
<tr>
<td>NTS-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>87.5</td>
<td>93.9</td>
<td>91.4</td>
</tr>
<tr>
<td>DFL-CNN [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>87.4</td>
<td>93.8</td>
<td>92.0</td>
</tr>
<tr>
<td>WS-DAN [<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td>89.4</td>
<td>94.5</td>
<td>93.0</td>
</tr>
<tr>
<td>WS-CPM [<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td>90.1</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Alignment-segmentation [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>82.8</td>
<td>92.8</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Bubbles Game [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td><bold>90.2</bold></td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td><bold>Ours</bold></td>
<td>88.9</td>
<td><bold>94.7</bold></td>
<td><bold>93.3</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The techniques used by the fine-grained recognition models in <xref ref-type="table" rid="table-3">Table 3</xref> can be grouped into six categories: (1) attention mechanisms; (2) discriminative region localization; (3) high-order features; (4) multi-region feature fusion; (5) loss functions; and (6) data augmentation. The specific details are shown in <xref ref-type="table" rid="table-4">Table 4</xref>. Analysis of <xref ref-type="table" rid="table-3">Tables 3</xref> and <xref ref-type="table" rid="table-4">4</xref> leads to the following conclusions.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Techniques used by different models</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model/Technique</th>
<th>Attention mechanism</th>
<th>Discriminative region localization</th>
<th>High-order features</th>
<th>Multi-region feature fusion</th>
<th>Loss function</th>
<th>Data augmentation</th>
</tr>
</thead>
<tbody>
<tr>
<td>ST-CNN [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SDAN [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RA-CNN [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bilinear-CNN [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td></td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>FASNet [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>NTS-Net [<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
</tr>
<tr>
<td>DFL-CNN [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>WS-DAN [<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
<td>&#x221A;</td>
</tr>
<tr>
<td>WS-CPM [<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td>&#x221A;</td>
</tr>
<tr>
<td>Alignment-segmentation [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bubbles Game [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td></td>
<td></td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>&#x221A;</td>
</tr>
<tr>
<td><bold>Ours</bold></td>
<td></td>
<td>&#x221A;</td>
<td>&#x221A;</td>
<td></td>
<td></td>
<td>&#x221A;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>(a) Compared with the WS-DAN and WS-CPM models, the proposed model achieves lower accuracy on the CUB-200-2011 dataset. Images in this dataset mostly feature highly detailed backgrounds of branches, flowers, and grass, with the bird occupying a comparatively small portion of the image, and bird appearance varies greatly across poses. The WS-CPM model detects discriminative local regions, extracts features with part feature extractors, fuses features from multiple regions, and optimizes the loss function. The WS-DAN model combines center loss and cross-entropy loss as its loss function and uses bilinear attention pooling to extract high-order features of different parts. This suggests that when the objects of interest occupy a small fraction of the image, the background is complex, and intra-class variance is high, local feature extraction and well-designed loss functions can greatly increase recognition accuracy. (b) Compared with the other models listed in <xref ref-type="table" rid="table-4">Table 4</xref>, the proposed model achieves the highest accuracy on the Stanford Cars and FGVC Aircraft datasets. Although the proposed model does not use attention mechanisms, CNNs can localize discriminative regions by nature. In addition, the proposed model ranks each channel, which both localizes discriminative object regions during region cropping and preserves diversity among the cropped augmented samples. Erasing the most discriminative regions during region erasure pushes the network to search for additional discriminative features, improving the model&#x2019;s robustness and generalization capacity and yielding more effective data augmentation. (c) WS-DAN, WS-CPM, and the proposed model all employ data augmentation, and these three models attain higher accuracy than the other models in <xref ref-type="table" rid="table-3">Table 3</xref>. This shows that for fine-grained recognition problems with limited training data, data augmentation is a very effective approach: augmenting the data to increase the number of samples leads to higher recognition accuracy.</p>

</sec>
<sec id="s4_4_3">
<label>4.4.3</label>
<title>Visualization Experiment</title>
<p>To provide an intuitive comparison of how different models allocate their attention, we visualized the class activation maps for both the baseline model and our proposed model. <xref ref-type="fig" rid="fig-8">Fig. 8</xref> presents these visualization results. The first row shows the original input images. The second and third rows display the attention heatmaps generated by the baseline model and our proposed model, respectively. In the heatmaps, red regions indicate higher attention weight, while blue regions indicate lower attention. The comparison clearly shows that our model more accurately localizes the discriminative regions of the objects (e.g., the bird&#x2019;s head and body) while paying less attention to the background, thereby reducing its deceptive influence.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Visualization results</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_72626-fig-8.tif"/>
</fig>
</sec>
<sec id="s4_4_4">
<label>4.4.4</label>
<title>Hyperparameter Sensitivity Analysis</title>
<p>To evaluate the robustness of our model, we conducted a sensitivity analysis on the CUB-200-2011 dataset, varying the key hyperparameters: <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>k</mml:mi></mml:math></inline-formula> (the number of top channels for discriminative region localization), <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>m</mml:mi></mml:math></inline-formula> (the number of channels randomly selected from the remaining C&#x2212;k channels to ensure cropping diversity), <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mi>t</mml:mi></mml:math></inline-formula> (the number of channels randomly selected from the top k channels for region erasure), and the threshold range for <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (the thresholds for generating the cropping and erasure masks). The baseline accuracy is 88.9%. The results, summarized in <xref ref-type="table" rid="table-5">Table 5</xref> (where Hyperparameter indicates the tested variable, Value/Variation the specific setting, and Accuracy the resulting performance), show that the model&#x2019;s performance remains stable under moderate variations of these parameters, indicating that our method is not overly sensitive to their exact settings.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Sensitivity analysis of hyperparameters on the CUB-200-2011 dataset</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Hyperparameter</th>
<th>Value/Variation</th>
<th>Accuracy (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td><inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0.3</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>0.2</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn><mml:mspace width="thinmathspace" /><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>88.9</td>
</tr>
<tr>
<td rowspan="5">Vary <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>k</mml:mi></mml:math></inline-formula></td>
<td>0.2 <italic>C</italic></td>
<td>88.3 (&#x2212;0.6)</td>
</tr>
<tr>
  
<td>0.25 <italic>C</italic></td>
<td>88.7 (&#x2212;0.2)</td>
</tr>
<tr>

<td><bold>0.3 <italic>C</italic></bold></td>
<td><bold>88.9</bold></td>
</tr>
<tr>

<td>0.35 <italic>C</italic></td>
<td>88.5 (&#x2212;0.4)</td>
</tr>
<tr>

<td>0.4 <italic>C</italic></td>
<td>87.9 (&#x2212;1.0)</td>
</tr>
<tr>
<td rowspan="3">Vary <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>m</mml:mi></mml:math></inline-formula></td>
<td>0.1 <italic>C</italic></td>
<td>88.6 (&#x2212;0.3)</td>
</tr>
<tr>

<td><bold>0.2 <italic>C</italic></bold></td>
<td><bold>88.9</bold></td>
</tr>
<tr>

<td>0.3 <italic>C</italic></td>
<td>88.4 (&#x2212;0.5)</td>
</tr>
<tr>
<td rowspan="3">Vary <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mi>t</mml:mi></mml:math></inline-formula></td>
<td>0.3 <italic>k</italic></td>
<td>88.5 (&#x2212;0.4)</td>
</tr>
<tr>

<td><bold>0.5 <italic>k</italic></bold></td>
<td><bold>88.9</bold></td>
</tr>
<tr>

<td>0.7 <italic>k</italic></td>
<td>87.8 (&#x2212;1.1)</td>
</tr>
<tr>
<td rowspan="3">Vary <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula></td>
<td>Fixed <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.3</mml:mn></mml:math></inline-formula></td>
<td>88.7 (&#x2212;0.2)</td>
</tr>
<tr>

<td>Fixed <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.35</mml:mn></mml:math></inline-formula></td>
<td>88.8 (&#x2212;0.1)</td>
</tr>
<tr>

<td>Random in <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mo stretchy="false">[</mml:mo><mml:mn>0.2</mml:mn><mml:mo>,</mml:mo><mml:mn>0.4</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td><bold>88.9</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_4_5">
<label>4.4.5</label>
<title>Efficiency Analysis</title>
<p>A key motivation of our work is to achieve high recognition accuracy at a significantly lower computational cost and training time than dual-stream Bilinear CNNs. To verify this quantitatively, we compared the proposed model with several baselines on six key indicators. GFLOPs (giga floating-point operations): the number of floating-point operations, in billions, required for a single forward pass, measuring computational complexity. Params (M): the number of learnable parameters, in millions, reflecting model complexity and storage requirements. Peak Mem (GB): the maximum memory usage during model runtime. Train Time/Epoch (s): the average time required to traverse the training set once, measuring training efficiency. Infer Time/Image (ms): the average time required to predict a single image, measuring inference speed. Acc. (%): the classification accuracy on the test set, the core performance indicator. We selected three baseline models: (i) a plain ResNet-50, representing a standard first-order model; (ii) a Bilinear CNN (B-CNN) [<xref ref-type="bibr" rid="ref-8">8</xref>] built from two ResNet-50 backbones, representing a classic second-order approach; and (iii) a lightweight Vision Transformer, DeiT-Tiny [<xref ref-type="bibr" rid="ref-27">27</xref>], as a modern baseline. All models were evaluated under the same hardware and software configuration (NVIDIA GeForce RTX 2070 GPU, PyTorch 1.0) on the CUB-200-2011 dataset with an input size of 224 &#x00D7; 224. The results are summarized in <xref ref-type="table" rid="table-6">Table 6</xref>.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Efficiency comparison of different models</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Model</th>
<th>GFLOPs</th>
<th>Params (M)</th>
<th>Peak mem (GB)</th>
<th>Train Time/Epoch (s)</th>
<th>Infer Time/Image (ms)</th>
<th>Acc. (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ResNet-50 [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>4.1</td>
<td>25.6</td>
<td>2.1</td>
<td>125</td>
<td>15.2</td>
<td>83.5</td>
</tr>
<tr>
<td>B-CNN [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>8.2</td>
<td>51.2</td>
<td>3.8</td>
<td>230</td>
<td>28.5</td>
<td>84.1</td>
</tr>
<tr>
<td>DeiT-Tiny [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td>1.3</td>
<td>5.7</td>
<td>1.5</td>
<td>110</td>
<td>18.1</td>
<td>82.0</td>
</tr>
<tr>
<td><bold>Ours</bold></td>
<td><bold>4.3</bold></td>
<td><bold>26.1</bold></td>
<td><bold>2.3</bold></td>
<td><bold>135</bold></td>
<td><bold>16.8</bold></td>
<td><bold>88.9</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
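<p>As a rough sanity check on the GFLOPs and Params indicators above, the counts for a convolutional layer can be derived in closed form. The following minimal sketch uses illustrative layer sizes, not the exact sfeModel configuration:</p>

```python
# Back-of-envelope parameter and FLOP counts for a conv layer, the
# dominant building block behind Table 6. Layer sizes are hypothetical.

def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k convolution: c_out * (c_in*k*k + bias)."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv2d_flops(c_in, c_out, k, h_out, w_out, bias=True):
    """Floating-point operations for one forward pass (each MAC counted as 2 ops)."""
    macs = c_out * h_out * w_out * c_in * k * k
    return 2 * macs + (c_out * h_out * w_out if bias else 0)

# Example: a 3x3 conv mapping 256 -> 512 channels on a 14x14 output map.
p = conv2d_params(256, 512, 3)
f = conv2d_flops(256, 512, 3, 14, 14)
print(f"params: {p/1e6:.2f} M, FLOPs: {f/1e9:.2f} G")  # ~1.18 M, ~0.46 G
```

<p>Summing such per-layer counts over a backbone reproduces figures on the scale of Table 6 (e.g., roughly 4 GFLOPs and 26 M parameters for a ResNet-50-sized network at 224 &#x00D7; 224 input).</p>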
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>This study proposes a fine-grained recognition model, sfeModel, based on discriminative region localization and efficient second-order feature encoding. By selecting channels with higher importance, the model accurately localizes discriminative regions, ensuring that the augmented samples contain critical discriminative information. The efficient second-order feature encoding module requires only a single round of feature extraction, followed by feature dimension reduction and downsampling, which effectively reduces training time. Experimental results demonstrate that the model generates fewer ineffective augmented samples, leading to improved recognition accuracy.</p>
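<p>The encoding pipeline summarized above (one extraction pass, then channel reduction and spatial downsampling before the second-order statistic) can be sketched as follows. The shapes, the reduction matrix, and the signed square-root step are illustrative assumptions, not the exact sfeModel layers:</p>

```python
import numpy as np

def efficient_second_order(fmap, reduce_w, stride=2):
    """fmap: (H, W, C) feature map; reduce_w: (C, D) 1x1-conv-style reduction."""
    small = fmap[::stride, ::stride]                      # spatial downsampling
    x = small.reshape(-1, small.shape[-1]) @ reduce_w     # (H'W', D) reduced features
    y = x.T @ x / x.shape[0]                              # (D, D) second-order encoding
    return np.sign(y) * np.sqrt(np.abs(y))                # signed sqrt normalization

rng = np.random.default_rng(1)
fm = rng.normal(size=(14, 14, 512))                 # single-pass backbone output
W = rng.normal(size=(512, 64)) / np.sqrt(512)       # channel reduction 512 -> 64
enc = efficient_second_order(fm, W)
print(enc.shape)   # (64, 64)
```

<p>Reducing channels before the outer product shrinks the encoded descriptor from <italic>C</italic> &#x00D7; <italic>C</italic> to <italic>D</italic> &#x00D7; <italic>D</italic>, which is the source of the training-time savings over dual-stream bilinear pooling.</p>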
<p>Based on the current research, we identify the following future directions, grounded in the present work&#x2019;s limitations: the dependency of region localization on the initial weight matrix and the computational burden of second-order features. (1) Advanced discriminative region localization: future work should move beyond reliance on the classifier&#x2019;s weight matrix for region discovery [<xref ref-type="bibr" rid="ref-23">23</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>]. One promising direction is an attention-based decoupling module that explicitly models the relationship between the weight matrix and feature channels, potentially through a lightweight Transformer architecture. Such a module could autonomously generate attention maps more closely aligned with discriminative parts, ultimately producing higher-quality augmented samples and improving localization accuracy without manual annotation. (2) Long-tail recognition enhancement: addressing class imbalance requires dedicated strategies for long-tail recognition [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>]. A two-stage training strategy could first learn general features from all data and then run a balanced meta-learning phase that simulates few-shot scenarios for tail classes. Tail-class-focused data augmentation should also be investigated, leveraging discriminative regions to strategically oversample and enhance tail-class images. These approaches aim to significantly improve recognition accuracy for underrepresented classes. (3) Efficient second-order feature encoding: to reduce computational complexity, future research should explore sparse second-order encoding schemes [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>]. A potential solution is a gating network that assigns confidence scores to local features so that only the top-k most informative features participate in second-order computations. Combined with low-rank approximation techniques, this could substantially reduce FLOPs and memory usage while preserving the performance benefits of second-order statistics. (4) In-depth evaluation under long-tail settings: a comprehensive investigation of model performance on rare vs. frequent classes, using metrics such as macro-F1 and per-class accuracy, constitutes an independent and significant research direction, and we have already initiated research dedicated to the long-tail challenge.</p>
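<p>The gated top-k selection proposed in direction (3) can be illustrated with a minimal sketch; the linear gate and all names here are hypothetical, not a finalized design:</p>

```python
import numpy as np

# Sketch of gated top-k sparse second-order encoding: a scoring ("gating")
# function assigns a confidence to each local feature, and only the top-k
# features enter the outer-product statistic. The linear gate is an
# illustrative assumption.

def topk_second_order(features, gate_w, k):
    """features: (N, C) local descriptors; gate_w: (C,) gate weights."""
    scores = features @ gate_w          # confidence score per local feature
    keep = np.argsort(scores)[-k:]      # indices of the k highest scores
    sel = features[keep]                # (k, C) selected descriptors
    return sel.T @ sel / k              # (C, C) second-order statistic

rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 8))   # e.g. a 7x7 map with 8 channels, flattened
gate = rng.normal(size=8)
S = topk_second_order(feats, gate, k=10)
print(S.shape)   # (8, 8)
```

<p>Because only k of the N local features enter the outer product, the cost of the second-order step drops from O(NC&#x00B2;) to O(kC&#x00B2;), which is the intended FLOP and memory saving.</p>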
</sec>
</body>
<back>
<ack>
<p>We are grateful to Nanjing Tech University and Nanjing University of Information Science and Technology for providing the study environment and computing equipment.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This study was supported, in part, by the National Natural Science Foundation of China under Grants 62272236, 62376128, and 62306139, and in part by the Natural Science Foundation of Jiangsu Province under Grants BK20201136 and BK20191401.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Conceptualization, Yingying Wang, Wei Sun, Shiyu Zhou, Haoming Zhang and Pengpai Wang; methodology, Yingying Wang; software, Yingying Wang; validation, Xiaorui Zhang; formal analysis, Yingying Wang; resources, Wei Sun and Shiyu Zhou; data curation, Yingying Wang; writing&#x2014;original draft preparation, Yingying Wang; writing&#x2014;review and editing, Yingying Wang, Xiaorui Zhang, Wei Sun, Shiyu Zhou, Haoming Zhang and Pengpai Wang; visualization, Yingying Wang; supervision, Xiaorui Zhang, Wei Sun, Shiyu Zhou, Haoming Zhang and Pengpai Wang; project administration, Xiaorui Zhang and Yingying Wang; funding acquisition, Xiaorui Zhang. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available from the corresponding author, [Xiaorui Zhang], upon reasonable request.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>CC</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kong</surname> <given-names>JL</given-names></string-name></person-group>. <article-title>A fine-grained recognition neural network with high-order feature maps via graph-based embedding for natural bird diversity conservation</article-title>. <source>Int J Environ Res Public Health</source>. <year>2023</year>;<volume>20</volume>(<issue>6</issue>):<fpage>4294</fpage>&#x2013;<lpage>314</lpage>. doi:<pub-id pub-id-type="doi">10.3390/ijerph20064924</pub-id>; <pub-id pub-id-type="pmid">36981832</pub-id></mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Part-stacked CNN for fine-grained visual categorization</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>; 2016 Jun 27&#x2013;30; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>1173</fpage>&#x2013;<lpage>82</lpage>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shroff</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Focus longer to see better: recursively refined attention for fine-grained image classification</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020 Jun 13&#x2013;19</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>; <year>2020</year>. p. <fpage>868</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>W</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>XR</given-names></string-name>, <string-name><surname>He</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name></person-group>. <article-title>TBE-Net: a three-branch embedding network with part-aware ability and feature complementary learning for vehicle re-identification</article-title>. <source>IEEE Trans Intell Transp Syst</source>. <year>2021</year>;<volume>23</volume>(<issue>9</issue>):<fpage>14557</fpage>&#x2013;<lpage>69</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tits.2021.3130403</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Dynamic semantic structure distillation for low-resolution fine-grained recognition</article-title>. <source>Pattern Recognit</source>. <year>2024</year>;<volume>148</volume>:<fpage>110216</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2023.110216</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>He</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>ResMaster: mastering high-resolution image generation via structural and fine-grained guidance</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence; 2025 Feb 25&#x2013;Mar 4</conf-name>; <publisher-loc>Philadelphia, PA, USA. Palo Alto, CA, USA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>; <year>2025</year>. p. <fpage>6887</fpage>&#x2013;<lpage>95</lpage>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>P</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Elhanashi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Saponara</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Fine-grained modulation classification using multi-scale radio transformer with dual-channel representation</article-title>. <source>IEEE Commun Lett</source>. <year>2022</year>;<volume>26</volume>(<issue>6</issue>):<fpage>1298</fpage>&#x2013;<lpage>302</lpage>. doi:<pub-id pub-id-type="doi">10.1109/lcomm.2022.3145647</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>TY</given-names></string-name>, <string-name><surname>RoyChowdhury</surname> <given-names>A</given-names></string-name>, <string-name><surname>Maji</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Bilinear CNN models for fine-grained visual recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7&#x2013;13</conf-name>; <publisher-loc>Santiago, Chile. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2015</year>. p. <fpage>1449</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tran</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pham</surname> <given-names>T</given-names></string-name>, <string-name><surname>Carneiro</surname> <given-names>G</given-names></string-name>, <string-name><surname>Palmer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Reid</surname> <given-names>I</given-names></string-name></person-group>. <article-title>A Bayesian data augmentation approach for learning deep models</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2017</year>;<volume>30</volume>:<fpage>1</fpage>&#x2013;<lpage>10</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Qi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification</article-title>. <comment>arXiv:1901.09891. 2019</comment>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Kornblith</surname> <given-names>S</given-names></string-name>, <string-name><surname>Norouzi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hinton</surname> <given-names>G</given-names></string-name></person-group>. <article-title>A simple framework for contrastive learning of visual representations</article-title>. In: <conf-name>Proceedings of the 37th International Conference on Machine Learning (PMLR); 2020 Jul 13&#x2013;18</conf-name>; <publisher-loc>Vienna, Austria.</publisher-loc> p. <fpage>1597</fpage>&#x2013;<lpage>607</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>P</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Is second-order information helpful for large-scale visual recognition?</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Computer Vision; 2017 Oct 22&#x2013;29</conf-name>; <publisher-loc>Venice, Italy. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>2070</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sikdar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kedarisetty</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Interweaving insights: high-order feature interaction for fine-grained visual recognition</article-title>. <source>Int J Comput Vis</source>. <year>2025</year>;<volume>133</volume>(<issue>4</issue>):<fpage>1755</fpage>&#x2013;<lpage>79</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11263-024-02260-y</pub-id>; <pub-id pub-id-type="pmid">40160952</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kong</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fowlkes</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Low-rank bilinear pooling for fine-grained classification</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>365</fpage>&#x2013;<lpage>74</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cui</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Kernel pooling for convolutional neural networks</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>2921</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Paschali</surname> <given-names>M</given-names></string-name>, <string-name><surname>Simson</surname> <given-names>W</given-names></string-name>, <string-name><surname>Roy</surname> <given-names>AG</given-names></string-name>, <string-name><surname>Naeem</surname> <given-names>MF</given-names></string-name>, <string-name><surname>G&#x00F6;bl</surname> <given-names>R</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Data augmentation with manifold exploring geometric transformations for increased performance and robustness</article-title>. <comment>arXiv:1901.04420. 2019</comment>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>C</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Learning attention-guided pyramidal features for few-shot fine-grained recognition</article-title>. <source>Pattern Recognit</source>. <year>2022</year>;<volume>130</volume>:<fpage>108792</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2022.108792</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>J</given-names></string-name>, <string-name><surname>See</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Learning Gabor texture features for fine-grained recognition</article-title>. <comment>arXiv:1603.06765. 2023</comment>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Fu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mei</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2017</year>. p. <fpage>4438</fpage>&#x2013;<lpage>46</lpage>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Pu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>C</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Fine-grained recognition with learnable semantic data augmentation</article-title>. In: <conf-name>Proceedings of the 6th International Conference on Control and Computer Vision (ICCCV); 2024 Jun 13&#x2013;15</conf-name>; <publisher-loc>Tianjin, China. Tianjin, China</publisher-loc>: <publisher-name>Tianjin University of Technology and Education</publisher-name>; <year>2024</year>. p. <fpage>5209</fpage>&#x2013;<lpage>17</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>T</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Learning to navigate for fine-grained classification</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8&#x2013;14</conf-name>; <publisher-loc>Munich, Germany. Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2018</year>. p. <fpage>420</fpage>&#x2013;<lpage>35</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Morariu</surname> <given-names>VI</given-names></string-name>, <string-name><surname>Davis</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Learning a discriminative filter bank within a CNN for fine-grained recognition</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;22</conf-name>; <publisher-loc>Salt Lake City, UT, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2018</year>. p. <fpage>4148</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ge</surname> <given-names>W</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Weakly supervised complementary parts models for fine-grained image classification from the bottom up</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 16&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>3034</fpage>&#x2013;<lpage>43</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>S</given-names></string-name>, <string-name><surname>He</surname> <given-names>X</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Distribution alignment: a unified framework for long-tail visual recognition</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 19&#x2013;25</conf-name>; <publisher-loc>Nashville, TN, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>2361</fpage>&#x2013;<lpage>70</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Krause</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fei-Fei</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Fine-grained recognition without part annotations</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7&#x2013;12</conf-name>; <publisher-loc>Boston, MA, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2015</year>. p. <fpage>5546</fpage>&#x2013;<lpage>55</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Krause</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fei-Fei</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Fine-grained crowdsourcing for fine-grained recognition</article-title>. In: <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2013 Jun 23&#x2013;28</conf-name>; <publisher-loc>Portland, OR, USA. Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2013</year>. p. <fpage>580</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cord</surname> <given-names>M</given-names></string-name>, <string-name><surname>Douze</surname> <given-names>M</given-names></string-name>, <string-name><surname>Massa</surname> <given-names>F</given-names></string-name>, <string-name><surname>Sablayrolles</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Training data-efficient image transformers &#x0026; distillation through attention</article-title>. In: <conf-name>Proceedings of the 38th International Conference on Machine Learning (PMLR); 2021 Jul 18&#x2013;24</conf-name>; <publisher-loc>Vienna, Austria.</publisher-loc> p. <fpage>10347</fpage>&#x2013;<lpage>57</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zha</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Boosting few-shot fine-grained recognition with background suppression and foreground alignment</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision (ECCV); 2023 Sep 25&#x2013;29</conf-name>; <publisher-loc>Tel Aviv, Israel. Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2023</year>. p. <fpage>3947</fpage>&#x2013;<lpage>61</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>