<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">29297</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2022.029297</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Fine-grained Ship Image Recognition Based on BCNN with Inception and&#x00A0;AM-Softmax</article-title>
<alt-title alt-title-type="left-running-head">Fine-grained Ship Image Recognition Based on BCNN with Inception and AM-Softmax</alt-title>
<alt-title alt-title-type="right-running-head">Fine-grained Ship Image Recognition Based on BCNN with Inception and AM-Softmax</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Zhilin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Ting</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Liu</surname><given-names>Zhaoying</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>zhaoying.liu@bjut.edu.cn</email>
</contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Peijie</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Tu</surname><given-names>Shanshan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Yujian</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-7" contrib-type="author">
<name name-style="western"><surname>Waqas</surname><given-names>Muhammad</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Faculty of Information Technology, Beijing University of Technology</institution>, <addr-line>Beijing 100124</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Artificial Intelligence, Guilin University of Electronic Technology</institution>, <addr-line>Guilin, 541004</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Engineering, Edith Cowan University</institution>, <addr-line>Perth WA 6027</addr-line>, <country>Australia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Zhaoying Liu. Email: <email>zhaoying.liu@bjut.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-05-16"><day>16</day>
<month>05</month>
<year>2022</year></pub-date>
<volume>73</volume>
<issue>1</issue>
<fpage>1527</fpage>
<lpage>1539</lpage>
<history>
<date date-type="received"><day>01</day><month>3</month><year>2022</year></date>
<date date-type="accepted"><day>01</day><month>4</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Zhang et al.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Zhang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_29297.pdf"></self-uri>
<abstract>
<p>The fine-grained ship image recognition task aims to identify various classes of ships. However, small inter-class differences, large intra-class differences, and a lack of training samples make the task difficult. Therefore, to enhance the accuracy of fine-grained ship image recognition, we design a fine-grained ship image recognition network based on the bilinear convolutional neural network (BCNN) with Inception and additive margin Softmax (AM-Softmax). This network improves the BCNN in two aspects. First, introducing Inception branches into the BCNN enhances its ability to extract comprehensive features from ships. Second, by adding margin values to the decision boundary, the AM-Softmax function can better enlarge the inter-class differences and reduce the intra-class differences. In addition, as there are few publicly available datasets for fine-grained ship image recognition, we construct a Ship-43 dataset containing 47,300 ship images belonging to 43 categories. Experimental results on the constructed Ship-43 dataset demonstrate that our method effectively improves the accuracy of ship image recognition, achieving 4.08&#x0025; higher accuracy than the BCNN model. Moreover, comparison results on three other public fine-grained datasets (Cub, Cars, and Aircraft) further validate the effectiveness of the proposed method.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Fine-grained ship image recognition</kwd>
<kwd>Inception</kwd>
<kwd>AM-softmax</kwd>
<kwd>BCNN</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>Fine-grained image recognition (FGIR) refers to the recognition of different subclasses of the same category [<xref ref-type="bibr" rid="ref-1">1</xref>], for example, distinguishing &#x201C;freighters&#x201D; from &#x201C;merchant ships&#x201D;. Traditional image recognition tasks have achieved great success, but due to small inter-class and large intra-class differences, the performance of fine-grained image recognition is still unsatisfactory. Since ships are a major carrier of marine traffic and transport, fine-grained ship image recognition has attracted increasing attention. It has been widely applied to maintaining maritime safety, such as maritime traffic monitoring and maritime search, thereby improving the capability of coastal defense and early warning [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>]. However, for ship targets, the shapes and structures are similar from one category to another, and there is also rich component diversity within the same class, making fine-grained ship recognition a very challenging task.</p>
<p>Traditional methods for fine-grained ship image recognition mainly use manually designed feature extraction algorithms for feature matching [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-5">5</xref>]. They cannot fully utilize the information contained in the dataset to extract distinctive features of the objects, which limits fine-grained recognition performance. Furthermore, all of these methods have low generalization capacity. With the development of deep learning techniques, many deep models based on convolutional neural networks (CNN) [<xref ref-type="bibr" rid="ref-6">6</xref>] have been developed to improve accuracy by automatically learning better feature representations from the dataset. Among these deep models, the bilinear convolutional neural network (BCNN) [<xref ref-type="bibr" rid="ref-7">7</xref>] demonstrates satisfying performance for fine-grained image recognition. The BCNN typically utilizes two parallel branches of the VGGNet network [<xref ref-type="bibr" rid="ref-8">8</xref>] to retrieve features at each image position, integrates the features with an outer product operation, and is trained end-to-end. However, the BCNN has two deficiencies. 1) The two branches of the network only consist of 3&#x2009;&#x00D7;&#x2009;3 convolutional kernels, and such small convolutional kernels generally ignore certain global information [<xref ref-type="bibr" rid="ref-9">9</xref>]. 2) The BCNN uses the Softmax loss function, which has a weak ability to activate subtle features and is likely to misclassify images with particularly small inter-class differences [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>].</p>
<p>To enhance the performance of fine-grained ship image recognition, we develop a fine-grained image recognition network for ships based on the BCNN with Inception and AM-Softmax, which improves the BCNN from two perspectives. First, to gather global information, we replace one branch of the BCNN with an Inception module; this helps aggregate feature information at a large scale and increases the ability to extract global information. Second, to activate the distinctive characteristics between different classes and enlarge the inter-class distance while reducing the intra-class distance, we introduce the AM-Softmax function, which effectively activates the differences between ship classes by adding an additive margin to the decision boundaries. Moreover, we construct a fine-grained ship image dataset containing 47,300 images belonging to 43 categories. The key advantages and major contributions of the proposed method are:
<list list-type="bullet">
<list-item><p>To extract global information, we design Inception modules and use them to replace a branch of the BCNN network.</p></list-item>
<list-item><p>To better activate the features that distinguish fine-grained images, we introduce AM-Softmax, which enlarges inter-class differences by adding an additive margin to the decision boundaries.</p></list-item>
<list-item><p>Based on the existing dataset, we construct a richer ship dataset.</p></list-item>
</list></p><p>The rest of the paper is organized as follows. Section 2 summarizes related work. The proposed method is described in Section 3. Detailed experiments and analysis are conducted in Section 4. Section 5 concludes the paper.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<p>In recent years, many fine-grained image recognition methods have been developed. These methods can be roughly classified into three main paradigms: fine-grained recognition with localization-classification subnetworks, with end-to-end feature encoding, and with external information. Localization-classification subnetwork approaches design a localization subnetwork to locate key parts [<xref ref-type="bibr" rid="ref-12">12</xref>], followed by a classification subnetwork employed to recognize those parts, such as Part-based CNN [<xref ref-type="bibr" rid="ref-13">13</xref>] and Mask-CNN [<xref ref-type="bibr" rid="ref-14">14</xref>]. These approaches are more likely to find discriminative parts [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>], but require more annotation information. End-to-end feature encoding methods learn a more discriminative feature representation by designing powerful models; the most representative method among them is the BCNN. Beyond these two paradigms, a third paradigm leverages external information, such as web data and multi-modality data, to further assist fine-grained recognition [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>].</p>
<p>The BCNN extracts features via a network of two parallel branches, each of which is a VGG16, and performs an outer product operation on the two outputs. The outer product completes feature fusion at each location, which can capture discriminative features. The structure of the VGG16 network is relatively simple, and each layer uses small convolutional kernels. By increasing the depth of the network, rich feature information can be obtained and overall performance can be improved. However, small convolutional kernels ignore some global information when extracting features layer by layer, and simply increasing the depth of the network introduces problems such as overfitting, vanishing gradients, and training difficulties. The Inception network [<xref ref-type="bibr" rid="ref-19">19</xref>] proposed by Szegedy et al. is wider and more efficient; it uses larger-scale convolutional kernels to extract global information, and reduces the number of parameters by factorizing the convolutional kernels.</p>
<p>In recent years, besides the commonly used Softmax, various loss functions [<xref ref-type="bibr" rid="ref-20">20</xref>] have been proposed that can optimize the distance between classes. The L-Softmax [<xref ref-type="bibr" rid="ref-21">21</xref>] was the first angle-based loss function; it reduces the angle between the feature vector and the corresponding weight vector by introducing a parameter <italic>m</italic>. The A-Softmax [<xref ref-type="bibr" rid="ref-22">22</xref>] normalizes the weights, and by adding a large angular margin, makes the network focus more on optimizing the angles between features and weight vectors. Cosface [<xref ref-type="bibr" rid="ref-23">23</xref>] reformulates the Softmax as a cosine loss: it removes radial variations by <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>-normalizing both features and weight vectors, and further maximizes the decision margin in the angular space by introducing a cosine margin term. By introducing additive angles to the decision boundary, Arc-Softmax [<xref ref-type="bibr" rid="ref-24">24</xref>] maximizes the classification margin in the angular space.</p>
</sec>
<sec id="s3"><label>3</label><title>The Proposed Method</title>
<p>In this paper, based on the BCNN framework, we design a fine-grained ship image recognition network by introducing Inception and AM-Softmax. Adding the Inception module to one branch of the BCNN helps enhance the ability of the whole network to extract global information. Meanwhile, the network uses the AM-Softmax function to learn decision boundaries among the different classes, which increases the inter-class distance and reduces the intra-class distance.</p>
<p>The architecture of the proposed method is illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. There are two parallel branches: one uses VGG16 to extract features from local information, and the other introduces the Inception module to extract features from global information. The outputs of the two branches are combined using the outer product and average-pooled to obtain the bilinear feature representation. The bilinear vector is then passed through a linear classifier and an AM-Softmax layer to obtain class predictions. Finally, the cross-entropy loss function is used to guide and optimize the training of the network.</p>
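<p>The outer-product fusion and pooling step described above can be sketched in NumPy as follows. This is a minimal sketch for illustration only: the channel counts and feature-map size are arbitrary, not the actual dimensions of the VGG16 and Inception branches, and the signed square root and normalization follow the original BCNN formulation.</p>

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Fuse two branch feature maps with an outer product at every location.

    feat_a: (c1, h, w) features from one branch (illustrative shapes).
    feat_b: (c2, h, w) features from the other branch.
    Returns an L2-normalized bilinear vector of length c1 * c2.
    """
    c1, h, w = feat_a.shape
    c2 = feat_b.shape[0]
    a = feat_a.reshape(c1, h * w)
    b = feat_b.reshape(c2, h * w)
    # Outer product at each of the h*w positions, averaged over positions.
    bilinear = a @ b.T / (h * w)                      # (c1, c2)
    vec = bilinear.reshape(-1)
    # Signed square root and L2 normalization, as in the original BCNN.
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    return vec / (np.linalg.norm(vec) + 1e-12)

rng = np.random.default_rng(0)
z = bilinear_pool(rng.standard_normal((8, 7, 7)),
                  rng.standard_normal((16, 7, 7)))
print(z.shape)  # (128,)
```

<p>The resulting vector would then be fed to the linear classifier and AM-Softmax layer.</p>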
<fig id="fig-1"><label>Figure 1</label><caption><title>Network architecture</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-1.png"/></fig>
<sec id="s3_1"><label>3.1</label><title>The Inception Branch</title>
<p>In the VGG16 network, small-scale convolutional kernels make it easy to capture local feature information, but difficult to extract global features. Meanwhile, networks with larger-scale convolutional kernels usually require a large amount of computation. According to the literature, the Inception network can extract global information and reduce computational cost even while using larger-scale convolutional kernels. Inspired by the Inception network, we design three modules, IncepA, IncepB, and IncepC, to extract global information. These modules, shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, have convolutional kernel sizes of 3&#x2009;&#x00D7;&#x2009;3, 5&#x2009;&#x00D7;&#x2009;5, and 7&#x2009;&#x00D7;&#x2009;7, respectively.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>Three Inception modules</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-2.png"/></fig>
<p>In all three modules, a 1&#x2009;&#x00D7;&#x2009;1 convolution and a pooling operation are performed, which reduces the amount of computation. The 3&#x2009;&#x00D7;&#x2009;3 convolutional kernel is then decomposed into 1&#x2009;&#x00D7;&#x2009;3 and 3&#x2009;&#x00D7;&#x2009;1 vector kernels. In IncepB and IncepC, the 1&#x2009;&#x00D7;&#x2009;5 or 1&#x2009;&#x00D7;&#x2009;7 vector kernels are stacked twice. Finally, all three components are concatenated. These cascaded vector kernels can roughly achieve the effect of large-scale convolutional kernels. Decomposing the large-scale kernels effectively reduces the total number of parameters without increasing the computational cost.</p>
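<p>The parameter saving from this factorization can be checked with a quick weight count. The channel width of 256 below is an arbitrary illustration, not a value taken from the network configuration:</p>

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of one convolution layer (biases omitted)."""
    return kh * kw * c_in * c_out

c = 256  # illustrative channel count, not a value from the paper
full_7x7 = conv_params(7, 7, c, c)                            # 49 * c^2 weights
factored = conv_params(1, 7, c, c) + conv_params(7, 1, c, c)  # 14 * c^2 weights
print(factored / full_7x7)  # 2/7 of the parameters, same 7x7 receptive field
```

<p>A 1&#x2009;&#x00D7;&#x2009;7 followed by a 7&#x2009;&#x00D7;&#x2009;1 kernel covers the same 7&#x2009;&#x00D7;&#x2009;7 receptive field with roughly 2/7 of the weights.</p>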
<p>Once the three modules have been designed, the Inception branch network is built as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The VGG16 branch network, shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, consists of 13 convolutional layers and 3 fully connected layers. The Inception branch network also uses 13 convolutional layers, with the 3 modules replacing 9 of the convolutional layers in VGG16.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>Inception branch network</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-3.png"/></fig>
<fig id="fig-4"><label>Figure 4</label><caption><title>VGG16 branch network</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-4.png"/></fig>
<p>By using Inception modules in the Inception branch, large-scale convolutional kernels are added to this network. Furthermore, the decomposed kernels help the network to extract much richer global features without increasing the overall computational effort.</p>
</sec>
<sec id="s3_2"><label>3.2</label><title>Additive Margin Softmax</title>
<p>The BCNN uses the original Softmax loss function. Ignoring the bias term, the original Softmax loss is defined as
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>softmaxLoss</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>W</mml:mi><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>W</mml:mi><mml:mi>j</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo>.</mml:mo><mml:mo 
fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:msub><mml:mi></mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo>.</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" 
stretchy="false">|</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> denotes the feature of the <italic>i</italic>-th sample, belonging to the <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>-th class, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the <italic>j</italic>-th column weight of the last fully connected layer. The <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msubsup><mml:mi>W</mml:mi><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is called as the target logit of the <italic>i</italic>-th sample, <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> represents the angle between the weight and input value. Then, the weights and inputs in the above <xref ref-type="disp-formula" rid="eqn-1">Eq. 
(1)</xref> are normalized (making <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:math></inline-formula> to be 1), we obtain the modified expression as
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>modified</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>c</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
<p>If a two-dimensional feature is taken as an example and represented on a circle, a geometric interpretation of the above equation can be clearly illustrated as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> can be considered the center vectors of the two classes; <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> represent the angles between the sample vector <italic>x</italic> and the two center vectors. <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> represents the decision boundary generated by the Softmax function for the two classes, and accordingly, <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> are generated by AM-Softmax. 
If <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x003E;</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, the feature is identified as category 1. As a result, the two classes share only a single decision boundary <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, i.e., <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. With only one decision boundary, special samples whose intra-class distance is larger than the inter-class distance can easily be misclassified.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Decision boundaries of Softmax and AM-Softmax</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-5.png"/></fig>
<p>Ship images are characterized by small differences between classes and large differences within classes. It is therefore necessary to design a loss function that increases the distance between classes and decreases the distance within classes.</p>
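<p>The margin idea can be sketched as a minimal NumPy implementation of the AM-Softmax loss. This is an illustrative sketch, not the paper's implementation: the margin <italic>m</italic>&#x2009;=&#x2009;0.35 and scale <italic>s</italic>&#x2009;=&#x2009;30 are typical values from the AM-Softmax literature, and the scale factor is an assumption here rather than something stated in this section.</p>

```python
import numpy as np

def am_softmax_loss(features, weights, labels, m=0.35, s=30.0):
    """AM-Softmax: cosine logits with an additive margin on the target class.

    features: (N, d) sample features; weights: (d, C) class weight vectors;
    labels: (N,) integer class labels. m and s are illustrative defaults.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)  # ||x_i|| = 1
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)    # ||W_j|| = 1
    cos = f @ w                                   # (N, C) cosine similarities
    idx = np.arange(len(labels))
    logits = s * cos
    # Subtract the margin m from the target-class cosine only.
    logits[idx, labels] = s * (cos[idx, labels] - m)
    # Numerically stable cross-entropy over the margined logits.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()

rng = np.random.default_rng(0)
x, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
y = np.array([0, 1, 2, 0])
print(am_softmax_loss(x, W, y) > am_softmax_loss(x, W, y, m=0.0))  # True
```

<p>Because the margin lowers the target-class logit, the loss with <italic>m</italic>&#x2009;&#x003E;&#x2009;0 is always at least as large as the plain normalized Softmax loss, forcing training to pull samples closer to their class center.</p>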
<p>To increase the inter-class distance and decrease the intra-class distance, a margin <italic>m</italic> can be explicitly added to the decision boundaries of the categories. That is, based on the Softmax loss function, the decision boundary consists of two decision surfaces <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>. The boundary for category 1 is <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>, and for category 2 it is <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>. 
Then, in this paper, we assume that the norm of both <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are normalized to 1, the Additive Margin Softmax loss function can be designed as, which is denoted as AM-Softmax loss function,
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>AM - Softmax</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mo 
movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>cos</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:msub><mml:mi></mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:msub><mml:mi></mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mo 
movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:msub><mml:mi></mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
</sec>
<sec id="s3_3"><label>3.3</label><title>The Overall Procedure of the Proposed Method</title>
<p>By adding the Inception branch network and the AM-Softmax loss function, the network can extract features with both local and global information and better separate different ship classes. The whole procedure of the proposed method is described in detail below.
<list list-type="simple">
<list-item><label>(1)</label><p>The input image <italic>I</italic> is cropped to 448&#x2009;&#x00D7;&#x2009;448 and augmented by horizontal flipping, random rotation, random cropping, etc. The processed image is denoted as <italic>X</italic>.</p></list-item>
<list-item><label>(2)</label><p>The processed images are input to the proposed network, and the feature extraction process is denoted as <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>W</mml:mi><mml:mo>&#x2217;</mml:mo><mml:mi>X</mml:mi></mml:math></inline-formula>, where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mo>&#x2217;</mml:mo></mml:math></inline-formula> represents a series of convolution, ReLU, and pooling operations, and <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represent all the parameters of the two branches. <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are the extracted feature maps, each of shape 28&#x2009;&#x00D7;&#x2009;28&#x2009;&#x00D7;&#x2009;512.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mi>X</mml:mi></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mi>X</mml:mi></mml:math></disp-formula></p></list-item>
<list-item><label>(3)</label><p>At the same position <italic>l</italic> of the two feature maps, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> each have a 1&#x2009;&#x00D7;&#x2009;512 vector, i.e., <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>A</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>; the outer product operation yields a 512&#x2009;&#x00D7;&#x2009;512 matrix <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>b</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>b</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>B</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p></list-item>
<list-item><label>(4)</label><p>An average pooling operation is performed over the matrices at all positions to obtain <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:mo>=</mml:mo><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">v</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">r</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">g</mml:mi><mml:mi mathvariant="normal">e</mml:mi><mml:mi mathvariant="normal">p</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">i</mml:mi><mml:mi mathvariant="normal">n</mml:mi><mml:mi mathvariant="normal">g</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p></list-item>
<list-item><label>(5)</label><p>The bilinear vector <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is obtained by vectorising <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>vector</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p></list-item>
<list-item><label>(6)</label><p>The bilinear vector is normalized by the signed square root to obtain <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> as follows.
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">s</mml:mi></mml:mrow><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msqrt><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>X</mml:mi></mml:msup></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:msqrt></mml:math></disp-formula></p></list-item>
<list-item><label>(7)</label><p>Then <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mrow><mml:msub><mml:mrow><mml:mtext>L</mml:mtext></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> normalization of the above feature is performed as follows, where <italic>z</italic> is the input to the next layer.
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msup><mml:mi>B</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:msub><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></disp-formula></p></list-item>
<list-item><label>(8)</label><p><italic>z</italic> is input into the fully connected layer, and the prediction score <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is computed using the AM-Softmax function.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mi>j</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p></list-item>
<list-item><label>(9)</label><p>The loss is calculated using the AM-Softmax loss function and back-propagated to optimize the network, and the network parameters are updated.
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>AM - Softmax</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></disp-formula></p></list-item>
</list></p>
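<p>Steps (3)&#x2013;(7) above can be sketched compactly as follows. This is an illustrative NumPy re-implementation under the stated shapes, not the authors' released code; the function name and the small epsilon guard are our own choices.</p>

```python
import numpy as np

def bilinear_vector(f_a, f_b, eps=1e-12):
    # f_a, f_b: (H, W, C) feature maps from the two branches
    # (28 x 28 x 512 in the paper); returns the normalized vector z.
    h, w, c = f_a.shape
    a = f_a.reshape(-1, c)                        # stack the H*W locations
    b = f_b.reshape(-1, c)
    # outer product b(l, X) at every location l, pooled over locations
    b_x = (a.T @ b) / (h * w)                     # (C, C), Eqs. (6)-(7)
    b_x = b_x.reshape(-1)                         # vectorize, Eq. (8)
    b_p = np.sign(b_x) * np.sqrt(np.abs(b_x))     # signed square root, Eq. (9)
    return b_p / (np.linalg.norm(b_p) + eps)      # L2 normalization, Eq. (10)
```

<p>Summing the per-location outer products is equivalent to the single matrix product above, which is how bilinear pooling is usually implemented in practice.</p>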
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experimental Results</title>
<p>To validate the performance of the proposed method, we conduct experiments on the constructed dataset and three other public datasets. Comparison experiments are also performed with four other popular methods to further verify the effectiveness of the proposed method. In the following parts, we present the dataset, the details of the training process, the ablation experiments, and the comparison results.</p>
<sec id="s4_1"><label>4.1</label><title>Dataset</title>
<p>The Ship-43 dataset is a fine-grained image dataset constructed independently by our group. Some of its images and labels come from the website CNSS (<uri xlink:href="https://www.cnss.com.cn">www.cnss.com.cn</uri>). The Ship-43 dataset contains 43 categories, each containing approximately 1,100 images. Some examples are shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. In each category, 1,000 images are used for training and the other 100 images are used for testing. In addition, to validate the generalization capacity of the proposed method, three commonly used public datasets for fine-grained image recognition are also used: the Cub dataset [<xref ref-type="bibr" rid="ref-25">25</xref>], the Car dataset [<xref ref-type="bibr" rid="ref-26">26</xref>], and the Aircraft dataset [<xref ref-type="bibr" rid="ref-27">27</xref>]. The Cub dataset contains 11,788 images of 200 bird species, where each category contains a relatively balanced set of about 30 training images and 29 test images. The Car dataset contains 16,185 images of 196 car categories, distinguished by attributes such as manufacturer, model, and year. The Aircraft dataset contains 102 categories with 100 images each, of which two-thirds are used for training and the rest for testing.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>Examples of ship images in Ship-43</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-6.png"/></fig>
</sec>
<sec id="s4_2"><label>4.2</label><title>Training Details</title>
<p><bold>Experimental frameworks and devices.</bold> The experiments are implemented in the PyTorch framework and run on four NVIDIA Tesla V100 GPUs, each with 32&#x2009;GB of memory.</p>
<p><bold>Network training.</bold> This paper adopts a transfer learning approach: the network is pre-trained on ImageNet. In the first stage, all parameters except those of the fully connected layer are frozen, and the fully connected layer is trained on the fine-grained dataset with a larger learning rate. In the second stage, the entire network is fine-tuned with a smaller learning rate.</p>
<p><bold>Image size.</bold> The image size affects the final accuracy of the experiment. Taking the available memory into account, the input size is set to 448&#x2009;&#x00D7;&#x2009;448.</p>
<p><bold>Learning rate.</bold> The learning rate is set to 1e&#x2212;2 in the first stage, then it is set to 1e&#x2212;3 when the network is fine-tuned.</p>
<p><bold>Batch size.</bold> Considering the size of the dataset and the available memory, the batch size is set to 256.</p>
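<p>The two-stage schedule described above (freeze all but the classifier at learning rate 1e&#x2212;2, then fine-tune everything at 1e&#x2212;3) can be sketched in PyTorch as follows. The tiny model, the module name <monospace>fc</monospace>, and the SGD optimizer are illustrative assumptions, not the authors' code.</p>

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # stand-in for the real backbone + classifier
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(4, 4)   # placeholder for the pretrained backbone
        self.fc = nn.Linear(4, 2)         # fully connected classifier

def configure_stage(model, stage):
    # Stage 1: freeze everything except the classifier, larger lr (1e-2).
    # Stage 2: unfreeze the whole network, smaller lr (1e-3).
    for name, p in model.named_parameters():
        p.requires_grad = (stage == 2) or name.startswith("fc")
    params = [p for p in model.parameters() if p.requires_grad]
    lr = 1e-2 if stage == 1 else 1e-3
    return torch.optim.SGD(params, lr=lr, momentum=0.9)
```

<p>The optimizer is rebuilt at each stage so that it only tracks the parameters that are currently trainable.</p>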
</sec>
<sec id="s4_3"><label>4.3</label><title>Fine-grained Ship Recognition Results</title>
<p>To evaluate the performance of the proposed model, ablation experiments and comparison experiments are carried out on the above datasets. Firstly, ablation experiments are performed to verify the influence of the Inception branch and the AM-Softmax for fine-grained ship image recognition, respectively. Then, comparison experiments with four well-known methods are performed to validate the effectiveness of the proposed method.</p>
<sec id="s4_3_1"><label>4.3.1</label><title>Ablation Experiment for Inception Branch</title>
<p>Based on the Softmax loss function, to verify the effectiveness of the Inception branch network, we design three different networks: (1) BCNN: both branch networks use the VGG16 network. (2) BCNN-I: both branches use the Inception branch network. (3) BCNN-II: one branch uses the VGG16 network, and the other uses the Inception branch network. The experimental results are presented in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Accuracy of different networks on different datasets</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-7.png"/></fig>
<p>From the experimental results, the network merging the Inception branch and the VGG branch achieves the highest accuracy on all datasets. On the Ship-43 dataset, our method improves by 2.06&#x0025; over the BCNN network and by 1.14&#x0025; over the network with two Inception branches. This indicates that a network that simultaneously extracts global and local information is more appropriate. Meanwhile, our proposed network also improves the accuracy on the three general fine-grained image datasets.</p>
</sec>
<sec id="s4_3_2"><label>4.3.2</label><title>Ablation Experiment for AM-Softmax</title>
<p>To properly assess the influence of the AM-Softmax, BCNN is used as the benchmark network for all experiments in this section. The influence of the AM-Softmax function on fine-grained ship recognition is analyzed in two parts: the influence of different additive margin values <italic>m</italic>, and the comparison of accuracy between different loss functions.
<list list-type="alpha-upper">
<list-item><label>A.</label><p>The influence of different additive margin <italic>m</italic> values</p></list-item>
</list></p>
<p>To explore how a margin can be added manually to assist the network in achieving better accuracy, the <italic>m</italic> values are set to 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6 in this section, and the accuracy is shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Results of different <italic>m</italic> values</title></caption>
<table>
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"><italic>m</italic> values</th>
<th align="left">Ship-43</th>
<th align="left">Cub</th>
<th align="left">Car</th>
<th align="left">Aircraft</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">0.1</td>
<td align="left">80.86</td>
<td align="left">82.41</td>
<td align="left">88.47</td>
<td align="left">82.63</td>
</tr>
<tr>
<td align="left">0.2</td>
<td align="left">80.59</td>
<td align="left">81.28</td>
<td align="left">88.34</td>
<td align="left">81.38</td>
</tr>
<tr>
<td align="left">0.3</td>
<td align="left">80.15</td>
<td align="left">82.31</td>
<td align="left">87.42</td>
<td align="left">81.76</td>
</tr>
<tr>
<td align="left">0.4</td>
<td align="left">80.97</td>
<td align="left">83.68</td>
<td align="left"><bold>91.94</bold></td>
<td align="left">84.07</td>
</tr>
<tr>
<td align="left">0.5</td>
<td align="left"><bold>81.76</bold></td>
<td align="left"><bold>83.96</bold></td>
<td align="left">90.89</td>
<td align="left"><bold>84.11</bold></td>
</tr>
<tr>
<td align="left">0.6</td>
<td align="left">81.24</td>
<td align="left">83.51</td>
<td align="left">89.53</td>
<td align="left">83.17</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From <xref ref-type="table" rid="table-1">Tab. 1</xref>, we can see that the margin value is a hyperparameter, and different margin values result in different accuracy. Moreover, the best margin value differs across datasets. For example, when <italic>m</italic>&#x2009;&#x003D;&#x2009;0.5, three datasets, Ship-43, Cub, and Aircraft, obtain the best results compared with other margin values. Compared with the Cub and Car datasets, different values have less effect on the Ship-43 dataset; this may be because the Ship-43 dataset has a relatively small number of categories and a large number of images.
<list list-type="simple">
<list-item><label>B.</label><p>The comparison of accuracy between different loss functions</p></list-item>
</list></p>
<p>Based on the analysis of different margin values in AM-Softmax, <italic>m</italic>&#x2009;&#x003D;&#x2009;0.5 is selected for the following experiments. Meanwhile, the default optimal hyper-parameters are used for A-Softmax and Arc-Softmax, respectively. To demonstrate the advantages of the AM-Softmax loss function, comparison experiments are conducted with other commonly used loss functions, and the comparison results are shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Comparison results of different loss functions</title></caption>
<table>
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Loss function</th>
<th align="left">Ship-43</th>
<th align="left">Cub</th>
<th align="left">Cars</th>
<th align="left">Aircraft</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Softmax</td>
<td align="left">79.37</td>
<td align="left">83.10</td>
<td align="left">86.50</td>
<td align="left">82.10</td>
</tr>
<tr>
<td align="left">A-Softmax</td>
<td align="left">80.89</td>
<td align="left">83.35</td>
<td align="left">89.35</td>
<td align="left">75.26</td>
</tr>
<tr>
<td align="left">Arc-Softmax</td>
<td align="left">81.95</td>
<td align="left">79.05</td>
<td align="left"><bold>91.94</bold></td>
<td align="left">82.21</td>
</tr>
<tr>
<td align="left">AM-Softmax</td>
<td align="left"><bold>82.15</bold></td>
<td align="left"><bold>83.96</bold></td>
<td align="left">91.47</td>
<td align="left"><bold>84.11</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On the Ship-43 dataset, compared to the Softmax function, these modified functions (A-Softmax, Arc-Softmax and AM-Softmax) improve the recognition accuracy, and especially the model with the AM-Softmax function improves the accuracy by 2.78&#x0025;. Moreover, AM-Softmax achieves the highest accuracy on both the Cub and Aircraft datasets.</p>
<p><xref ref-type="fig" rid="fig-8">Fig. 8</xref> shows the trend of the loss values of AM-Softmax and Softmax.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>Loss graph</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_29297-fig-8.png"/></fig>
<p>We can see that the loss value of the AM-Softmax is always much smaller than that of the Softmax during the training process. In addition, with the AM-Softmax loss function, the network not only converges faster but also achieves higher accuracy. Meanwhile, during the validation process, the AM-Softmax loss value is also smaller than the Softmax loss value, which further indicates that AM-Softmax is better suited for fine-grained recognition. Because of the two-stage training scheme, learning rate decay is applied when fine-tuning in the later stage, which further decreases the loss value.</p>
</sec>
<sec id="s4_3_3"><label>4.3.3</label><title>Comparison Results</title>
<p>To further verify the effectiveness of the proposed method, we conduct comparison experiments with four popular models for fine-grained image recognition: the compact bilinear pooling network (CBP) [<xref ref-type="bibr" rid="ref-28">28</xref>], the low-rank bilinear pooling network (LRBP) [<xref ref-type="bibr" rid="ref-29">29</xref>], the BCNN with the Softmax function, and the BCNN with the AM-Softmax function. CBP obtains a compact feature representation by approximating the bilinear pooling kernel of BCNN. LRBP compresses the model through a low-rank co-decomposition of the classifier. These networks are frequently used for fine-grained recognition tasks. Since BCNN is the benchmark framework of this paper, comparing our method with BCNN variants on the same framework provides a fair assessment and reveals how improvements to different components affect performance.</p>
<p>As shown in <xref ref-type="table" rid="table-3">Tab. 3</xref>, our method achieves the highest accuracy on the Ship-43 dataset, exceeding CBP and LRBP by 2.29&#x0025; and 0.88&#x0025;, respectively. Compared to the BCNN with the Softmax or the AM-Softmax, the maximum improvement is 4.08&#x0025;. Our method also achieves the highest accuracy on the Cub dataset and remains competitive on the Aircraft dataset. In addition, our method only modifies the backbone network and the loss function, so its computational cost is similar to that of BCNN. Overall, the effectiveness and generalizability of the proposed method for fine-grained recognition are further validated.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Recognition accuracy of different models</title></caption>
<table>
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"/>
<th align="left">Ship-43</th>
<th align="left">Cub</th>
<th align="left">Cars</th>
<th align="left">Aircraft</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">BCNN&#x2009;&#x002B;&#x2009;Softmax</td>
<td align="left">79.34</td>
<td align="left">83.10</td>
<td align="left">86.94</td>
<td align="left">82.10</td>
</tr>
<tr>
<td align="left">CBP</td>
<td align="left">81.13</td>
<td align="left">84.01</td>
<td align="left">90.83</td>
<td align="left"><bold>87.40</bold></td>
</tr>
<tr>
<td align="left">LRBP</td>
<td align="left">82.54</td>
<td align="left">84.21</td>
<td align="left">90.92</td>
<td align="left">87.31</td>
</tr>
<tr>
<td align="left">BCNN&#x2009;&#x002B;&#x2009;AM-Softmax</td>
<td align="left">82.15</td>
<td align="left">83.96</td>
<td align="left"><bold>91.47</bold></td>
<td align="left">84.11</td>
</tr>
<tr>
<td align="left">Our method</td>
<td align="left"><bold>83.42</bold></td>
<td align="left"><bold>85.32</bold></td>
<td align="left">90.63</td>
<td align="left">86.81</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>In this paper, to improve the performance of fine-grained ship image recognition, we modify the BCNN network in two aspects. First, by adding an Inception branch to the feature extraction network, the network can merge local and global feature information from kernels of different scales. Second, by adding a margin to the decision boundary, the AM-Softmax function can enlarge the differences between ship classes and better discriminate different categories. Moreover, we construct a fine-grained ship image dataset. Ablation and comparison experiments on the fine-grained ship dataset and three other fine-grained datasets demonstrate that our method is effective and generalizes well. The proposed method can also be applied to many other fine-grained applications, such as bird species identification, car identification, aircraft type identification, and online plant identification. Our future work will focus on designing end-to-end models that can extract more distinguishable details to further improve the accuracy of fine-grained ship image recognition.</p>
</sec>
</body>
<back>
<ack>
<p>We express our thanks to Professor Li Yujian for providing devices.</p>
</ack>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work is supported by the National Natural Science Foundation of China (61806013, 61876010, 62176009, and 61906005), General project of Science and Technology Plan of Beijing Municipal Education Commission (KM202110005028), Beijing Municipal Education Commission Project (KZ201910005008), Project of Interdisciplinary Research Institute of Beijing University of Technology (2021020101) and International Research Cooperation Seed Fund of Beijing University of Technology (2021A01).</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. S.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Cui</surname></string-name></person-group>, &#x201C;<article-title>Deep learning for fine-grained image analysis: A survey</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>19</volume>, no. <issue>23</issue>, pp. <fpage>118</fpage>&#x2013;<lpage>173</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Shi</surname></string-name></person-group>, &#x201C;<article-title>Depthwise separable convolution neural network for high-speed SAR ship detection</article-title>,&#x201D; <source>Remote Sensing</source>, vol. <volume>21</volume>, no. <issue>21</issue>, pp. <fpage>2483</fpage>&#x2013;<lpage>2492</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Vaiyapuri</surname></string-name>, <string-name><given-names>S. N.</given-names> <surname>Mohanty</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sivaram</surname></string-name>, <string-name><given-names>I. V.</given-names> <surname>Pustokhina</surname></string-name>, <string-name><given-names>D. A.</given-names> <surname>Pustokhin</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Automatic vehicle license plate recognition using optimal deep learning model</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>67</volume>, no. <issue>2</issue>, pp. <fpage>1881</fpage>&#x2013;<lpage>1897</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Xia</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Wan</surname></string-name></person-group>, &#x201C;<article-title>A novel sea-land segmentation algorithm based on local binary patterns for ship detection</article-title>,&#x201D; <source>Signal Processing, Image Processing and Pattern Recognition</source>, vol. <volume>7</volume>, no. <issue>3</issue>, pp. <fpage>237</fpage>&#x2013;<lpage>246</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. Y.</given-names> <surname>Fan</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Luo</surname></string-name></person-group>, &#x201C;<article-title>Fractal properties of autoregressive spectrum and its application on weak target detection in sea clutter background</article-title>,&#x201D; <source>IET Radar, Sonar &#x0026; Navigation</source>, vol. <volume>9</volume>, no. <issue>8</issue>, pp. <fpage>1070</fpage>&#x2013;<lpage>1077</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>LeCun</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>Deep learning</article-title>,&#x201D; <source>Nature</source>, vol. <volume>521</volume>, no. <issue>7553</issue>, pp. <fpage>436</fpage>&#x2013;<lpage>444</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Maji</surname></string-name></person-group>, &#x201C;<article-title>Bilinear cnn models for fine-grained visual recognition</article-title>,&#x201D; in <conf-name>Proc. ICCV</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>1449</fpage>&#x2013;<lpage>1457</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Russakovsky</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Deng</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Su</surname></string-name></person-group>, &#x201C;<article-title>Imagenet large scale visual recognition challenge</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>115</volume>, no. <issue>3</issue>, pp. <fpage>211</fpage>&#x2013;<lpage>252</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name></person-group>, &#x201C;<article-title>Going deeper with convolutions</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Cheng</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Additive margin softmax for face verification</article-title>,&#x201D; <source>IEEE Signal Processing Letters</source>, vol. <volume>25</volume>, no. <issue>7</issue>, pp. <fpage>926</fpage>&#x2013;<lpage>930</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Saqib</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Ditta</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Khan</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Asad</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Alquhayz</surname></string-name></person-group>, &#x201C;<article-title>Intelligent dynamic gesture recognition using cnn empowered by edit distance</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>66</volume>, no. <issue>2</issue>, pp. <fpage>2061</fpage>&#x2013;<lpage>2076</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Tan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Wu</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>A survey on digital image copy-move forgery localization using passive techniques</article-title>,&#x201D; <source>Journal of New Media</source>, vol. <volume>1</volume>, no. <issue>1</issue>, pp. <fpage>11</fpage>&#x2013;<lpage>25</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name></person-group>, &#x201C;<article-title>Part-based r-cnns for fine-grained category detection</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, <conf-loc>Springer, Switzerland</conf-loc>, pp. <fpage>834</fpage>&#x2013;<lpage>849</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. S.</given-names> <surname>Wei</surname></string-name> and <string-name><given-names>C. W.</given-names> <surname>Xie</surname></string-name></person-group>, &#x201C;<article-title>Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization</article-title>,&#x201D; <source>Pattern Recognition</source>, vol. <volume>23</volume>, no. <issue>4</issue>, pp. <fpage>704</fpage>&#x2013;<lpage>714</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>Object-part attention model for fine-grained image classification</article-title>,&#x201D; <source>IEEE Transactions on Image Processing</source>, vol. <volume>27</volume>, no. <issue>3</issue>, pp. <fpage>1487</fpage>&#x2013;<lpage>1500</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zheng</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Fu</surname></string-name></person-group>, &#x201C;<article-title>Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>5012</fpage>&#x2013;<lpage>5021</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y. K.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>D. B.</given-names> <surname>Pu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Qi</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Sun</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Multi-modality video representation for action recognition</article-title>,&#x201D; <source>Journal on Big Data</source>, vol. <volume>2</volume>, no. <issue>3</issue>, pp. <fpage>95</fpage>&#x2013;<lpage>104</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Yuan</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Jiao</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Mfffld: A multi-modal feature fusion based fingerprint liveness detection</article-title>,&#x201D; <source>IEEE Transactions on Cognitive and Developmental Systems</source>, vol. <volume>1</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>14</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Vanhoucke</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name></person-group>, &#x201C;<article-title>Rethinking the inception architecture for computer vision</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>2818</fpage>&#x2013;<lpage>2826</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>X. Z.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>Vehicle Re-identification model based on optimized densenet121 with joint loss</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>67</volume>, no. <issue>3</issue>, pp. <fpage>3933</fpage>&#x2013;<lpage>3948</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Wen</surname></string-name></person-group>, &#x201C;<article-title>Large-margin softmax loss for convolutional neural networks</article-title>,&#x201D; in <conf-name>Proc. ICML</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>7</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wen</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Yu</surname></string-name></person-group>, &#x201C;<article-title>Sphereface: Deep hypersphere embedding for face recognition</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>212</fpage>&#x2013;<lpage>220</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Cosface: Large margin cosine loss for deep face recognition</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>5265</fpage>&#x2013;<lpage>5274</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Deng</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Guo</surname></string-name></person-group>, &#x201C;<article-title>Arcface: Additive angular margin loss for deep face recognition</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>4690</fpage>&#x2013;<lpage>4699</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Welinder</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Branson</surname></string-name></person-group>, &#x201C;<article-title>Caltech-ucsd birds 200</article-title>,&#x201D; <source>California Institute of Technology</source>, vol. <volume>12</volume>, no. <issue>3</issue>, pp. <fpage>1487</fpage>&#x2013;<lpage>1500</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Krause</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Stark</surname></string-name></person-group>, &#x201C;<article-title>3D object representations for fine-grained categorization</article-title>,&#x201D; in <conf-name>Proc. ICCV</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>554</fpage>&#x2013;<lpage>561</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>G. C.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Fine-grained vehicle type classification using lightweight convolutional neural network with feature optimization and joint learning strategy</article-title>,&#x201D; <source>Multimedia Tools and Applications</source>, vol. <volume>80</volume>, pp. <fpage>30803</fpage>&#x2013;<lpage>30816</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Gao</surname></string-name> and <string-name><given-names>O.</given-names> <surname>Beijbom</surname></string-name></person-group>, &#x201C;<article-title>Compact bilinear pooling</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>317</fpage>&#x2013;<lpage>326</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Kong</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Fowlkes</surname></string-name></person-group>, &#x201C;<article-title>Low-rank bilinear pooling for fine-grained classification</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>365</fpage>&#x2013;<lpage>374</lpage>, <year>2017</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>