<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">37861</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.037861</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>PCATNet: Position-Class Awareness Transformer for Image Captioning</article-title>
<alt-title alt-title-type="left-running-head">PCATNet: Position-Class Awareness Transformer for Image Captioning</alt-title>
<alt-title alt-title-type="right-running-head">PCATNet: Position-Class Awareness Transformer for Image Captioning</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Tang</surname><given-names>Ziwei</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Yi</surname><given-names>Yaohua</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>whudcil@whu.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Yu</surname><given-names>Changhui</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Yin</surname><given-names>Aiguo</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Research Center of Graphic Communication, Printing and Packaging, Wuhan University</institution>, <addr-line>Wuhan, 430072</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Remote Sensing and Information Engineering, Wuhan University</institution>, <addr-line>Wuhan, 430072</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Zhuhai Pantum Electronics Co., Ltd.</institution>, <addr-line>Zhuhai, 519060</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Yaohua Yi. Email: <email>whudcil@whu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>1</day><month>5</month><year>2023</year></pub-date>
<volume>75</volume>
<issue>3</issue>
<fpage>6007</fpage>
<lpage>6022</lpage>
<history>
<date date-type="received"><day>18</day><month>11</month><year>2023</year>
</date>
<date date-type="accepted"><day>07</day><month>3</month><year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Tang et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Tang et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_37861.pdf"></self-uri>
<abstract>
<p>Existing image captioning models usually build the relation between visual information and words to generate captions, which lacks spatial information and object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network which can serve as a bridge between visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps the regions of objects to grids, calculates the relative distances among objects and quantizes them; we also improve the Self-attention to accommodate GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is employed to facilitate the embedding of features and the refinement of the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and decoder, namely OCA<sub>E</sub> and OCA<sub>D</sub>, respectively. Finally, we apply GMPE, OCA<sub>E</sub> and OCA<sub>D</sub> in various combinations to complete the entire PCAT. We evaluate the performance of our method on the MSCOCO dataset. The results demonstrate that PCAT outperforms the other competitive methods.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Image captioning</kwd>
<kwd>relative position encoding</kwd>
<kwd>object classes awareness</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Key Research and Development Program of China</funding-source>
<award-id>2021YFB2206200</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Image captioning is the task of generating human-like descriptions for images [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>]. Recently, image captioning has made great progress thanks to advances in image classification [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-6">6</xref>], object detection [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>] and machine translation [<xref ref-type="bibr" rid="ref-9">9</xref>]. Inspired by these, many researchers have proposed methods based on the encoder-decoder framework, in which images are encoded into features by a pre-trained Convolutional Neural Network (CNN) and then decoded into sentences by Recurrent Neural Network (RNN) [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>], Transformer [<xref ref-type="bibr" rid="ref-12">12</xref>] or Bert [<xref ref-type="bibr" rid="ref-13">13</xref>] models. In addition, attention mechanisms have been proposed to help the model build relevance between image regions and the generated sentence [<xref ref-type="bibr" rid="ref-14">14</xref>&#x2013;<xref ref-type="bibr" rid="ref-17">17</xref>]. Efforts to improve image captioning can therefore be summarized in two aspects: (1) optimizing the image representation [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>&#x2013;<xref ref-type="bibr" rid="ref-19">19</xref>], including visual features, position and class information, and (2) improving how that representation is processed by modifying the model structure [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>].</p>
<p>Intuitively, when a person describes an image, he or she implements three steps: (1) identify the regions and classes of objects, (2) build the relationships among them, and (3) search for the appropriate words to complete the caption. However, research on image captioning tends to overlook the first step and to focus on the latter two, directly constructing the relationship between visual features and words [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>&#x2013;<xref ref-type="bibr" rid="ref-23">23</xref>]. In the latest research, the image is represented by object-attribute region-based features [<xref ref-type="bibr" rid="ref-14">14</xref>] or grid features [<xref ref-type="bibr" rid="ref-19">19</xref>], whose class and position information is dropped. Recently, Wu et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] revisited position encoding for the visual Transformer and demonstrated that well-designed relative/absolute position encoding can improve the performance of visual features for object recognition and detection. Nevertheless, object positions often fall into disuse in image captioning, which inevitably results in the loss of spatial information. Besides, Li et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] propose feature pairs to bridge image features and language features, and then apply big-data pre-training to generate a corpus, which is time-consuming with the Bert model.</p>
<p>Motivated by the aforementioned works, we propose the Position-Class Awareness Transformer (PCAT) network for image captioning; the network is transformer-based with a distinctive position encoding and a structure for class feature embedding. On the one hand, we propose a relative position encoding method that quantizes spatial information into vectors for CNN-based visual features. We then embed these quantized vectors into the Self-attention [<xref ref-type="bibr" rid="ref-12">12</xref>] (SA) module to ameliorate the relations among objects in the encoder phase. On the other hand, to adopt the class information of detected objects, we embed class names as language vectors and reconstruct the Transformer, which builds the semantic relationships among objects and narrows the gap from vision to captions.</p>
<p>In this paper, we exploit the Transformer to construct our framework. In the encoder, a novel relative position encoding method is proposed to model the relationships among objects and is integrated into the Self-attention modules. Simultaneously, we construct an extra feature processing module to obtain the semantic associations of the classes in an image. In the decoder, we improve the block units by adding an independent attention unit, which bridges the gap from caption features to visual features. We employ the MSCOCO dataset and perform quantitative and qualitative analyses to evaluate our method. The experimental results demonstrate that our method achieves competitive performance with a CIDEr score of 138.3%.</p>
<p>The main contributions of this paper are as follows:
<list list-type="order">
<list-item>
<p>We propose the Position-Class Awareness Transformer (PCAT) network to boost image captioning by the spatial information and detected object classes.</p></list-item>
<list-item>
<p>We propose a relative position encoding method, namely Grid Mapping Position Encoding (GMPE), which intuitively measures the distances among objects for the Self-attention module, strengthening their correlation and subordination.</p></list-item>
<list-item>
<p>We propose a Classes Semantic Quantization strategy to improve the representation of class names and refine the encoder-decoder framework by Object Classes Awareness (OCA) to model the interaction between vision and language.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Image Captioning</title>
<p>Solutions for image captioning have been proposed upon the encoder-decoder framework in recent years. For example, Vinyals et al. [<xref ref-type="bibr" rid="ref-26">26</xref>] propose the CNN-LSTM architecture to encode the image into features and decode them into a caption. Anderson et al. apply a two-layer Long Short-Term Memory (LSTM) network to concentrate on the weighting stage. These methods all employ an RNN-based decoder, which may lose relevance when two generated words are separated by a large step interval [<xref ref-type="bibr" rid="ref-23">23</xref>]. Image captioning was trapped in this issue until Google proposed the Transformer [<xref ref-type="bibr" rid="ref-12">12</xref>], which applies Self-attention to calculate the similarity matrix between vision and language.</p>
<p>The Transformer is still an encoder-decoder framework, consisting of attention and Feed-Forward Network modules. Upon this, some optimized Transformers have been proposed to obtain better features by improving the model structure. M<sup>2</sup>-Transformer [<xref ref-type="bibr" rid="ref-22">22</xref>] encodes image regions and their relationships into a multi-layer structure to fuse both shallow and deep relationships. The generation of sentences then adopts a multi-level structure using low- and high-level visual relations, which outperforms the use of single semantic features. However, M<sup>2</sup>-Transformer is an optimization of the Transformer; it still focuses on the features themselves and cannot solve the splitting problem of cross-modal feature conversion in image captioning. To address this issue, X-Transformer [<xref ref-type="bibr" rid="ref-20">20</xref>] focuses on the interaction between image and language through spatial and channel bilinear attention distributions. With this improvement, X-Transformer achieved excellent performance in 2020. Furthermore, Zhang et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] propose RSTNet, which weighs the contributions of visual features and context features while generating fine-grained captions through novel adaptive attention. Meanwhile, RSTNet is the first to apply grid features of the image to image captioning, obtaining excellent performance. Since 2021, many pre-training methods for image captioning have been proposed. For example, Zhang et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] and Li et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] research the visual representation of an image and propose grid features and a pre-training strategy for visual object features, respectively. These pre-training methods apply big data to construct relationships between visual features and language features and achieve powerful performance for image captioning.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Self-Attention and Position Encoding in Transformer</title>
<p>Self-attention is the core sub-unit of the Transformer, mapping queries, keys and values to outputs. For each input token <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, Self-attention outputs a corresponding element <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, which can be computed as follows:</p>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi mathvariant="normal">&#x2202;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:mi>d</mml:mi></mml:msqrt></mml:mfrac></mml:math></disp-formula>where the projections <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> are trainable matrices, <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi mathvariant="normal">&#x2202;</mml:mi></mml:math></inline-formula> is the SoftMax function and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is the scaled dot-product attention.</p>
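<p>As a minimal illustration, Eqs. (1)&#x2013;(3) can be sketched in NumPy; the function name, toy dimensions and random projections below are our own stand-ins for trained weights, not the paper's implementation.</p>

```python
import numpy as np

def self_attention(V, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token matrix V (n x d),
    following Eqs. (1)-(3)."""
    Q, K, Val = V @ Wq, V @ Wk, V @ Wv          # query/key/value projections
    d = Q.shape[-1]
    delta = (Q @ K.T) / np.sqrt(d)              # Eq. (3): scaled dot-product logits
    w = np.exp(delta - delta.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # Eq. (2): row-wise softmax
    return w @ Val                              # Eq. (1): one output z_i per token

rng = np.random.default_rng(0)
n, d = 4, 8
V = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Z = self_attention(V, Wq, Wk, Wv)
```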
<p>The position encoding methods discussed here are absent from existing image captioning encoders. Position encoding was initially designed to impose the order of a sequence on the embedded tokens [<xref ref-type="bibr" rid="ref-12">12</xref>], named absolute position encoding, which can be formulated as:</p>
<p><disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the positional encoding and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. The encoding can be accomplished in several ways, such as fixed sine and cosine functions or learnable parameters [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-27">27</xref>].</p>
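<p>Eq. (4) with the classic fixed sine/cosine encodings can be sketched as follows; the helper name and toy sizes are illustrative, and an even feature dimension d is assumed.</p>

```python
import numpy as np

def sinusoidal_encoding(n_tokens, d):
    """Fixed sine/cosine positional encodings p_i, added to token
    embeddings as in Eq. (4): v_i = v_i + p_i (d assumed even)."""
    pos = np.arange(n_tokens)[:, None]        # token index i
    k = np.arange(0, d, 2)[None, :]           # even feature dimensions
    angle = pos / np.power(10000.0, k / d)
    p = np.zeros((n_tokens, d))
    p[:, 0::2] = np.sin(angle)
    p[:, 1::2] = np.cos(angle)
    return p

P = sinusoidal_encoding(10, 16)
tokens = np.ones((10, 16))
tokens = tokens + P                           # Eq. (4)
```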
<p>Besides absolute position encoding, researchers have recently reconsidered the pairwise relationships between tokens. Relative position encoding is significant for tasks that rely on distance or order to measure association [<xref ref-type="bibr" rid="ref-24">24</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>]. The relative position between tokens <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is encoded into <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and embedded into the Self-attention, which can be defined as:</p>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
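<p>A minimal sketch of Eqs. (5)&#x2013;(7), assuming one learned encoding vector per token pair stored in n x n x d arrays; all names and toy sizes here are our own illustrations.</p>

```python
import numpy as np

def project_with_relative_positions(v, Wq, Wk, Wv, pQ, pK, pV, i, j):
    """Per-pair projections of Eqs. (5)-(7): the relative encodings p_{i,j}
    are added to the projected query, key and value of the token pair (i, j)."""
    q_i = v[i] @ Wq + pQ[i, j]    # Eq. (5)
    k_j = v[j] @ Wk + pK[i, j]    # Eq. (6)
    v_j = v[j] @ Wv + pV[i, j]    # Eq. (7)
    return q_i, k_j, v_j

rng = np.random.default_rng(1)
n, d = 5, 8
v = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
pQ = pK = pV = rng.standard_normal((n, n, d)) * 0.1   # one encoding per (i, j) pair
q, k, val = project_with_relative_positions(v, Wq, Wk, Wv, pQ, pK, pV, 0, 3)
```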
<p>Although relative position encoding has been widely applied in object detection, it is rarely employed in image captioning. Considering the semantic information carried by relative distance, we believe it can advance the interaction between vision and language.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>The Proposed Method</title>
<p>The architecture of the PCAT network is presented in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. We use a transformer-based framework. The encoder includes <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>N</mml:mi></mml:math></inline-formula> refining blocks, which are in charge of embedding position and object class information to capture the relationships among detected objects. The decoder applies the image features and reconstructs the blocks to embed object class information between the captions and the visual features, bridging them.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overview of our proposed PCAT network</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_37861-fig-1.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Grid Mapping Position Encoding</title>
<p>As extra information about the objects in an image, position encoding is usually ignored in image captioning. To re-weigh the attention with spatial information, we design a novel learnable spatial grid feature map to improve position encoding, and we update the Self-attention to accommodate it.</p>
<p>Given the objects detected in an image <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>I</mml:mi></mml:math></inline-formula>, we first extract their center point position, height and width of regions as <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. Then, we design a learnable spatial grid feature map <italic>M</italic> and set its size to <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, and apply the data process <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> to work out the absolute distance encoding <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> via the objects&#x2019; positions <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the <italic>M</italic>, which can be formulated as:</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Illustration of Grid Mapping</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_37861-fig-2.tif"/>
</fig>
<p><disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-18"><mml:math 
id="mml-ieqn-18"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mo>,</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the position grid feature with indexes (the blue box in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>) and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the compensation of the extra region which can&#x2019;t be covered by a grid (the shaded area in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>). Finally, we apply the Euclidean distance and linearization to compute relative distance and obtain the relative distance encoding <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
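<p>Since the exact forms of the process &#x03B8; and the learnable maps are internal to the method, the following is only a plausible sketch of the grid-mapping idea: quantize object centers onto an m x m grid and take pairwise Euclidean distances between cells as the basis for the relative encodings. All names and toy sizes are our own illustrations, not the paper's implementation.</p>

```python
import numpy as np

def grid_index(cx, cy, img_w, img_h, m):
    # Quantize a box center onto an m x m grid (a plausible stand-in for theta).
    gx = min(int(cx / img_w * m), m - 1)
    gy = min(int(cy / img_h * m), m - 1)
    return gy, gx

def relative_distance_matrix(centers, img_w, img_h, m):
    # Map each object's center to a grid cell, then take pairwise Euclidean
    # distances between cells as the basis for the encodings p_{i,j}.
    cells = np.array([grid_index(cx, cy, img_w, img_h, m) for cx, cy in centers])
    diff = cells[:, None, :] - cells[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

centers = [(50, 40), (200, 40), (200, 180)]     # toy box centers in pixels
D = relative_distance_matrix(centers, img_w=256, img_h=256, m=8)
```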
<p>As the spatial information of objects, we refine the Self-attention of the encoder (which is inspired by the contextual mode in [<xref ref-type="bibr" rid="ref-24">24</xref>]) to embed the relative position encodings <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as shown in the encoder in the green box of <xref ref-type="fig" rid="fig-1">Fig. 1</xref> and the optimized Self-attention in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Illustration of optimized Self-attention with position encoding</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_37861-fig-3.tif"/>
</fig>
<p>Considering the interaction of visual features of objects <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>v</mml:mi></mml:math></inline-formula>, we regard the relative position encoding <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as the bias for the similarity matrix <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> of query and key. Therefore, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref> can be refined as:</p>
<p><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:mi>d</mml:mi></mml:msqrt></mml:mfrac></mml:math></disp-formula>where the projection <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is a trainable matrix.</p>
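The position-biased attention logits of Eq. (9) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released model: the function name, the assumption of a single shared query projection, and the matrix shapes are all ours.

```python
import numpy as np

def position_biased_scores(v, p, Wq, Wk):
    """Attention logits with a relative-position bias, in the spirit of Eq. (9).

    v  : (n, d)    visual features of the n detected objects
    p  : (n, n, d) relative position encodings p_{i,j} (hypothetical shape)
    Wq, Wk : (d, d) query/key projections standing in for W_p^Q and W^K
    """
    d = v.shape[1]
    q = v @ Wq                        # projected queries (v_i W_p^Q)
    k = v @ Wk                        # projected keys    (v_j W^K)
    content = q @ k.T                 # content similarity term, (n, n)
    # position term: each query dotted with its own row of position encodings
    position = np.einsum('id,ijd->ij', q, p)
    return (content + position) / np.sqrt(d)
```

With the position encodings set to zero, the bias vanishes and this reduces to ordinary scaled dot-product attention.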
<p><bold>Grid Mapping.</bold> The Transformer for image captioning abandons position encoding during encoding because an image is 2D and the regions of objects do not form a sequence. To calculate positions on the 2D image and define the absolute distance encoding <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, we propose an undirected mapping method with the mapping process <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> and the feature maps <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi mathvariant="normal">&#x0026;</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
<p>Considering that each grid of the feature map <italic>M</italic> possesses a fixed index which can be represented as a 2D sequence, what we should concentrate on is how many grids an object region covers. Unfortunately, an object region can hardly be mapped entirely onto a single grid region. Thus, we define a parameter <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to represent the partial feature map of the neighbor grids. To address this issue, we first calculate the corresponding indexes of the covered grids for each of the detected regions and find the center of the covered grids (the blue box in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>), which can be formulated as:</p>
<p><disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2308;</mml:mo><mml:mfrac><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mfrac><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mi>m</mml:mi></mml:mfrac></mml:mfrac><mml:mo>&#x2309;</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>&#x2308;</mml:mo><mml:mfrac><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mfrac><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mi>m</mml:mi></mml:mfrac></mml:mfrac><mml:mo>&#x2309;</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> refer to the height and width of <italic>I</italic>. Then, we collect the 8-neighbor grids of the computed <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the process can be formulated as:</p>
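The ceiling mapping of Eqs. (10) and (11) amounts to dividing a region-center coordinate by the pixel extent of one grid cell and rounding up. A minimal sketch, assuming a square feature map with m grids per axis (the function name and argument layout are ours):

```python
import math

def grid_index(coord, image_extent, m):
    """Map a region-center coordinate to its 1-based grid index (Eqs. 10-11 sketch).

    coord        : center coordinate along one axis, in pixels
    image_extent : h_I (for rows) or w_I (for columns) of image I
    m            : number of grids along that axis of feature map M
    """
    cell = image_extent / m           # pixel extent of one grid cell, h_I/m or w_I/m
    return math.ceil(coord / cell)    # ceil(p / (h_I/m)), as in the equations
```

For example, a 480-px-high image split into m = 8 rows gives 60-px cells, so a center at y = 125 falls into row 3.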
<p><disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>8</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the intersection between the <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>n</mml:mi></mml:math></inline-formula>th-neighbor grid and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, calculated by <inline-formula id="ieqn-37"><mml:math 
id="mml-ieqn-37"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2033;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. Note that if the 8-neighbor grids do not cover the target entirely, we can calculate all the covered grid features for <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>i</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, except the center grid, using the approach of concentric circles with different weights; the 8-neighbor case is the standard instance of the concentric circles. Therefore, we can obtain the absolute position encoding <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
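Eq. (12) is a weighted sum: each neighbor grid feature contributes in proportion to its overlap with the object box. A minimal sketch, assuming the overlap ratios λ_n have already been computed from the region tuple P_i (the function name and array shapes are illustrative):

```python
import numpy as np

def m_bias(neighbor_feats, overlaps):
    """Compensation feature M_bias (Eq. 12 sketch).

    neighbor_feats : (8, d) features M_i^n of the 8-neighbor grids
    overlaps       : (8,)   intersection ratios lambda_n between the object
                     region and each neighbor grid (0 if the box misses it)
    """
    neighbor_feats = np.asarray(neighbor_feats, dtype=float)
    overlaps = np.asarray(overlaps, dtype=float)
    # sum over n of lambda_n * M_i^n
    return (overlaps[:, None] * neighbor_feats).sum(axis=0)
```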
<p><bold>Relative Position Encoding</bold>. The relative position encoding <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is determined by the relative distance calculation and the correlation between objects. Note that no relative distance is mapped into an integer, because the semantic distance between two objects cannot be fully replaced by positional distance. Therefore, we follow <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref> to obtain the two mapped absolute distance encodings <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and measure their relative distance encoding <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>.
Considering that the order of the words corresponding to objects cannot be predicted during generation, we regard the vectors <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> as <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, measure their Euclidean distance, and linearize it into the encoding <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> influenced by the captions:</p>
<p><disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msqrt><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>p</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:msqrt></mml:math></disp-formula></p>
<p><disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is a trainable matrix.</p>
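Eqs. (13) and (14) reduce to a Euclidean distance followed by a learned linearization. A minimal sketch, assuming the absolute encodings are vectors and W_R maps the scalar distance into a d-dimensional encoding (the (1, d) shape of W_R is our assumption):

```python
import numpy as np

def relative_position_encoding(pa_i, pa_j, W_R):
    """Relative distance encoding p_{i,j} (Eqs. 13-14 sketch).

    pa_i, pa_j : absolute position encodings p_i^A, p_j^A as vectors
    W_R        : (1, d) trainable linearization matrix (hypothetical shape)
    """
    diff = np.asarray(pa_i, dtype=float) - np.asarray(pa_j, dtype=float)
    d_rel = np.sqrt((diff ** 2).sum())   # Euclidean distance d_{i,j}^R (Eq. 13)
    return d_rel * W_R[0]                # linearize into the encoding (Eq. 14)
```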
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Object Classes Awareness Methods</title>
<p>In contrast to the positional encodings, classes carry the semantic information of objects. Classes can be viewed as tokens in two different forms: (a) they are attributes of the visual objects; (b) they are sources of words in the captions. Based on these two characteristics, we propose the Classes Semantic Quantization strategy to quantize the object-class words, as well as the Object Classes Awareness network (OCA) to refine the encoder-decoder framework.</p>
<p><bold>Classes Semantic Quantization</bold></p>
<p>The object classes are essentially words <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. To quantize them and ensure they share the same semantic field as the captions, we utilize the word dictionary (Section 4.1) from the dataset to quantize <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> into word embedding vectors <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<p><disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the word embedding matrix. Note that some classes are word groups <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo>|</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, such as &#x201C;<italic>hot dog</italic>&#x201D;, &#x201C;<italic>traffic lights</italic>&#x201D; and &#x201C;<italic>fire hydrant</italic>&#x201D;. 
If the word group can be represented by a core word, for example &#x201C;<italic>fire hydrant</italic>&#x201D; is almost equivalent to the word &#x201C;<italic>hydrant</italic>&#x201D;, we crop the auxiliary word. Otherwise, if the word group carries new semantics different from any single word, we add their vectors together. The process can be defined as a piecewise function:</p>
<p><disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi mathvariant="normal">&#x2200;</mml:mi><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2248;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mi mathvariant="normal">&#x2203;</mml:mi><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2260;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the word group and <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>n</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
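The piecewise rule of Eqs. (15) and (16) can be sketched as follows. This is an illustration only: the `embed` dictionary stands in for the word-embedding lookup w W_e, and the `core_word` argument is our stand-in for the decision of whether the group is representable by one core word.

```python
import numpy as np

def quantize_class(words, embed, core_word=None):
    """Classes Semantic Quantization (Eqs. 15-16 sketch).

    words     : list of words forming one (possibly multi-word) class label
    embed     : dict mapping a word to its embedding vector (stand-in for w W_e)
    core_word : if the group is representable by one core word
                (e.g. "fire hydrant" -> "hydrant"), embed it alone;
                otherwise sum the embeddings of all words in the group.
    """
    if core_word is not None:
        return np.asarray(embed[core_word], dtype=float)
    return np.sum([np.asarray(embed[w], dtype=float) for w in words], axis=0)
```

For instance, "hot dog" would be summed (its meaning differs from either word alone), while "fire hydrant" would collapse to the embedding of "hydrant".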
<p><bold>Encoder</bold></p>
<p>For the encoder, we quantize the class words to the vectors <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and accept them as tokens homologous with the visual features <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>v</mml:mi></mml:math></inline-formula> of the detected objects. We refine the encoder by improving the encoder block. The OCA module (the purple box in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>) is proposed to build the relationships among classes, mirroring the Self-attention over the <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mi>v</mml:mi></mml:math></inline-formula>. We apply an extra multi-head Self-attention (MHSA), a residual structure and layer-normalization (LayerNorm) to model the semantic relationships of objects, which can be defined as:</p>
<p><disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mrow><mml:mtext>ReLU</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>n</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> means the <italic>n</italic>-th head of the MHSA, the projection <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is a learnable matrix, the [,] is the concatenation of the vectors, ReLU is the activation function and <inline-formula id="ieqn-62"><mml:math 
id="mml-ieqn-62"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the results of the MHSA over the class vectors and of the LayerNorm, respectively. Besides, we compute them with separate, non-shared layers because <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msup><mml:mi>v</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> possesses a different modality from <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>v</mml:mi></mml:math></inline-formula>.</p>
<p>In this phase, we propose a fusion strategy to embed class information for the vision encoder, namely OCA<sub>E</sub>. We provide <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> for the visual features <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>v</mml:mi></mml:math></inline-formula> to add semantic information and help the model optimize relationships among objects. As shown by the blue dashed line in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, we input <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>v</mml:mi></mml:math></inline-formula> simultaneously. Considering the improved Self-attention with position encoding, we simply update the <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>v</mml:mi></mml:math></inline-formula> with <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, defined as:</p>
<p><disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:mi>v</mml:mi><mml:mo>+</mml:mo><mml:msubsup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
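The OCA<sub>E</sub> pipeline of Eqs. (17)-(20) can be sketched with a single attention head: self-attention over the class embeddings, an output projection, and a residual injection into the visual features. The single-head simplification, the softmax helper, and the name `W_O` (standing in for W_E^0) are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def oca_encoder_fusion(v, v_cls, W_O):
    """OCA_E fusion (Eqs. 17-20 sketch, single attention head).

    v     : (n, d) visual features of the detected objects
    v_cls : (n, d) quantized class embedding vectors
    W_O   : (d, d) output projection standing in for W_E^0
    """
    d = v_cls.shape[1]
    attn = softmax(v_cls @ v_cls.T / np.sqrt(d))  # self-attention over class tokens (Eq. 17)
    z_head = (attn @ v_cls) @ W_O                 # projected head output (Eq. 18)
    return v + z_head                             # inject class semantics into v (Eq. 20)
```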
<p><bold>Decoder</bold></p>
<p>The decoder aims to generate the final captions with the visual features and class information from the encoder. As shown in the blue box of <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, we refine the decoder with an additional feature processing module that embeds class information between language features and visual features, namely OCA<sub>D</sub>. The refined decoder consists of <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>N</mml:mi></mml:math></inline-formula> blocks, each of which can be divided into three modules: (a) Language Masked MHSA Module, which models the interaction among the generated words; (b) Bridge MHSA Module (words-to-classes), which includes an MHSA, a residual connection and a LayerNorm and can be regarded as the interaction between caption words and detected object names; (c) Cross MHSA Module (classes-to-vision), which contains an MHSA, a feed-forward network (FFN), residual connections, LayerNorms, a linear layer and a SoftMax function, and eventually generates the caption word by word.</p>
<p><bold>Language Masked MHSA Module</bold>. We apply this module to build the relations (words-to-words) among the words <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> that can be represented as:</p>
<p><disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:msub><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">(</mml:mo></mml:mrow></mml:mstyle><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mtext>MHSA</mml:mtext></mml:mrow><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mstyle scriptlevel="0"><mml:mrow><mml:mo maxsize="2.047em" minsize="2.047em">)</mml:mo></mml:mrow></mml:mstyle></mml:math></disp-formula>where <inline-formula id="ieqn-73"><mml:math 
id="mml-ieqn-73"><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> are learnable matrices and <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> denotes the vector of the word at the <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>-th step.</p>
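The Language Masked MHSA of Eq. (21) is causal self-attention over the already-generated words plus a residual connection. A single-head sketch under our own simplifications (LayerNorm omitted, one head, hypothetical function name):

```python
import numpy as np

def masked_self_attention(Y, Wq, Wk, Wv):
    """Language Masked self-attention (Eq. 21 sketch, single head, no LayerNorm).

    Y : (t, d) embeddings of the already-generated words y_{1:t-1}
    Wq, Wk, Wv : (d, d) projections standing in for W_l^Q, W_l^K, W_l^V
    A causal mask keeps each step from attending to future words.
    """
    t, d = Y.shape
    q, k, v = Y @ Wq, Y @ Wk, Y @ Wv
    logits = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    logits[future] = -1e9                               # block attention to future words
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return Y + w @ v                                    # residual connection, as in Eq. (21)
```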
<p><bold>Bridge MHSA Module (words-to-classes)</bold>. Because a detected object itself corresponds to a region and the class of this object is semantic, we propose the structure of words-to-classes-to-vision. This module aims to model the relationship between words <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and class features <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Therefore, we construct bridge attention to capture the class context information, which denotes the primary multi-modal interaction to bridge language and vision by classes and can be formulated as:</p>
<p><disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:msub><mml:mi>&#x03BA;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>MHSA</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:msubsup><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>Z</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BA;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> are learnable matrices, <inline-formula id="ieqn-79"><mml:math 
id="mml-ieqn-79"><mml:msubsup><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denotes the output of the Bridge MHSA given <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mover><mml:mi>y</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and serves as the input to the Cross MHSA Module (classes-to-vision).</p>
<p><bold>Cross MHSA Module (classes-to-vision).</bold> This module models the relationship between the attended classes <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msup><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and the visual features <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>Z</mml:mi></mml:math></inline-formula>; it is the second multi-modal interaction, bridging language and vision by applying MHSA to classes and vision. The process is given by:</p>
<p><disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>MHSA</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>Z</mml:mi><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>Z</mml:mi><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:msub><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>Z</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><disp-formula id="eqn-26"><label>(26)</label><mml:math id="mml-eqn-26" display="block"><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LayerNorm</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mtext>FFN</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> are learned parameters, <inline-formula id="ieqn-84"><mml:math 
id="mml-ieqn-84"><mml:msubsup><mml:mover><mml:mi>Z</mml:mi><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> from the former module is fed into MHSA as the query, and the visual features <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mi>Z</mml:mi></mml:math></inline-formula>, which incorporate position information from the encoder, are fed into MHSA as the key and value.</p>
<p>The probability distribution over the vocabulary is then computed as:</p>
<p><disp-formula id="eqn-27"><label>(27)</label><mml:math id="mml-eqn-27" display="block"><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msup><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is a learnable matrix.</p>
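<p>To make the decoding step concrete, the following single-head NumPy sketch traces Eqs. (22)&#x2013;(27): bridge attention from words to classes, cross attention from classes to vision, and the final vocabulary softmax. All sizes and names are illustrative assumptions, not the authors' implementation; full MHSA is simplified to single-head attention, and the word, class and region counts are kept equal so the printed residuals in Eqs. (23) and (25) are well-defined.</p>

```python
import numpy as np

d, n, vocab = 64, 5, 100            # toy width, sequence length, vocabulary (assumed)
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention, standing in for MHSA
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def ffn(x, w1, w2):
    # position-wise feed-forward network with ReLU
    return np.maximum(x @ w1, 0.0) @ w2

y_prev = rng.normal(size=(n, d))    # masked word features (the tilde-y_{t-1})
z_cls  = rng.normal(size=(n, d))    # class features Z^Cls from the encoder
z_vis  = rng.normal(size=(n, d))    # visual features Z from the encoder

w = lambda: rng.normal(scale=d ** -0.5, size=(d, d))  # fresh learnable matrix

# Bridge MHSA (words-to-classes), Eqs. (22)-(23)
kappa     = attention(y_prev @ w(), z_cls @ w(), z_cls @ w())
z_cls_att = layer_norm(z_cls + kappa @ w())

# Cross MHSA (classes-to-vision), Eqs. (24)-(26)
lam   = attention(z_cls_att @ w(), z_vis @ w(), z_vis @ w())
z_att = layer_norm(z_vis + lam @ w())
y_cv  = layer_norm(z_att + ffn(z_att, w(), w()))

# word distribution, Eq. (27)
p = softmax(y_cv @ rng.normal(scale=d ** -0.5, size=(d, vocab)))
print(p.shape)                      # one distribution per decoding position
```

In a trained model the weight matrices are learned rather than sampled; the sketch only shows how the two attention stages chain words to classes and classes to vision before the softmax.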
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Training and Objectives</title>
<p><bold>Training with Cross-Entropy Loss.</bold> First, we train our model with the Cross-Entropy Loss <italic>L</italic><sub><italic>XE</italic></sub>:</p>
<p><disp-formula id="eqn-28"><label>(28)</label><mml:math id="mml-eqn-28" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>X</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msubsup><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the ground truth.</p>
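<p>Eq. (28) is the standard teacher-forced negative log-likelihood: at each step the model's probability for the ground-truth word is accumulated in log space. A minimal sketch, with toy per-step distributions and ground-truth indices invented for illustration:</p>

```python
import numpy as np

# toy per-step vocabulary distributions p_theta(y_t | y*_{1:t-1}); T=3 steps, V=4 words
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
gt = [0, 1, 3]                                  # ground-truth word indices y*_t

# L_XE = -sum_t log p_theta(y*_t | y*_{1:t-1})
l_xe = -np.log(p[np.arange(len(gt)), gt]).sum()
print(round(l_xe, 4))                           # -(ln 0.7 + ln 0.6 + ln 0.7)
```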
<p><bold>Optimization with CIDEr Score.</bold> Then, we employ Self-Critical Sequence Training (SCST) [<xref ref-type="bibr" rid="ref-29">29</xref>] to further optimize the model:</p>
<p><disp-formula id="eqn-29"><label>(29)</label><mml:math id="mml-eqn-29" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>R</mml:mi><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="bold">E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where the reward <italic>r</italic>(.) is computed by CIDEr (Consensus-based Image Description Evaluation) [<xref ref-type="bibr" rid="ref-30">30</xref>]. The gradient can be defined as:</p>
<p><disp-formula id="eqn-30"><label>(30)</label><mml:math id="mml-eqn-30" display="block"><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>R</mml:mi><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2248;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> denotes a caption sampled from the model's distribution and <inline-formula id="ieqn-89"><mml:math 
id="mml-ieqn-89"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> denotes the caption obtained by greedy decoding.</p>
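<p>Eq. (30) is REINFORCE with the greedy-decoded reward as a self-critical baseline: the sampled caption's log-likelihood gradient is scaled by how much its CIDEr reward exceeds the greedy caption's. A minimal sketch of the surrogate loss whose gradient is this estimate; the rewards and log-probabilities below are invented toy numbers:</p>

```python
import numpy as np

# toy quantities for one sampled caption y^s and its greedy baseline y-hat
log_probs_sampled = np.array([-1.2, -0.8, -1.5])  # log p_theta of each sampled word
r_sampled = 1.10                                  # CIDEr reward r(y^s), assumed
r_greedy  = 0.95                                  # CIDEr reward r(y-hat), assumed

# advantage (r(y^s) - r(y-hat)): positive -> reinforce the sample, negative -> suppress it
advantage = r_sampled - r_greedy
loss_rl = -advantage * log_probs_sampled.sum()    # minimizing this gives Eq. (30)
print(round(loss_rl, 3))
```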
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset and Implementation Details</title>
<p>We conduct experiments on the MSCOCO dataset [<xref ref-type="bibr" rid="ref-31">31</xref>]. The dataset contains 123287 images (82783 for training and 40504 for validation), each annotated with 5 captions. We adopt the Karpathy split [<xref ref-type="bibr" rid="ref-32">32</xref>] to obtain the training, validation and testing sets. Besides, we collect the words that occur more than 4 times in the training set, yielding a dictionary of 10369 words. The metrics of BLEU (Bilingual Evaluation Understudy) [<xref ref-type="bibr" rid="ref-33">33</xref>], CIDEr [<xref ref-type="bibr" rid="ref-30">30</xref>], METEOR (Metric for Evaluation of Translation with Explicit ORdering) [<xref ref-type="bibr" rid="ref-34">34</xref>], SPICE (Semantic Propositional Image Caption Evaluation) [<xref ref-type="bibr" rid="ref-35">35</xref>] and ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) [<xref ref-type="bibr" rid="ref-36">36</xref>] are applied to evaluate our method. We compute these metrics with the public evaluation code released with the MSCOCO dataset.</p>
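<p>The dictionary construction described above (keep words occurring more than 4 times in the training captions) can be sketched as a simple frequency filter; the toy captions below are invented for illustration:</p>

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    # keep words occurring more than 4 times, i.e. at least min_count=5 times
    counts = Counter(word for cap in captions for word in cap.lower().split())
    return sorted(word for word, n in counts.items() if n >= min_count)

toy = ["a man on a horse"] * 3 + ["a dog on a couch"] * 2
print(build_vocab(toy))   # only "a" and "on" reach the threshold in this toy set
```

On the real training set this filter yields the 10369-word dictionary used in the experiments.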
<p>Differing from image grid features, our framework demands accurate object classes and position information. Therefore, we exploit the Objects365 [<xref ref-type="bibr" rid="ref-37">37</xref>], MSCOCO [<xref ref-type="bibr" rid="ref-31">31</xref>], OpenImages [<xref ref-type="bibr" rid="ref-38">38</xref>] and Visual Genome [<xref ref-type="bibr" rid="ref-39">39</xref>] datasets to train the Faster R-CNN model [<xref ref-type="bibr" rid="ref-7">7</xref>] for extracting object features, and merge their classes to obtain a label list with more than 1800 classes, similar to VinVL [<xref ref-type="bibr" rid="ref-18">18</xref>]. The objects&#x2019; visual vectors are extracted as 2048-dimensional features and projected to 512 dimensions to match the embedding size. The number of blocks <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>N</mml:mi></mml:math></inline-formula> is set to 6. For Cross-Entropy Loss training, we run 20 epochs with the Adam optimizer and a learning rate of 4e-4 decayed by a factor of 0.8 every 2 epochs. For CIDEr Score Optimization over another 30 epochs, we set the learning rate to 4e-5 and decay it by 50%. Furthermore, the batch size is 10 and the beam size is 2.</p>
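<p>The cross-entropy-stage schedule above (base rate 4e-4, multiplied by 0.8 every 2 epochs) is a plain step decay; this small helper is an illustrative sketch, not the authors' training code:</p>

```python
def xe_lr(epoch, base=4e-4, gamma=0.8, step=2):
    # step decay: multiply the base rate by gamma once every `step` epochs
    return base * gamma ** (epoch // step)

for e in (0, 2, 4, 6):
    print(e, xe_lr(e))   # 4e-4, then reduced by a factor of 0.8 every 2 epochs
```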
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Comparisons with Other Models</title>
<p>We report the performance of our method and of competing methods in <xref ref-type="table" rid="table-1">Table 1</xref>. The compared methods include Show&#x0026;tell (LSTM) [<xref ref-type="bibr" rid="ref-26">26</xref>], SCST [<xref ref-type="bibr" rid="ref-29">29</xref>], RFNet [<xref ref-type="bibr" rid="ref-11">11</xref>], UpDown [<xref ref-type="bibr" rid="ref-14">14</xref>], AoANet [<xref ref-type="bibr" rid="ref-40">40</xref>], Pos-aware [<xref ref-type="bibr" rid="ref-41">41</xref>], M<sup>2</sup>-Transformer [<xref ref-type="bibr" rid="ref-22">22</xref>], X-Transformer [<xref ref-type="bibr" rid="ref-20">20</xref>], RSTNet [<xref ref-type="bibr" rid="ref-19">19</xref>] and PureT [<xref ref-type="bibr" rid="ref-42">42</xref>]. These methods are built on either LSTM or Transformer architectures.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>The results of our method and other methods. B, M, R, C and S denote the BLEU, METEOR, ROUGE-L, CIDEr and SPICE metrics. &#x002A; indicates results that we reproduce based on VinVL</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>B-1</th>
<th>B-2</th>
<th>B-3</th>
<th>B-4</th>
<th>M</th>
<th>R</th>
<th>C</th>
<th>S</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="9" align="center">Trained by cross-entropy loss</td>
</tr>
<tr>
<td>Baseline [<xref ref-type="bibr" rid="ref-18">18</xref>]<sup>&#x002A;</sup></td>
<td>76.7</td>
<td>61.3</td>
<td>47.1</td>
<td>36.7</td>
<td>28.2</td>
<td>57.1</td>
<td>118.5</td>
<td>20.9</td>
</tr>
<tr>
<td>LSTM [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>29.6</td>
<td>25.2</td>
<td>52.6</td>
<td>94.0</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SCST [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>30.0</td>
<td>25.9</td>
<td>53.4</td>
<td>99.4</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Adaptive-Attention [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>73.4</td>
<td>56.6</td>
<td>41.8</td>
<td>30.4</td>
<td>25.7</td>
<td>&#x2013;</td>
<td>102.9</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>RFNet [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>76.4</td>
<td>60.4</td>
<td>46.6</td>
<td>35.8</td>
<td>27.4</td>
<td>56.5</td>
<td>112.5</td>
<td>20.5</td>
</tr>
<tr>
<td>UpDown [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>77.2</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>36.2</td>
<td>27.0</td>
<td>56.4</td>
<td>113.5</td>
<td>20.3</td>
</tr>
<tr>
<td>AoANet [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>77.4</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>37.2</td>
<td>28.4</td>
<td>57.5</td>
<td>119.8</td>
<td>21.3</td>
</tr>
<tr>
<td>X-Transformer [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>77.3</td>
<td>61.5</td>
<td>47.8</td>
<td>37.0</td>
<td>28.7</td>
<td>57.5</td>
<td>120.0</td>
<td>21.8</td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>E</bold></sub><bold>)</bold></td>
<td>77.7</td>
<td><bold>61.8</bold></td>
<td><bold>47.8</bold></td>
<td><bold>37.4</bold></td>
<td><bold>28.8</bold></td>
<td><bold>57.6</bold></td>
<td><bold>122.3</bold></td>
<td><bold>22.1</bold></td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>D</bold></sub><bold>)</bold></td>
<td><bold>77.8</bold></td>
<td>61.7</td>
<td><bold>47.8</bold></td>
<td>37.2</td>
<td>28.7</td>
<td>57.5</td>
<td>121.2</td>
<td>21.9</td>
</tr>
<tr>
<td colspan="9" align="center">Optimized by CIDEr Score Optimization</td>
</tr>
<tr>
<td>Baseline [<xref ref-type="bibr" rid="ref-18">18</xref>] <sup>&#x002A;</sup></td>
<td>81.9</td>
<td>66.9</td>
<td>52.1</td>
<td>40.3</td>
<td>29.8</td>
<td>59.6</td>
<td>135.5</td>
<td>23.2</td>
</tr>
<tr>
<td>LSTM [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>31.9</td>
<td>25.5</td>
<td>54.3</td>
<td>106.3</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SCST [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>34.2</td>
<td>26.7</td>
<td>55.7</td>
<td>114.0</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>UpDown [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>79.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>36.3</td>
<td>27.7</td>
<td>56.9</td>
<td>120.1</td>
<td>21.4</td>
</tr>
<tr>
<td>AoANet [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>80.2</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>38.9</td>
<td>29.2</td>
<td>58.8</td>
<td>129.8</td>
<td>22.4</td>
</tr>
<tr>
<td>Pos-aware [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>80.8</td>
<td>65.1</td>
<td>50.6</td>
<td>39.3</td>
<td>29.0</td>
<td>59.2</td>
<td>128.9</td>
<td>22.8</td>
</tr>
<tr>
<td>M<sup>2</sup>-Transformer [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>80.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>39.1</td>
<td>29.2</td>
<td>58.6</td>
<td>131.2</td>
<td>22.6</td>
</tr>
<tr>
<td>X-Transformer [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>80.9</td>
<td>65.8</td>
<td>51.5</td>
<td>39.7</td>
<td>29.5</td>
<td>59.1</td>
<td>132.8</td>
<td>23.4</td>
</tr>
<tr>
<td>RSTNet [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>81.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>40.1</td>
<td>29.8</td>
<td>59.5</td>
<td>135.6</td>
<td>23.3</td>
</tr>
<tr>
<td>PureT [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>82.1</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>40.9</td>
<td><bold>30.2</bold></td>
<td>60.1</td>
<td>138.2</td>
<td>24.2</td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>E</bold></sub><bold>)</bold></td>
<td><bold>82.6</bold></td>
<td><bold>67.7</bold></td>
<td>53.2</td>
<td>41.2</td>
<td>29.9</td>
<td>60.2</td>
<td>137.8</td>
<td>24.0</td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>D</bold></sub><bold>)</bold></td>
<td><bold>82.6</bold></td>
<td><bold>67.7</bold></td>
<td><bold>53.3</bold></td>
<td><bold>41.6</bold></td>
<td>30.0</td>
<td><bold>60.3</bold></td>
<td><bold>138.3</bold></td>
<td><bold>24.3</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We adopt the visual-feature pre-training strategy of VinVL [<xref ref-type="bibr" rid="ref-18">18</xref>] on top of the standard Transformer [<xref ref-type="bibr" rid="ref-12">12</xref>] as our baseline. Consequently, our baseline achieves strong scores owing to the well pre-trained detector, which also supplies accurate position and class information for the proposed method.</p>
<p>For stability, we first present the results of a single model in <xref ref-type="table" rid="table-1">Table 1</xref>. Our models trained with XE Loss and with SCST both outperform the others. With XE Loss training, our single model with either term (OCA<sub>E</sub> or OCA<sub>D</sub>) achieves the highest scores on all metrics, improving the CIDEr score by over 1% compared with X-Transformer and AoANet. With SCST training, our models also achieve the best overall performance. Compared with the strong competitors M<sup>2</sup>-Transformer, X-Transformer and RSTNet, our two models are superior on all metrics, improving the CIDEr score by over 2%. Besides, the BLEU-4 scores of our methods reach 41.2% and 41.6%, improvements of 0.3% and 0.7% over the recent PureT, respectively. Meanwhile, our methods surpass PureT on all metrics except METEOR.</p>

<p>In addition, we report the results of an ensemble of four models trained with SCST in <xref ref-type="table" rid="table-2">Table 2</xref>. Our method again achieves excellent performance, surpassing M<sup>2</sup>-Transformer and RSTNet by more than 6% in terms of CIDEr. Furthermore, our method and PureT perform about equally, consistent with the single-model results. We also present some generated captions in <xref ref-type="table" rid="table-3">Table 3</xref> to demonstrate the performance of our approach.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>The ensemble results of four models</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>B-4</th>
<th>M</th>
<th>R</th>
<th>C</th>
<th>S</th>
</tr>
</thead>
<tbody>
<tr>
<td>SCST [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>35.4</td>
<td>27.1</td>
<td>56.6</td>
<td>117.5</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>RFNet [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>37.9</td>
<td>28.3</td>
<td>58.3</td>
<td>125.7</td>
<td>21.7</td>
</tr>
<tr>
<td>M<sup>2</sup> [<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>40.5</td>
<td>29.7</td>
<td>59.5</td>
<td>134.5</td>
<td>23.5</td>
</tr>
<tr>
<td>PureT [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>42.1</td>
<td><bold>30.4</bold></td>
<td>60.8</td>
<td><bold>141.0</bold></td>
<td>24.3</td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>E</bold></sub><bold>)</bold></td>
<td>42.3</td>
<td>30.2</td>
<td>61.0</td>
<td>140.6</td>
<td>24.2</td>
</tr>
<tr>
<td><bold>PCATNet (w/OCA</bold><sub><bold>D</bold></sub><bold>)</bold></td>
<td><bold>42.4</bold></td>
<td>30.2</td>
<td><bold>61.2</bold></td>
<td>140.9</td>
<td><bold>24.4</bold></td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Generated captions from the ground truth (GT), the baseline and our method</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th><inline-graphic xlink:href="CMC_37861-inline-1.tif"/></th>
<th><inline-graphic xlink:href="CMC_37861-inline-2.tif"/></th>
<th><inline-graphic xlink:href="CMC_37861-inline-3.tif"/></th>
<th><inline-graphic xlink:href="CMC_37861-inline-4.tif"/></th>
</tr>
</thead>
<tbody>
<tr>
<td><bold>GT:</bold> three people sit at a table holding lollipops</td>
<td><bold>GT:</bold> a wooden bench sitting on a beach next to the ocean</td>
<td><bold>GT:</bold> a man on a snowboard standing at the bottom of the mountain</td>
<td><bold>GT:</bold> several zebras are on the grass by a truck</td>
</tr>
<tr>
<td><bold>Baseline:</bold> a group of people sitting at a table with a birthday cake</td>
<td><bold>Baseline:</bold> a bench on a beach near the ocean</td>
<td><bold>Baseline:</bold> a man holding a snowboard in the snow</td>
<td><bold>Baseline:</bold> a herd of zebras standing in a field next to a car</td>
</tr>
<tr>
<td><bold>Ours (w/OCA</bold><sub><bold>E</bold></sub><bold>):</bold> a group of people sitting at a table with lollipops</td>
<td><bold>Ours (w/OCA</bold><sub><bold>E</bold></sub><bold>):</bold> a wooden bench sitting on the beach next to the water</td>
<td><bold>Ours (w/OCA</bold><sub><bold>E</bold></sub><bold>):</bold> a man standing on a snowboard in the snow</td>
<td><bold>Ours (w/OCA</bold><sub><bold>E</bold></sub><bold>):</bold> a herd of zebras standing on the side of a truck</td>
</tr>
<tr>
<td><bold>Ours (w/OCA</bold><sub><bold>D</bold></sub><bold>):</bold> three people sitting at a table</td>
<td><bold>Ours (w/OCA</bold><sub><bold>D</bold></sub><bold>):</bold> a wooden bench sitting on a beach next to the ocean</td>
<td><bold>Ours (w/OCA</bold><sub><bold>D</bold></sub><bold>):</bold> a man standing on a snowboard on the slopes</td>
<td><bold>Ours (w/OCA</bold><sub><bold>D</bold></sub><bold>):</bold> a group of zebras grazing in the grass next to a truck</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Ablative Studies</title>
<p>We conduct ablative experiments to understand the influence of each module in our model.</p>
<p><bold>Influence of GMPE</bold>. To quantify the influence of GMPE in the refined encoder, we conduct experiments with different module combinations. We adopt 6 encoder and decoder blocks and set the grid feature size <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula>. Note that we adopt the Transformer as our baseline in row 1. In <xref ref-type="table" rid="table-4">Table 4</xref>, we evaluate the performance of GMPE in three combinations: the baseline with GMPE (row 2), OCA<sub>E</sub> with GMPE (row 5) and OCA<sub>D</sub> with GMPE (row 6). As shown, adding GMPE to the baseline improves the CIDEr score by 1.4% over the pure baseline. Furthermore, GMPE increases the CIDEr score of pure OCA<sub>D</sub> from 136.7% to 138.3% and improves OCA<sub>E</sub> from 137.1% to 137.8%.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Results of the ablation studies, obtained after CIDEr score optimization</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>GMPE</th>
<th>OCA<sub>E</sub></th>
<th>OCA<sub>D</sub></th>
<th>B-4</th>
<th>R</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td>39.8</td>
<td>59.6</td>
<td>135.5</td>
</tr>
<tr>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td>40.9</td>
<td>60.0</td>
<td>136.9</td>
</tr>
<tr>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td>40.8</td>
<td>60.1</td>
<td>137.1</td>
</tr>
<tr>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x221A;</bold></td>
<td>41.0</td>
<td>59.9</td>
<td>136.7</td>
</tr>
<tr>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td>41.2</td>
<td>60.2</td>
<td>137.8</td>
</tr>
<tr>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x00D7;</bold></td>
<td><bold>&#x221A;</bold></td>
<td><bold>41.6</bold></td>
<td><bold>60.3</bold></td>
<td><bold>138.3</bold></td>
</tr>
<tr>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x221A;</bold></td>
<td><bold>&#x221A;</bold></td>
<td>40.6</td>
<td>59.7</td>
<td>136.4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Influence of OCA</bold><sub><bold>E</bold></sub> <bold>and OCA</bold><sub><bold>D</bold></sub>. To better understand the influence of OCA<sub>E</sub> and OCA<sub>D</sub> in the encoder and decoder respectively, we conduct several experiments to evaluate them. Note that the number of blocks <italic>N</italic> is set to 6 and <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> is set to <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula>. In <xref ref-type="table" rid="table-4">Table 4</xref>, we present the results of OCA<sub>E</sub> in rows 3, 5 and 7, and those of OCA<sub>D</sub> in rows 4, 6 and 7. The baseline with only OCA<sub>E</sub> or OCA<sub>D</sub> (rows 3 and 4) achieves improvements of 1.6% and 1.2% in CIDEr, respectively. Besides, OCA<sub>E</sub> and OCA<sub>D</sub> combined with GMPE (rows 5 and 6) reach CIDEr scores of 137.8% and 138.3%, improvements of 0.9% and 1.4% over the baseline with GMPE (row 2), respectively. However, when combining OCA<sub>E</sub> and OCA<sub>D</sub>, we obtain a poor result (row 7), which we attribute to an excess of specific class information that fragments the generated captions.</p>

<p><bold>Influence of the Number of Blocks</bold>. We fine-tune the number of refining encoder-decoder blocks <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mi>N</mml:mi></mml:math></inline-formula> and the size of the grid feature <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>. Note that we adopt the baseline when experimenting on <italic>N</italic>, and the baseline with GMPE when experimenting on <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>. As shown in <xref ref-type="table" rid="table-5">Table 5</xref>, the baseline improves continuously in CIDEr as <italic>N</italic> increases, and its performance stabilizes once <italic>N</italic> reaches 6. Therefore, we set <italic>N</italic> to 6 in the final configuration. The baseline with GMPE also gains significantly on all metrics and reaches peak performance when the size is <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula>. The other sizes remain effective as well; nevertheless, we do not suggest setting the size lower than <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mn>11</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>11</mml:mn></mml:math></inline-formula>, because a small grid can place too many objects in one cell and lose the advantage of GMPE.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Experiments on the number of blocks <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>N</mml:mi></mml:math></inline-formula> and the size of the grid feature <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula></title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th><inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi>N</mml:mi></mml:math></inline-formula></th>
<th>B-1</th>
<th>B-4</th>
<th>M</th>
<th>R</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>81.3</td>
<td>39.7</td>
<td>29.4</td>
<td>58.8</td>
<td>133.8</td>
</tr>
<tr>
<td>4</td>
<td>81.6</td>
<td>39.8</td>
<td>29.6</td>
<td>59.2</td>
<td>134.4</td>
</tr>
<tr>
<td>5</td>
<td>81.8</td>
<td>40.0</td>
<td>29.6</td>
<td>59.5</td>
<td>135.2</td>
</tr>
<tr>
<td>6</td>
<td>81.9</td>
<td>40.3</td>
<td>29.8</td>
<td>59.6</td>
<td>135.5</td>
</tr>
<tr>
<td>7</td>
<td>81.9</td>
<td>40.2</td>
<td>29.8</td>
<td>59.5</td>
<td>135.4</td>
</tr>
<tr>
<td><inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula></td>
<td>B-1</td>
<td>B-4</td>
<td>M</td>
<td>R</td>
<td>C</td>
</tr>
<tr>
<td><inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mn>9</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>9</mml:mn></mml:math></inline-formula></td>
<td>81.9</td>
<td>40.4</td>
<td>29.7</td>
<td>59.6</td>
<td>135.5</td>
</tr>
<tr>
<td><inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mn>11</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>11</mml:mn></mml:math></inline-formula></td>
<td>82.1</td>
<td>40.5</td>
<td>29.7</td>
<td>59.8</td>
<td>136.2</td>
</tr>
<tr>
<td><inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mn>14</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>14</mml:mn></mml:math></inline-formula></td>
<td>82.2</td>
<td>40.8</td>
<td>29.8</td>
<td>59.8</td>
<td>136.6</td>
</tr>
<tr>
<td><inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>16</mml:mn></mml:math></inline-formula></td>
<td>82.2</td>
<td>40.9</td>
<td>29.9</td>
<td>60.0</td>
<td>136.9</td>
</tr>
<tr>
<td><inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mn>18</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>18</mml:mn></mml:math></inline-formula></td>
<td>82.2</td>
<td>40.9</td>
<td>29.9</td>
<td>59.9</td>
<td>136.8</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we propose a novel Position-Class Awareness Transformer network, which embeds more information from an image, such as the spatial positions and classes of objects, to relate vision with language. To this end, we propose the GMPE and OCA modules, designed around spatial information and object classes, respectively. GMPE, a relative position encoding method for embedding spatial correlations, constructs a grid mapping feature to calculate the relative distances among objects and quantizes them into vectors. Moreover, OCA refines the encoder-decoder framework by modeling the correlation between visual features and language features with the extracted semantic information of object classes. We also associate GMPE with OCA. Experimental results demonstrate that our method significantly boosts captioning: GMPE supplies the model with spatial information, and OCA bridges visual features and language features. In particular, our method achieves excellent performance against other methods and provides a novel scheme for embedding information.</p>
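To make the GMPE idea concrete, the following is a minimal NumPy sketch of grid-based relative-distance quantization turned into an attention bias. It assumes uniform Euclidean-distance buckets and one learnable scalar bias per bucket; these choices and the helper name `relative_distance_buckets` are illustrative, not the paper's actual design or parameters.

```python
import numpy as np

def relative_distance_buckets(m, num_buckets=8):
    """For an m x m grid, map the Euclidean distance between every pair of
    grid cells to one of `num_buckets` quantized bucket indices."""
    ys, xs = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (m*m, 2) cell coords
    diff = coords[:, None, :] - coords[None, :, :]        # pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))                   # (m*m, m*m) distances
    max_dist = np.sqrt(2) * (m - 1)                       # diagonal of the grid
    # Uniformly quantize distances into bucket indices in [0, num_buckets - 1].
    return np.minimum((dist / max_dist * num_buckets).astype(int),
                      num_buckets - 1)

buckets = relative_distance_buckets(4)        # 4 x 4 grid -> (16, 16) indices
bias_table = np.random.randn(8)               # one learnable scalar per bucket
attn_bias = bias_table[buckets]               # (16, 16) bias over grid pairs
```

In a full model, a matrix like `attn_bias` would be added to the attention logits before the softmax, so that attention weights depend on how far apart two grid cells are rather than only on their content.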
<p>In the future, we will explore how to generate captions with object classes directly, and further develop relative position encoding with direction for image captioning. With the information of object classes, we will attempt to combine the generated words with object class names, which can provide more semantic information for generating the next word. Furthermore, we plan to improve the proposed GMPE with the directions among objects and the semantics of captions, capturing more interactions among objects by associating the language module.</p>
</sec>
</body>
<back>
<ack>
<p>The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work was supported by the National Key Research and Development Program of China [No. 2021YFB2206200].</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Kulkarni</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Premraj</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Dhar</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Choi</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Baby talk: Understanding and generating simple image descriptions</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Colorado Springs, CO, USA</publisher-loc>, pp. <fpage>1601</fpage>&#x2013;<lpage>1608</lpage>, <year>2011</year>. </mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Ni</surname></string-name> and <string-name><given-names>P. P.</given-names> <surname>Ren</surname></string-name></person-group>, &#x201C;<article-title>Meta captioning: A meta learning based remote sensing image captioning framework</article-title>,&#x201D; <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>, vol. <volume>186</volume>, pp. <fpage>190</fpage>&#x2013;<lpage>200</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. H.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y. M.</given-names> <surname>Fang</surname></string-name>, <string-name><given-names>G. M.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>X. W.</given-names> <surname>Zhao</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Visual cluster grounding for image captioning</article-title>,&#x201D; <source>IEEE Transactions on Image Processing</source>, vol. <volume>31</volume>, pp. <fpage>3920</fpage>&#x2013;<lpage>3934</lpage>, <year>2022</year>; <pub-id pub-id-type="pmid">35635813</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Russakovsky</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Krause</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Satheesh</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>ImageNet large scale visual recognition challenge</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>115</volume>, pp. <fpage>211</fpage>&#x2013;<lpage>252</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Srinivas</surname></string-name>, <string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shlens</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Abbeel</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Bottleneck Transformers for visual recognition</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Nashville, TN, USA</publisher-loc>, pp. <fpage>16514</fpage>&#x2013;<lpage>16524</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Dosovitskiy</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Beyer</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kolesnikov</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Weissenborn</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhai</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>An image is worth 16 &#x00D7; 16 words: Transformers for image recognition at scale</article-title>,&#x201D; <comment>arXiv preprint arXiv:2010.11929</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>39</volume>, no. <issue>6</issue>, pp. <fpage>1137</fpage>&#x2013;<lpage>1149</lpage>, <year>2017</year>; <pub-id pub-id-type="pmid">27295650</pub-id></mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Carion</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Massa</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Synnaeve</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Usunier</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kirillov</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>End-to-end object detection with transformers</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, pp. <fpage>213</fpage>&#x2013;<lpage>229</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name> and <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>Sequence to sequence learning with neural networks</article-title>,&#x201D; in <conf-name>Proc. NIPS</conf-name>, <publisher-loc>Montreal, Canada</publisher-loc>, pp. <fpage>3104</fpage>&#x2013;<lpage>3112</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>B. V.</given-names> <surname>Merrienboer</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Bahdanau</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Bougares</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Learning phrase representations using RNN Encoder-Decoder for statistical machine translation</article-title>,&#x201D; <comment>arXiv preprint arXiv:1406.1078</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>Y. G.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Recurrent fusion network for image captioning</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, <publisher-loc>Munich, Germany</publisher-loc>, pp. <fpage>510</fpage>&#x2013;<lpage>526</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <conf-name>Proc. NIPS</conf-name>, <publisher-loc>Long Beach, California, USA</publisher-loc>, pp. <fpage>6000</fpage>&#x2013;<lpage>6010</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name></person-group>, &#x201C;<article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>,&#x201D; <comment>arXiv preprint arXiv:1810.04805</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Buehler</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Teney</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Johnson</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Salt Lake City, USA</publisher-loc>, pp. <fpage>6077</fpage>&#x2013;<lpage>6086</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xiong</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name></person-group>, &#x201C;<article-title>Knowing when to look: Adaptive attention via a visual sentinel for image captioning</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Hawaii, USA</publisher-loc>, pp. <fpage>3242</fpage>&#x2013;<lpage>3250</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Nie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shao</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Hawaii, USA</publisher-loc>, pp. <fpage>6298</fpage>&#x2013;<lpage>6306</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>D. Q.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>H. W.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y. D.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Wu</surname></string-name></person-group>, &#x201C;<article-title>Context-aware visual policy network for fine-grained image captioning</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>44</volume>, no. <issue>2</issue>, pp. <fpage>710</fpage>&#x2013;<lpage>722</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>VinVL: Revisiting visual representations in vision-language models</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Nashville, TN, USA</publisher-loc>, pp. <fpage>5575</fpage>&#x2013;<lpage>5584</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>RSTNet: Captioning with adaptive attention on visual and non-visual words</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Nashville, TN, USA</publisher-loc>, pp. <fpage>15460</fpage>&#x2013;<lpage>15469</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Yao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Mei</surname></string-name></person-group>, &#x201C;<article-title>X-Linear attention networks for image captioning</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Seattle, USA</publisher-loc>, pp. <fpage>10968</fpage>&#x2013;<lpage>10977</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Nguyen</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Fernando</surname></string-name></person-group>, &#x201C;<article-title>Effective multimodal encoding for image paragraph captioning</article-title>,&#x201D; <source>IEEE Transactions on Image Processing</source>, vol. <volume>31</volume>, pp. <fpage>6381</fpage>&#x2013;<lpage>6395</lpage>, <year>2022</year>; <pub-id pub-id-type="pmid">36215365</pub-id></mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Cornia</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Stefanini</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Baraldi</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Cucchiara</surname></string-name></person-group>, &#x201C;<article-title>Meshed-memory Transformer for image captioning</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Seattle, USA</publisher-loc>, pp. <fpage>10575</fpage>&#x2013;<lpage>10584</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. W.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yi</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Sheng</surname></string-name></person-group>, &#x201C;<article-title>Attention-guided image captioning through word information</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>21</volume>, no. <issue>23</issue>, pp. <fpage>7982</fpage>, <year>2021</year>; <pub-id pub-id-type="pmid">34883986</pub-id></mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Fu</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Chao</surname></string-name></person-group>, &#x201C;<article-title>Rethinking and improving relative position encoding for vision Transformer</article-title>,&#x201D; in <conf-name>Proc. ICCV</conf-name>, <publisher-loc>Montreal, Canada</publisher-loc>, pp. <fpage>10013</fpage>&#x2013;<lpage>10021</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yin</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Hu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Oscar: Object-semantics aligned pre-training for vision-language tasks</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, pp. <fpage>121</fpage>&#x2013;<lpage>137</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Toshev</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name></person-group>, &#x201C;<article-title>Show and tell: A neural image caption generator</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Boston, USA</publisher-loc>, pp. <fpage>3156</fpage>&#x2013;<lpage>3164</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Gehring</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Auli</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Grangier</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yarats</surname></string-name> and <string-name><given-names>Y. N.</given-names> <surname>Dauphin</surname></string-name></person-group>, &#x201C;<article-title>Convolutional sequence to sequence learning</article-title>,&#x201D; in <conf-name>Proc. ICML</conf-name>, <publisher-loc>Sydney, NSW, Australia</publisher-loc>, pp. <fpage>1243</fpage>&#x2013;<lpage>1252</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Shaw</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name></person-group>, &#x201C;<article-title>Self-attention with relative position representations</article-title>,&#x201D; <comment>arXiv preprint arXiv:1803.02155</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. J.</given-names> <surname>Rennie</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Marcheret</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mroueh</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ross</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Goel</surname></string-name></person-group>, &#x201C;<article-title>Self-critical sequence training for image captioning</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>1179</fpage>&#x2013;<lpage>1195</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Vedantam</surname></string-name>, <string-name><given-names>C. L.</given-names> <surname>Zitnick</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name></person-group>, &#x201C;<article-title>CIDEr: Consensus-based image description evaluation</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Boston, USA</publisher-loc>, pp. <fpage>4566</fpage>&#x2013;<lpage>4575</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Maire</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Belongie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hays</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Perona</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Microsoft COCO: Common objects in context</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, <publisher-loc>Zurich, Switzerland</publisher-loc>, pp. <fpage>740</fpage>&#x2013;<lpage>755</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Karpathy</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Fei-Fei</surname></string-name></person-group>, &#x201C;<article-title>Deep visual-semantic alignments for generating image descriptions</article-title>,&#x201D; in <conf-name>Proc. CVPR</conf-name>, <publisher-loc>Boston, USA</publisher-loc>, pp. <fpage>3128</fpage>&#x2013;<lpage>3137</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name> and <string-name><given-names>W. J.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>BLEU: A method for automatic evaluation of machine translation</article-title>,&#x201D; in <conf-name>Proc. ACL</conf-name>, <publisher-loc>Philadelphia, Pennsylvania</publisher-loc>, pp. <fpage>311</fpage>&#x2013;<lpage>318</lpage>, <year>2002</year>. </mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Banerjee</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name></person-group>, &#x201C;<article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>,&#x201D; in <conf-name>Proc. ACL</conf-name>, <publisher-loc>Ann Arbor, Michigan, USA</publisher-loc>, pp. <fpage>228</fpage>&#x2013;<lpage>231</lpage>, <year>2005</year>. </mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Fernando</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Johnson</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Gould</surname></string-name></person-group>, &#x201C;<article-title>SPICE: Semantic propositional image caption evaluation</article-title>,&#x201D; in <conf-name>Proc. ECCV</conf-name>, <publisher-loc>Amsterdam, The Netherlands</publisher-loc>, pp. <fpage>382</fpage>&#x2013;<lpage>398</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. Y.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>ROUGE: A package for automatic evaluation of summaries</article-title>,&#x201D; in <conf-name>Proc. ACL</conf-name>, <publisher-loc>Barcelona, Spain</publisher-loc>, pp. <fpage>74</fpage>&#x2013;<lpage>81</lpage>, <year>2004</year>. </mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Shao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Yu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Objects365: A large-scale, high-quality dataset for object detection</article-title>,&#x201D; in <conf-name>Proc. ICCV</conf-name>, <conf-loc>Seoul, South Korea</conf-loc>, pp. <fpage>8429</fpage>&#x2013;<lpage>8438</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Kuznetsova</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Rom</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Alldrin</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uijlings</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Krasin</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>The open images dataset v4</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>128</volume>, pp. <fpage>1956</fpage>&#x2013;<lpage>1981</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Krishna</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Groth</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Johnson</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Hata</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Visual Genome: Connecting language and vision using crowdsourced dense image annotations</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>123</volume>, pp. <fpage>32</fpage>&#x2013;<lpage>73</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Wei</surname></string-name></person-group>, &#x201C;<article-title>Attention on attention for image captioning</article-title>,&#x201D; in <conf-name>Proc. ICCV</conf-name>, <conf-loc>Seoul, South Korea</conf-loc>, pp. <fpage>4633</fpage>&#x2013;<lpage>4642</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Duan</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y. K.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>C. T.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>Position-aware image captioning with spatial relation</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>497</volume>, pp. <fpage>28</fpage>&#x2013;<lpage>38</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>End-to-end Transformer based model for image captioning</article-title>,&#x201D; in <conf-name>Proc. AAAI</conf-name>, <comment>online, no. 8053</comment>, <year>2022</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>