<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">19328</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2022.019328</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Position-Aware Transformer for Image Captioning</article-title>
<alt-title alt-title-type="left-running-head">A Position-Aware Transformer for Image Captioning</alt-title>
<alt-title alt-title-type="right-running-head">A Position-Aware Transformer for Image Captioning</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes"><name name-style="western"><surname>Deng</surname><given-names>Zelin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>zl_deng@sina.com</email>
</contrib>
<contrib id="author-2" contrib-type="author"><name name-style="western"><surname>Zhou</surname><given-names>Bo</given-names></name><xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-3" contrib-type="author"><name name-style="western"><surname>He</surname><given-names>Pei</given-names></name><xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<contrib id="author-4" contrib-type="author"><name name-style="western"><surname>Huang</surname><given-names>Jianfeng</given-names></name><xref ref-type="aff" rid="aff-3">3</xref>
</contrib>
<contrib id="author-5" contrib-type="author"><name name-style="western"><surname>Alfarraj</surname><given-names>Osama</given-names></name><xref ref-type="aff" rid="aff-4">4</xref>
</contrib>
<contrib id="author-6" contrib-type="author"><name name-style="western"><surname>Tolba</surname><given-names>Amr</given-names></name><xref ref-type="aff" rid="aff-4">4</xref><xref ref-type="aff" rid="aff-5">5</xref>
</contrib>
<aff id="aff-1"><label>1</label><institution>School of Computer and Communication Engineering, Changsha University of Science and Technology</institution>, <addr-line>Changsha, 410114</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Computer Science and Cyber Engineering, Guangzhou University</institution>, <addr-line>Guangzhou, 510006</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Advanced Forming Research Centre, University of Strathclyde</institution>, <addr-line>Renfrewshire, PA4 9LJ, Glasgow</addr-line>, <country>United Kingdom</country></aff>
<aff id="aff-4"><label>4</label><institution>Department of Computer Science, Community College, King Saud University</institution>, <addr-line>Riyadh, 11437</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-5"><label>5</label><institution>Department of Mathematics and Computer Science, Faculty of Science, Menoufia University</institution>, <country>Egypt</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Zelin Deng. Email: <email>zl_deng@sina.com</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-08-30"><day>30</day><month>08</month><year>2021</year>
</pub-date>
<volume>70</volume>
<issue>1</issue>
<fpage>2065</fpage>
<lpage>2081</lpage>
<history>
<date date-type="received"><day>10</day><month>4</month><year>2021</year>
</date>
<date date-type="accepted"><day>16</day><month>6</month><year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Deng et al.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Deng et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_19328.pdf"></self-uri>
<abstract>
<p>Image captioning aims to generate a natural language description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features only, ignoring the effects of other image features and giving insufficient consideration to the relative positions between image features. To address these problems, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms. The image-feature attention first extracts multi-level features using a Feature Pyramid Network (FPN) and then fuses them with scaled dot-product attention, which enables our model to detect objects of different scales in the image more effectively without increasing the number of parameters. The position-aware attention first obtains the relative positions between image features and then incorporates them into the original image features to generate captions more accurately. Experiments on the MSCOCO dataset show that our approach achieves competitive BLEU-4, METEOR, ROUGE-L, and CIDEr scores compared with several state-of-the-art approaches, demonstrating its effectiveness.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Deep learning</kwd>
<kwd>image captioning</kwd>
<kwd>transformer</kwd>
<kwd>attention</kwd>
<kwd>position-aware</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Image captioning [<xref ref-type="bibr" rid="ref-1">1</xref>] aims to describe the visual content of an image in natural language. It is a sequence-to-sequence problem and can be viewed as translating an image into its corresponding descriptive sentence. A captioning model therefore needs not only to identify the objects, actions, and scenes in the image, but also to capture the relationships among these elements and express them in a well-formed sentence. This process mimics the remarkable human ability to convert large amounts of visual information into descriptive semantic information.</p>
<p>Earlier captioning approaches [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>] used simple templates together with two auxiliary modules, an object detector and an attribute detector. The two detectors filled in the blanks of the templates to generate a complete sentence. Following the great successes achieved by deep neural networks [<xref ref-type="bibr" rid="ref-4">4</xref>] in computer vision [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>] and natural language processing [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>], a broad collection of image captioning methods has been proposed [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>]. Based on the neural encoder-decoder framework [<xref ref-type="bibr" rid="ref-1">1</xref>], these methods use a Convolutional Neural Network (CNN) [<xref ref-type="bibr" rid="ref-4">4</xref>] to encode the input image into image features. Subsequently, a Recurrent Neural Network (RNN) [<xref ref-type="bibr" rid="ref-11">11</xref>] is applied to decode these features word by word into a natural language description of the image.</p>
<p>However, plain encoder-decoder models have two major drawbacks: (1) the image representation does not change during the caption generation process; (2) the decoder processes the image representation from a global view, rather than focusing on local aspects related to parts of the description. Visual attention mechanisms [<xref ref-type="bibr" rid="ref-12">12</xref>&#x2013;<xref ref-type="bibr" rid="ref-15">15</xref>] address these problems by dynamically attending to the parts of the image features that are relevant to the semantic context of the current partially-completed caption.</p>
<p>RNN-based captioning models have become the dominant approaches in recent years, but the recurrent structure of the RNN makes these models suffer from vanishing or exploding gradients as sentences grow longer and precludes parallelization within training examples. Recently, Vaswani et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] showed that the transformer performs very well on machine translation and other sequence-to-sequence problems. It is based on the self-attention mechanism and enables models to be trained in parallel by excluding recurrent structures.</p>
<p>Human-like and descriptive captions require the model to describe the primary objects in the image and also present their relations in a fluent style. However, image features obtained by a CNN commonly correspond to a uniform grid of equally-sized image regions, and each feature only contains information about its own region, irrespective of its position relative to any other feature. Thus, it is hard to express spatial relations accurately. Furthermore, these image features are mainly visual features extracted from a global view of the image, and contain only a small amount of the local visual features that are crucial for detecting small objects. Such limitations of image features keep the model from producing more human-like captions.</p>
<p>To obtain captions of higher quality, a Position-aware Transformer model for image captioning is proposed. The contributions of this work are as follows: (1) to enable the model to detect objects of different scales in the image without increasing the number of parameters, the image-feature attention is proposed, which uses scaled dot-product attention to fuse multi-level features within an image feature pyramid; (2) to generate more human-like captions, the position-aware attention is proposed to learn relative positions between image features, so that the features can be interpreted in terms of spatial relationships.</p>
<p>The rest of this paper is organized as follows. In Section 2, previous key works on image captioning and the transformer architecture are briefly introduced. In Section 3, the overall architecture and the details of our approach are introduced. In Section 4, the experimental results on the COCO dataset are reported and analyzed. Section 5 concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<sec id="s2_1">
<label>2.1</label>
<title>Image Captioning and Attention Mechanism</title>
<p>Image captioning is the task of generating a descriptive sentence for an image. It requires an algorithm to understand and model the relations between visual and textual elements. With the development of deep learning, a variety of methods based on deep neural networks have been proposed. Vinyals et al. [<xref ref-type="bibr" rid="ref-1">1</xref>] first proposed an encoder-decoder framework, which used a CNN as the encoder and an RNN as the decoder. However, the input of the RNN was a fixed representation of the image, and this representation was generally analyzed from an overall perspective, thus leading to a mismatch between the context of visual information and the context of semantic information.</p>
<p>To solve the above problems, Xu et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] introduced the attention mechanism for image captioning, which guided the model to different salient regions of the image dynamically at each step, instead of feeding all image features to the decoder at the initial step. Building on Xu&#x2019;s work, many improvements to attention mechanisms have been developed. Chen et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed spatial and channel-wise attention, in which the attention mechanism calculated where (spatial locations at multiple layers) and what (channels) the visual attention was. Anderson et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a combined bottom-up and top-down visual attention mechanism. The bottom-up mechanism chose a set of salient image regions through object detection, and the top-down mechanism used task-specific context to predict the attention distribution over the chosen image regions. Lu et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed adaptive attention by adding a visual sentinel, which determines when to attend to the image and when to attend to the visual sentinel.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Transformer and Self-Attention Mechanism</title>
<p>Recurrent models have some limitations on parallel computation and have gradient-vanishing or gradient-exploding problems when trained with long sentences. Vaswani et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed the transformer architecture and achieved state-of-the-art results for machine translation. Experimental results showed that the transformer was superior in quality while being more parallelizable and requiring significantly less time to be trained. Recently, the work in [<xref ref-type="bibr" rid="ref-17">17</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>] applied the transformer to the task of image captioning and improved the model performance. Without recurrence, the transformer uses the self-attention mechanism to compute the relation of two arbitrary elements of a single input, and outputs a contextualized representation of this input, avoiding the vanishing or exploding gradients and accelerating the training process.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Relative Position Information</title>
<p>Most attention mechanisms for image captioning attend to CNN features at each step [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>], but CNN features do not contain relative position information. This makes relative position information unavailable during the caption generation process. Moreover, not every word in a caption has a corresponding CNN feature. Consider <?A3B2 "fig1",5,"anchor"?><xref ref-type="fig" rid="fig-1">Fig. 1a</xref> and its ground truth caption &#x201C;A brown toy horse stands on a red chair&#x201D;. The words &#x201C;stands&#x201D; and &#x201C;on&#x201D; do not have corresponding CNN features, but can be determined from the relative position information between CNN features (see <xref ref-type="fig" rid="fig-1">Fig. 1b</xref>). Therefore, we developed the position-aware attention to learn relative position information during training.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>The Proposed Approach</title>
<p>To generate more reasonable captions, a Position-aware Transformer model is proposed to make full use of the relative position information. It contains two components: the image encoder and the caption decoder. As shown in <?A3B2 "fig2",5,"anchor"?><xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the combination of the Feature Pyramid Network (FPN) [<xref ref-type="bibr" rid="ref-19">19</xref>], the image-feature attention, and the position-aware attention serves as the encoder to obtain visual features. The decoder is the original transformer decoder. Given an image, the FPN is first leveraged to obtain two kinds of image features: high-level visual features, which contain the global semantics of the image, and low-level visual features, which capture local details of the image [<xref ref-type="bibr" rid="ref-19">19</xref>]. These two kinds of features are fed into the image-feature attention and the position-aware attention to obtain fused features containing relative position information. Finally, the transformer decoder takes the fused features and either the start token &#x003C;BOS&#x003E; or the partially-completed sentence as input, and outputs the probability of each word in the dictionary being the next word of the sentence.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Original image and relative position (a) Original image (b) Red arrows represent relative position information</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-1.png"/>
</fig>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Overall structure of our proposed approach</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-2.png"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Image-Feature Attention for Feature Fusion</title>
<p>The input of image captioning is an image. Traditional methods use a CNN pre-trained on the image classification task as the feature extractor and mostly adopt the final convolutional feature map as the image representation. However, not all objects in the image have corresponding features in this representation, particularly small objects, as shown in <?A3B2 "fig3",5,"anchor"?><xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Original image and its features (a) Original image (b) The first-level feature (c) The second-level feature (d) The third-level feature (e) The fourth-level feature</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-3.png"/>
</fig>
<p><xref ref-type="fig" rid="fig-3">Fig. 3a</xref> is the original image, and the other panels show image features whose semantics range from low-level to high-level. The lower the level of a feature, the more information it contains and the weaker its semantics. Weak semantics make it harder for the model to grasp the topic of the image, whereas less information makes it harder to capture the local details of the image. As a result, choosing a single optimal level of image features inevitably involves a trade-off. To recognize objects at different scales, we use the FPN model to construct a feature pyramid. Features in the pyramid combine low-resolution, semantically strong features with high-resolution, semantically weak features <italic>via</italic> a top-down pathway and lateral connections. In this work, the feature pyramid has four feature maps in total. The first two are high-level features and the rest are low-level features.</p>
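<p>As a rough illustration of the top-down pathway and lateral connections mentioned above, the following NumPy sketch merges four backbone feature maps into a four-level pyramid. It is a minimal sketch under stated assumptions rather than the implementation used in this work: the channel counts, spatial sizes, random weights, and the function names <monospace>lateral</monospace>, <monospace>upsample2x</monospace>, and <monospace>build_pyramid</monospace> are hypothetical, the lateral connection is modelled as a 1&#x00D7;1 convolution, and the smoothing convolution usually applied after merging is omitted.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def lateral(feature, weight):
    """1x1 convolution written as a per-location linear map.

    feature: (H, W, C_in), weight: (C_in, C_out) -> (H, W, C_out)
    """
    return feature @ weight


def upsample2x(feature):
    """Nearest-neighbour upsampling by a factor of 2 along height and width."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)


def build_pyramid(backbone_maps, lateral_weights):
    """Merge backbone maps top-down.

    backbone_maps is ordered from the highest resolution (weak semantics)
    to the lowest resolution (strong semantics).
    """
    # The coarsest level only receives its lateral projection.
    pyramid = [lateral(backbone_maps[-1], lateral_weights[-1])]
    # Walk from the second-coarsest map down to the finest one.
    for fmap, weight in zip(backbone_maps[-2::-1], lateral_weights[-2::-1]):
        merged = lateral(fmap, weight) + upsample2x(pyramid[0])
        pyramid.insert(0, merged)
    return pyramid


# Illustrative (assumed) sizes: four levels, each half the size of the previous.
rng = np.random.default_rng(0)
channels = [8, 16, 32, 64]
sizes = [32, 16, 8, 4]
d_model = 12

backbone_maps = [rng.normal(size=(s, s, c)) for s, c in zip(sizes, channels)]
lateral_weights = [rng.normal(size=(c, d_model)) for c in channels]

pyramid = build_pyramid(backbone_maps, lateral_weights)
pyramid_shapes = [p.shape for p in pyramid]
print(pyramid_shapes)
```
]]></preformat>
<p>After merging, every level shares the same channel dimension, which is what allows features from different levels to be compared and fused later.</p>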
<p>Making predictions on every level of the feature pyramid has notable limitations; in particular, the inference time increases considerably, which makes this approach impractical for real applications. Moreover, training deep networks end-to-end on all levels is infeasible in terms of memory. To build an effective and lightweight model, we choose one feature map from the low-level features and one from the high-level features: <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>v</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the hidden dimension of the model. Because the low-level features are still too large to use directly (<italic>e.g.</italic>, four times larger than the high-level features in spatial size), the image-feature attention is used to fuse the two features, as shown in <?A3B2 "fig4",5,"anchor"?><xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The structure of image-feature attention</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-4.png"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, the image-feature attention takes <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> as input and firstly uses <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref> to calculate the relevance-coefficients matrix C between elements in <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p><disp-formula id="eqn-1">
<label>(1)</label>
<mml:math id="mml-eqn-1" display="block"><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:msqrt></mml:mfrac></mml:math>
</disp-formula></p>
<p>The relevance-coefficients matrix <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>C</mml:mi></mml:math></inline-formula> is then used to compute attention weights <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow></mml:math></inline-formula> according to <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.</p>
<p><disp-formula id="eqn-2">
<label>(2)</label>
<mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>C</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>Finally, the attention weights <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow></mml:math></inline-formula> are applied to calculate a weighted sum of <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and the fused feature <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is computed by <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>.</p>
<p><disp-formula id="eqn-3">
<label>(3)</label>
<mml:math id="mml-eqn-3" display="block"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">T</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the hidden dimension of the model, and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are parameter matrices learned during training.</p>
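<p>The three steps of Eqs. (1)&#x2013;(3) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors&#x2019; implementation: features are stored as rows, so the high-level and low-level inputs have shapes (n, d_model) and (m, d_model); the softmax is assumed to normalize over the low-level positions; the weighted sum is written as the weights multiplied by the low-level features, which is the row-convention reading of Eq. (3); and the value projection is left out because it does not appear in Eq. (3). The sizes, random weights, and the function name <monospace>image_feature_attention</monospace> are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def image_feature_attention(v_high, v_low, w_q, w_k):
    """Fuse low-level features into high-level features.

    v_high: (n, d_model) queries, v_low: (m, d_model) keys and values.
    Returns the fused features with shape (n, d_model).
    """
    d_model = v_high.shape[-1]
    # Eq. (1): relevance coefficients between high- and low-level features.
    coeff = (v_high @ w_q) @ (v_low @ w_k).T / np.sqrt(d_model)  # (n, m)
    # Eq. (2): attention weights, assumed to sum to 1 over low-level positions.
    weights = softmax(coeff, axis=-1)
    # Eq. (3): weighted sum of low-level features plus a residual connection.
    return weights @ v_low + v_high


# Illustrative (assumed) sizes: the low-level map has 4x as many positions.
rng = np.random.default_rng(0)
n, m, d_model = 49, 196, 16
v_high = rng.normal(size=(n, d_model))
v_low = rng.normal(size=(m, d_model))
w_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
w_k = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

v_fused = image_feature_attention(v_high, v_low, w_q, w_k)
print(v_fused.shape)
```
]]></preformat>
<p>Note that the fused output has the spatial size of the high-level features, so the subsequent layers never operate on the larger low-level map directly.</p>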
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Position-Aware Attention</title>
<p>Recurrent neural networks capture relative positions between input elements directly through their recurrent structure. However, the transformer abandons the recurrent structure in favor of self-attention, and CNN features do not contain relative position information. As mentioned earlier, relative position information helps produce accurate expressions, so introducing it explicitly is an important step. For machine translation, the transformer injects position information into the model through a fixed sinusoidal position encoding. However, sinusoidal position encoding might not work well for image captioning, because images and sentences describe things in very different ways: images mainly contain visual information, whereas sentences mainly contain semantic information. In this work, rather than using a hand-crafted function as the transformer does, the position-aware attention is proposed to learn relative position information during training.</p>
<p>Because an image is split into a uniform grid of equally-sized regions from the perspective of image features, we model the image features as a regular directed graph (see <?A3B2 "fig5",5,"anchor"?><xref ref-type="fig" rid="fig-5">Fig. 5</xref>). Each vertex (a blue block in the figure) stands for the feature of a certain image region, and each directed edge (a red arrow) denotes the relative position between two vertices. Note that all edges in this graph are directed, because the relative position from feature A to feature B differs from the relative position from feature B to feature A.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>The directed graph model of image features</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-5.png"/>
</fig>
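<p>To make the directed-graph view concrete, the sketch below enumerates the relative position between every ordered pair of cells in a small grid and assigns one integer label per distinct offset. The way edges are labelled here is purely an assumption for illustration, since the parametrization of the edges is not specified at this point; the function names <monospace>relative_offsets</monospace> and <monospace>edge_ids</monospace> and the 3&#x00D7;3 grid are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def relative_offsets(height, width):
    """Return offsets[i, j] = (row, col) displacement from cell i to cell j."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=-1)  # (n, 2)
    return coords[None, :, :] - coords[:, None, :]            # (n, n, 2)


def edge_ids(height, width):
    """Map every directed edge to an integer label, one per distinct offset."""
    offsets = relative_offsets(height, width)
    d_row = offsets[..., 0] + (height - 1)   # shift to the range 0..2*(height-1)
    d_col = offsets[..., 1] + (width - 1)    # shift to the range 0..2*(width-1)
    return d_row * (2 * width - 1) + d_col


offsets = relative_offsets(3, 3)
ids = edge_ids(3, 3)

# The edge from A to B is the negation of the edge from B to A,
# so the two directions receive different labels (unless A == B).
print(offsets[0, 8], offsets[8, 0])
print(ids[0, 8], ids[8, 0])
```
]]></preformat>
<p>Under this assumed labelling, an integer label could index a table of learnable edge embeddings, so that pairs of regions with the same displacement share the same representation.</p>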
<p>The position-aware attention takes two inputs: <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> and an edge matrix <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>E</mml:mi></mml:math></inline-formula>, in which each element <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the edge from vertex <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to vertex <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. We use <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref> to calculate the relevance coefficients between the elements of <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p><disp-formula id="eqn-4">
<label>(4)</label>
<mml:math id="mml-eqn-4" display="block"><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi>E</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:msqrt></mml:mfrac></mml:math>
</disp-formula></p>
<p>We then obtain a new representation of <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> by incorporating relative position information according to <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>.</p>
<p><disp-formula id="eqn-5">
<label>(5)</label>
<mml:math id="mml-eqn-5" display="block"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi>E</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">x</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>C</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">f</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">d</mml:mi></mml:mrow></mml:mrow></mml:msup></mml:math>
</disp-formula></p>
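<p>As an illustration of <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref> and <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>, the following NumPy sketch computes the relevance coefficients and the updated features for a single attention head. It is only a sketch under stated assumptions, not the implementation used in the experiments: it assumes that E is stored as an N &#x00D7; N &#x00D7; d tensor whose entry E[i, j] is the edge vector from vertex i to vertex j, that this vector is added to the key and to the value of vertex j when vertex i attends to it, and that the softmax is taken over j. The names position_aware_attention, W_q and W_k are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def position_aware_attention(V, E, W_q, W_k):
    """Single-head sketch of Eqs. (4) and (5).

    V   : (N, d) fused image features, one row per vertex.
    E   : (N, N, d) edge tensor (assumed layout), E[i, j] = edge from i to j.
    W_q : (d, d) query projection (hypothetical name).
    W_k : (d, d) key projection (hypothetical name).
    Returns the updated features (N, d) and the attention weights (N, N).
    """
    d = V.shape[1]
    Q = V @ W_q                          # (N, d)
    K = V @ W_k                          # (N, d)
    keys = K[None, :, :] + E             # keys[i, j] = K[j] + E[i, j]
    C = np.einsum("id,ijd->ij", Q, keys) / np.sqrt(d)    # Eq. (4)
    A = softmax(C, axis=-1)              # normalise over j
    values = V[None, :, :] + E           # values[i, j] = V[j] + E[i, j]
    out = np.einsum("ij,ijd->id", A, values) + V         # Eq. (5), with residual
    return out, A
```
]]></preformat>
<p>With E set to zero, this sketch reduces to ordinary scaled dot-product self-attention with a residual connection, which is a convenient sanity check for the assumed tensor layout.</p>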
<p>Given a feature map of size <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, the directed graph model has <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>m</mml:mi><mml:mi>n</mml:mi></mml:math></inline-formula> vertices, and each vertex has an edge to every other vertex, so the position-aware attention has to maintain <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> edges, most of which are redundant because objects are usually located sparsely in the image. Moreover, maintaining edges with space complexity <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> significantly increases the number of parameters to be trained.</p>
<p>To reduce the space complexity, the locations of two vertices in the horizontal and vertical directions are used to construct the relative position between them. As shown in <?A3B2 "fig6",5,"anchor"?><xref ref-type="fig" rid="fig-6">Fig. 6</xref>, the vertices are placed in a Cartesian coordinate system, and each vertex has a unique coordinate.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Using differences in horizontal and vertical directions to construct the relative positions</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-6.png"/>
</fig>
<table-wrap id="table-6">
<label>Algorithm 1</label>
<caption>
<title>Calculate Edge Matrix <bold>E</bold><sub><italic>mn</italic></sub> for each element in m &#x00D7; n size feature map</title>
</caption>
<table>
<colgroup>
<col/>
</colgroup>
<tbody>
<tr>
<td><inline-graphic xlink:href="CMC_19328-inline-1.png"/></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Instead of using the edge that directly connects two vertices (the dashed line in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>), the coordinates of these two vertices are utilized to compute the edge. For example, if <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> has coordinate <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mrow><mml:mo>(</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> has coordinate <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mrow><mml:mo>(</mml:mo><mml:mn>4</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>3</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, their distance (from <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) is <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:math></inline-formula> in the horizontal direction and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mn>3</mml:mn></mml:math></inline-formula> in the vertical direction, and their relative position (from <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is represented by <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. In practice, to keep the computation compact, we use <bold>Algorithm 1</bold> to obtain the edge matrix <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>E</mml:mi></mml:math></inline-formula> for each element.</p>
<p>In this way, the model needs to store only two kinds of edges: <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msup><mml:mi>E</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msup><mml:mi>E</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, for a total of <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mn>2</mml:mn><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> edges. For a feature map of size <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>, computing each edge from the coordinates of its two vertices reduces the space complexity of storing edges from <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>m</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> to <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
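<p>The construction of the full edge tensor from the two small tables can be sketched as follows. Since Algorithm 1 is only available as an image, this is one possible vectorised reading of it rather than a transcription: the function name build_edge_matrix, the row-major numbering of vertices, and the convention that the offset from vertex i to vertex j is the coordinate of i minus the coordinate of j (chosen to match the worked example above) are all assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def build_edge_matrix(E_row, E_col, m, n):
    """Assemble E[i, j] = E_row[vertical offset] + E_col[horizontal offset].

    E_row : (2m - 1, d) table, entry k holds the vector for offset k - (m - 1).
    E_col : (2n - 1, d) table, entry k holds the vector for offset k - (n - 1).
    Returns an array of shape (m * n, m * n, d).
    Vertices are numbered row by row over the m x n grid (assumed layout).
    """
    rows, cols = np.divmod(np.arange(m * n), n)   # grid coordinates of each vertex
    d_row = rows[:, None] - rows[None, :]         # values in [-(m - 1), m - 1]
    d_col = cols[:, None] - cols[None, :]         # values in [-(n - 1), n - 1]
    # Shift offsets so that they become valid non-negative table indices.
    return E_row[d_row + (m - 1)] + E_col[d_col + (n - 1)]
```
]]></preformat>
<p>Only the two tables need to be trained and stored; the dense tensor can be rebuilt on the fly whenever the attention is evaluated, and any two vertex pairs with the same offsets share the same edge vector.</p>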
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Results and Analysis</title>
<sec id="s4_1">
<label>4.1</label>
<title>Metrics</title>
<p>Our captioning model was evaluated with several metrics, including BLEU [<xref ref-type="bibr" rid="ref-20">20</xref>], CIDEr [<xref ref-type="bibr" rid="ref-21">21</xref>], METEOR [<xref ref-type="bibr" rid="ref-22">22</xref>], and SPICE [<xref ref-type="bibr" rid="ref-23">23</xref>]. These metrics focus on different aspects of the generated captions, and each yields a scalar score. BLEU is a precision-based metric traditionally used in machine translation; here it measures the similarity between the generated captions and the ground-truth captions. CIDEr measures consensus between a generated caption and the ground-truth captions by applying Term Frequency-Inverse Document Frequency weighting to each n-gram. METEOR is based on explicit word-to-word matches between the generated captions and the ground-truth captions. SPICE is a semantic-based metric that measures how well caption models recover the objects, attributes and relations present in the ground-truth captions.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Loss Functions</title>
<p>Given the ground truth sentence <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> and its corresponding image <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>I</mml:mi></mml:math></inline-formula>, the sentence <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> was split into two parts <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">g</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi 
mathvariant="italic">g</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. The model was trained by minimizing the following cross-entropy loss:</p>
<p><disp-formula id="eqn-6">
<label>(6)</label>
<mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">s</mml:mi></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">n</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">p</mml:mi><mml:mi mathvariant="italic">y</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">g</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">g</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B8;</mml:mi><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>&#x03B8;</mml:mi></mml:math></inline-formula> denotes the parameters of the model. At the training stage, the model was trained to generate the next ground-truth word given the previous ground-truth words, while during the testing phase, the model used the words it had previously generated from the model distribution to predict the next word. This mismatch resulted in error accumulation during generation at test time, because the model had never been exposed to its own predictions. To make a fair comparison with recent works [<xref ref-type="bibr" rid="ref-24">24</xref>], the model was first trained with standard cross-entropy loss for 15 epochs. After that, the pre-trained model continued to adjust its parameters under the Reinforcement Learning (RL) method described in [<xref ref-type="bibr" rid="ref-24">24</xref>] for another 15 epochs.</p>
<p>This method alleviates the mismatch between training and testing by minimizing the negative expected reward:</p>
<p><disp-formula id="eqn-7">
<label>(7)</label>
<mml:math id="mml-eqn-7" display="block"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03B8;</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:msup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mi>p</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mi>r</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>&#x03C9;</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> was the generated sentence and <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>r</mml:mi></mml:math></inline-formula> was the CIDEr score of the generated sentence.</p>
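<p>The teacher-forcing split and the cross-entropy loss of <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref> can be illustrated with a short NumPy sketch. It assumes that the model returns one row of log-probabilities per input position and relies on the chain rule, under which the log-probability of the whole target sequence is the sum of the per-word log-probabilities. The function names are hypothetical, and the reward-based loss of <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref> is not covered here.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def split_teacher_forcing(s_gt):
    """Return (S_target, S_target_y) = (S_gt[0:-1], S_gt[1:]).

    S_target is fed to the decoder; S_target_y holds the words to predict.
    """
    return s_gt[:-1], s_gt[1:]


def cross_entropy_loss(log_probs, target_y):
    """Negative log-likelihood of the target sequence, as in Eq. (6).

    log_probs : (T, vocab) array, row t = log p(word | previous words, image).
    target_y  : length-T sequence of word indices.
    """
    target_y = np.asarray(target_y)
    steps = np.arange(len(target_y))
    # Pick the log-probability assigned to each ground-truth word and sum.
    return -log_probs[steps, target_y].sum()
```
]]></preformat>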
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Dataset</title>
<p>The MSCOCO2014 dataset [<xref ref-type="bibr" rid="ref-25">25</xref>], one of the most popular datasets for image captioning, was used to evaluate the proposed model. This dataset contains 123,287 images in total (82,783 training images and 40,504 validation images), and each image has five different captions. To compare our experimental results with other methods precisely, the widely used &#x201C;Karpathy&#x201D; split [<xref ref-type="bibr" rid="ref-26">26</xref>] was adopted for the MSCOCO2014 dataset. This split has 113,287 images for training, 5,000 images for validation and 5,000 images for testing. The performance of the model was measured on the testing set.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Data Preprocessing</title>
<p>The images were normalized with mean &#x003D; [0.485, 0.456, 0.406] and std &#x003D; [0.229, 0.224, 0.225], and captions with a length greater than 16 were truncated. Subsequently, a vocabulary was built from three tokens, &#x003C;BOS&#x003E;, &#x003C;EOS&#x003E; and &#x003C;UNK&#x003E;, together with the words that occurred at least 5 times in the preprocessed captions. The token &#x003C;UNK&#x003E; represented words appearing fewer than 5 times, and the tokens &#x003C;BOS&#x003E; and &#x003C;EOS&#x003E; indicated the start and the end of a sentence. Finally, the captions were vectorized by the indices of words and tokens in the vocabulary. During the training process, for convenient conversion between words and indices, two maps <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>w</mml:mi><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:math></inline-formula> were maintained: <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>w</mml:mi><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:math></inline-formula> maps a word or token to its corresponding index, and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:math></inline-formula> maps an index to the word or token.</p>
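<p>The caption side of this preprocessing can be sketched in a few lines of Python. The sketch assumes that captions are already tokenized into lists of words, that truncation is applied before counting, and that the start and end tokens are added when a caption is vectorized; these details, as well as the function names build_vocab and vectorize, are illustrative assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

SPECIAL_TOKENS = ["<BOS>", "<EOS>", "<UNK>"]


def build_vocab(captions, min_count=5, max_len=16):
    """Build the wtoi / itow maps from tokenized captions.

    captions : list of word lists.
    Words seen fewer than min_count times are left out and later map to <UNK>.
    """
    clipped = [caption[:max_len] for caption in captions]
    counts = Counter(word for caption in clipped for word in caption)
    words = sorted(w for w, k in counts.items() if k >= min_count)
    itow = SPECIAL_TOKENS + words                 # index -> word or token
    wtoi = {w: i for i, w in enumerate(itow)}     # word or token -> index
    return wtoi, itow


def vectorize(caption, wtoi, max_len=16):
    """Turn one tokenized caption into a list of vocabulary indices."""
    unk = wtoi["<UNK>"]
    body = [wtoi.get(word, unk) for word in caption[:max_len]]
    return [wtoi["<BOS>"]] + body + [wtoi["<EOS>"]]
```
]]></preformat>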
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Inference</title>
<p>Inference was similar to that of RNN-based models: words were generated one at a time. First, the model began with the sequence <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> that contained only the start token &#x003C;BOS&#x003E;, and obtained the probability distribution over the vocabulary <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x223C;</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2223;</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B8;</mml:mi><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> in the first iteration. Afterwards, a decoding method such as greedy search or beam search was used to generate the first word <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. Then, <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> was fed back into the model to generate the next word <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. This process continued until the end token &#x003C;EOS&#x003E; was generated or the maximum length L was reached.</p>
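<p>The decoding loop can be sketched for the greedy case as follows. The callable next_token_probs is a hypothetical stand-in for one forward pass of the captioning model on a fixed image, and the function name greedy_decode is likewise an assumption; beam search keeps several candidate sequences per step instead of one and is not shown.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def greedy_decode(next_token_probs, bos, eos, max_len):
    """Generate word indices one at a time, always taking the most likely word.

    next_token_probs : callable taking the current index sequence and returning
                       a 1-D array of probabilities over the vocabulary
                       (hypothetical interface to the model).
    bos, eos         : indices of the start and end tokens.
    max_len          : maximum number of words to generate.
    """
    seq = [bos]
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = int(np.argmax(probs))    # greedy choice
        seq.append(token)                # feed the word back in the next step
        if token == eos:
            break
    return seq[1:]                       # drop the start token
```
]]></preformat>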
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Implementation Details</title>
<p>An FPN from a pretrained instance segmentation model [<xref ref-type="bibr" rid="ref-27">27</xref>] was used to produce features at five levels. Experiments were carried out based on the second and the fourth features. The spatial size of the second feature was set to 14 &#x00D7; 14 and that of the other was set to 28 &#x00D7; 28 <italic>via</italic> adaptive average pooling. We did not fine-tune the feature extractor; thus, the parameters producing the two features were fixed throughout the training process.</p>
<p>In <?A3B2 "tbl1",5,"anchor"?><xref ref-type="table" rid="table-1">Tab. 1</xref>, the hyperparameter settings of the position-aware transformer model trained with standard cross-entropy loss are presented.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Hyperparameter settings of the model</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><italic>epochs</italic></td>
<td>15</td>
</tr>
<tr>
<td><italic>learning_rate</italic></td>
<td>0.0005</td>
</tr>
<tr>
<td><italic>label_smoothing</italic></td>
<td>0.1</td>
</tr>
<tr>
<td><italic>warmup_steps</italic></td>
<td>20000</td>
</tr>
<tr>
<td><italic>adam</italic> <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>0.9</td>
</tr>
<tr>
<td><italic>adam</italic> <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>0.98</td>
</tr>
<tr>
<td><italic>sample_method</italic></td>
<td><italic>beam_search</italic></td>
</tr>
<tr>
<td><italic>beam_size</italic></td>
<td>3</td>
</tr>
<tr>
<td><inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>256</td>
</tr>
<tr>
<td><italic>num_head</italic></td>
<td>4</td>
</tr>
<tr>
<td><inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>64</td>
</tr>
<tr>
<td><inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">p</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>0.1</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For our model trained with standard cross-entropy loss, we used 6 attention layers, <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 256, 4 attention heads, <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 64, 1024 feed forward inner-layer dimensions, and <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">p</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">u</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.1. This model was trained for 15 epochs, each epoch had 12000 iterations and the batch size was 10. 
The initial learning rate of the model was <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mn>5</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, the warmup strategy with <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mrow><mml:mi mathvariant="italic">w</mml:mi><mml:mi mathvariant="italic">a</mml:mi><mml:mi mathvariant="italic">r</mml:mi><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">u</mml:mi></mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">p</mml:mi><mml:mi mathvariant="italic">s</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 20000 was used to speed up the training and the same weight decay strategy as in [<xref ref-type="bibr" rid="ref-16">16</xref>] was adopted for learning rate adjustment. The Adam optimizer [<xref ref-type="bibr" rid="ref-28">28</xref>] with <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.9, <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.98, and <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>&#x03F5;</mml:mi></mml:math></inline-formula> &#x003D; <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>9</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> was used to update parameters of our model. 
During training, we employed label smoothing of value <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>l</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">s</mml:mi><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">h</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">n</mml:mi><mml:mi mathvariant="italic">g</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> &#x003D; 0.1 [<xref ref-type="bibr" rid="ref-29">29</xref>]. At the inference stage, the beam search method with a beam size of 3 was chosen for better caption generation. The PyTorch framework was adopted to implement our model for image captioning.</p>
<p>The model trained with CIDEr optimization was initialized from the cross-entropy-trained model and trained for another 15 epochs to adjust its parameters. The initial learning rate was set to <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mn>1</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, and both the warmup and decay options were turned off. The remaining settings were identical to those of the cross-entropy model.</p>
</sec>
<sec id="s4_7">
<label>4.7</label>
<title>Ablation Studies</title>
<p>In this section, we conducted several ablation experiments for the position-aware transformer model on the MSCOCO dataset. To verify the effectiveness of the sub-modules in our model, a Vanilla Transformer model for image captioning was implemented as a reference. It used the CNN and the transformer encoder as the image encoder and the transformer decoder as the caption decoder. Based on the Vanilla Transformer model, two further models (FPN Transformer and Position-aware Transformer) were implemented as follows:</p>
<p>FPN Transformer: a model equipped with the image-feature attention sub-module that used image features built by the FPN.</p>
<p>Position-aware Transformer: a model equipped with the image-feature attention and position-aware attention sub-modules. This model also used the image features built by the FPN.</p>
<p>In the experiments, the Vanilla Transformer model used the ResNet, pre-trained on the ImageNet dataset, to encode the given image <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mi>I</mml:mi></mml:math></inline-formula> into a spatial image feature taken from the 5th pool layer of the ResNet. We then applied adaptive average pooling to obtain an image spatial feature <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>14</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>14</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="italic">m</mml:mi><mml:mi mathvariant="italic">o</mml:mi><mml:mi mathvariant="italic">d</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">l</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, where 14 &#x00D7; 14 is the number of regions, and <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents a region of the image. 
The FPN Transformer used the same FPN network as in [<xref ref-type="bibr" rid="ref-27">27</xref>] to encode the given image <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mi>I</mml:mi></mml:math></inline-formula> and used the image-feature attention to fuse the image features built by the FPN, again into 14 &#x00D7; 14 regions. The Position-aware Transformer was the proposed approach described in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The hyperparameters of the three models were kept the same wherever possible. <?A3B2 "tbl2",5,"anchor"?><xref ref-type="table" rid="table-2">Tab. 2</xref> presents the test results of the Vanilla Transformer, FPN Transformer and Position-aware Transformer on the BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and CIDEr metrics, and the validation results of the three models are shown in <?A3B2 "fig7",5,"anchor"?><xref ref-type="fig" rid="fig-7">Fig. 7</xref>.</p>
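<p>The adaptive average pooling step mentioned above can be sketched for a single feature channel as follows. This is a minimal pure-Python illustration, assuming the window boundaries commonly used for adaptive pooling (window start at the floor and window end at the ceiling of the scaled index); the function names <monospace>ceil_div</monospace> and <monospace>adaptive_avg_pool_2d</monospace> and the 28 &#x00D7; 28 input size are hypothetical choices for the example. In practice each of the feature channels would be pooled in the same way.</p>
<code language="python">
```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def adaptive_avg_pool_2d(grid: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Average-pool one 2-D channel to a fixed out_h x out_w grid.

    Output cell (i, j) averages input rows floor(i*H/out_h) .. ceil((i+1)*H/out_h) - 1
    and the analogous column range, so any input size maps to the target size.
    """
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ceil_div((j + 1) * in_w, out_w)
            window = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 28 x 28 channel pooled down to the 14 x 14 regions used in the text.
channel = [[float(28 * r + c) for c in range(28)] for r in range(28)]
pooled = adaptive_avg_pool_2d(channel, 14, 14)
regions = [v for row in pooled for v in row]  # flattened region values for this channel
print(len(pooled), len(pooled[0]), len(regions))
```
</code>
<p>Because the window boundaries are derived from the input size, the same routine yields 14 &#x00D7; 14 regions regardless of the input resolution.</p>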
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>The performance of our models optimized by standard cross-entropy loss</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Metric</th>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>METEOR</th>
<th>ROUGE-L</th>
<th>CIDEr</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vanilla transformer</td>
<td>74.9</td>
<td>58.2</td>
<td>44.6</td>
<td>34.1</td>
<td>27.1</td>
<td>55.4</td>
<td>107.6</td>
</tr>
<tr>
<td>FPN transformer</td>
<td>76.2</td>
<td>60.0</td>
<td>46.3</td>
<td>35.6</td>
<td>27.6</td>
<td>56.7</td>
<td>113.9</td>
</tr>
<tr>
<td>Position-aware transformer</td>
<td>76.7</td>
<td>60.8</td>
<td>46.9</td>
<td>36.0</td>
<td>27.8</td>
<td>56.5</td>
<td>114.9</td>
</tr>
</tbody>
</table>
</table-wrap>
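<p>As shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>, adding the image-feature attention and the position-aware attention to the Vanilla Transformer model improves the performance in terms of BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and CIDEr. The Position-aware Transformer obtains the best score on every metric except ROUGE-L, where the FPN Transformer is slightly higher (56.7 vs. 56.5).</p>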
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Validation results of several metrics</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-7.png"/>
</fig>
<p><xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows that the FPN Transformer performs better than the Vanilla Transformer on all metrics. This is because the FPN produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels, which enables a model to detect objects across a large range of scales by scanning over both positions and pyramid levels. The combination of image-feature attention and position-aware attention provides the best performance, mainly because the position-aware attention allows the features to be interpreted in terms of their spatial relationships.</p>
<p>SPICE is a semantic-based method that measures how well caption models recover objects, attributes and relations. To investigate the contribution of each proposed sub-module, we report SPICE F-scores over various subcategories on the MSCOCO testing set in <?A3B2 "tbl3",5,"anchor"?><xref ref-type="table" rid="table-3">Tab. 3</xref> and <?A3B2 "fig8",5,"anchor"?><xref ref-type="fig" rid="fig-8">Fig. 8</xref>. When equipped with the image-feature attention, the FPN Transformer achieves a relative improvement of 2.2% on the SPICE-Objects metric compared with the Vanilla Transformer, exceeding the relative improvement of 1.85% on the SPICE-Relations metric and the relative improvement of 0.15% on the SPICE metric. This shows that the image-feature attention mainly improves the identification of objects. After incorporating the position-aware attention, the Position-aware Transformer shows a larger relative improvement on the SPICE-Relations metric (9.0%) than on the SPICE and SPICE-Objects metrics, demonstrating that the position-aware attention improves the performance by identifying the relationships between objects.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>SPICE F-scores over various subcategories on the MSCOCO test set</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Metric</th>
<th>SPICE</th>
<th>SPICE-objects</th>
<th>SPICE-relations</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vanilla transformer</td>
<td>20.1</td>
<td>36.2</td>
<td>5.4</td>
</tr>
<tr>
<td>FPN transformer</td>
<td>20.1</td>
<td>37.0</td>
<td>5.5</td>
</tr>
<tr>
<td>Position-aware transformer</td>
<td>20.9</td>
<td>37.8</td>
<td>6.0</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Performance comparison of different transformers</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="CMC_19328-fig-8.png"/>
</fig>
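<p>The relative improvements discussed for <xref ref-type="table" rid="table-3">Tab. 3</xref> are of the form (new &#x2212; baseline) / baseline. The Python sketch below recomputes them from the table; the dictionary <monospace>SPICE_F_SCORES</monospace> and the helper names are illustrative only. Because the inputs are the values printed in the table, which are rounded to one decimal place, the results can differ slightly from percentages derived from unrounded scores; for example, the overall SPICE scores of the Vanilla Transformer and the FPN Transformer coincide at this precision.</p>
<code language="python">
```python
# SPICE F-scores copied from Tab. 3 (rounded to one decimal place).
SPICE_F_SCORES = {
    "vanilla": {"SPICE": 20.1, "SPICE-Objects": 36.2, "SPICE-Relations": 5.4},
    "fpn": {"SPICE": 20.1, "SPICE-Objects": 37.0, "SPICE-Relations": 5.5},
    "position_aware": {"SPICE": 20.9, "SPICE-Objects": 37.8, "SPICE-Relations": 6.0},
}


def relative_improvement(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


def compare(baseline_model: str, new_model: str) -> dict:
    """Per-metric relative improvement of new_model over baseline_model."""
    base = SPICE_F_SCORES[baseline_model]
    new = SPICE_F_SCORES[new_model]
    return {metric: relative_improvement(base[metric], new[metric]) for metric in base}


# Effect of each sub-module: Vanilla to FPN, then FPN to Position-aware.
for pair in (("vanilla", "fpn"), ("fpn", "position_aware")):
    gains = compare(*pair)
    print(pair, {metric: round(gain, 2) for metric, gain in gains.items()})
```
</code>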
</sec>
<sec id="s4_8">
<label>4.8</label>
<title>Comparing with Other State-of-the-Art Methods</title>
<p>The experimental results of the Position-aware Transformer and previous state-of-the-art models on the MSCOCO testing set are shown in <?A3B2 "tbl4",5,"anchor"?><xref ref-type="table" rid="table-4">Tab. 4</xref>. All results are produced by models trained with standard cross-entropy loss. The Soft-Attention model [<xref ref-type="bibr" rid="ref-12">12</xref>], which uses the ResNet-101 as the image encoder, is our baseline model.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Experimental results of our approach compared with other methods (optimized by standard cross-entropy loss)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Metric</th>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>METEOR</th>
<th>ROUGE-L</th>
<th>CIDEr</th>
</tr>
</thead>
<tbody>
<tr>
<td>Soft-attention [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>70.7</td>
<td>49.2</td>
<td>34.4</td>
<td>24.3</td>
<td>23.9</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Hard-attention [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>71.8</td>
<td>50.4</td>
<td>35.7</td>
<td>25.0</td>
<td>23.0</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Adaptive [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>74.2</td>
<td>58.0</td>
<td>43.9</td>
<td>33.2</td>
<td>26.6</td>
<td>&#x2013;</td>
<td>108.5</td>
</tr>
<tr>
<td>Bottom-up [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>77.2</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>36.2</td>
<td>27.0</td>
<td>56.4</td>
<td>113.5</td>
</tr>
<tr>
<td>Position-aware transformer</td>
<td>76.7</td>
<td>60.8</td>
<td>46.9</td>
<td>36.0</td>
<td>27.8</td>
<td>56.5</td>
<td>114.9</td>
</tr>
<tr>
<td>Relative improvement</td>
<td>&#x2212;0.6%</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2212;0.6%</td>
<td>3%</td>
<td>0.1%</td>
<td>1.2%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Compared with recent state-of-the-art models trained with cross-entropy loss, our model achieves competitive or better results. Compared with the Bottom-Up model, the METEOR, ROUGE-L and CIDEr scores increase from 27.0 to 27.8, from 56.4 to 56.5 and from 113.5 to 114.9, respectively, while the BLEU-1 and BLEU-4 scores are similar (76.7 vs. 77.2 and 36.0 vs. 36.2). Among these metrics, CIDEr was designed specifically for image captioning, so the gain on it supports the effectiveness of our model.</p>
<p>The experimental results of the Position-aware Transformer and the Bottom-up model, both trained with CIDEr optimization, on the MSCOCO testing set are shown in <?A3B2 "tbl5",5,"anchor"?><xref ref-type="table" rid="table-5">Tab. 5</xref>.</p>
<p>As shown in <xref ref-type="table" rid="table-5">Tab. 5</xref>, our model improves the BLEU-4 score from 36.3 to 38.4, the METEOR score from 27.7 to 28.3, the ROUGE-L score from 56.9 to 58.4 and the CIDEr score from 120.1 to 125.5. All metrics reported for both models increase except BLEU-1, which is unchanged at 79.8; in particular, the CIDEr metric shows a 4.5% relative improvement. This shows that the proposed approach performs better than the Bottom-up model under CIDEr optimization.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Experimental results of our approach compared with the bottom-up (optimized by CIDEr optimization)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Metric</th>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>METEOR</th>
<th>ROUGE-L</th>
<th>CIDEr</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottom-up [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>79.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>36.3</td>
<td>27.7</td>
<td>56.9</td>
<td>120.1</td>
</tr>
<tr>
<td>Position-aware transformer</td>
<td>79.8</td>
<td>64.7</td>
<td>50.2</td>
<td>38.4</td>
<td>28.3</td>
<td>58.4</td>
<td>125.5</td>
</tr>
<tr>
<td>Relative improvement</td>
<td>0%</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>5.8%</td>
<td>2.1%</td>
<td>2.6%</td>
<td>4.5%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion and Future Work</title>
<p>A position-aware transformer with two attention mechanisms, <italic>i.e</italic>., the position-aware attention and the image-feature attention, is proposed in this work. To generate more accurate and more fluent captions, the position-aware attention enables the model to make use of the relative positions between image features. These relative positions are modeled as the directed edges of a directed graph whose vertices represent the elements of the image features. In addition, to enable the model to detect objects of different scales in the image without increasing the number of parameters, the image-feature attention obtains multi-level features from the FPN and fuses them with scaled dot-product attention. With these innovations, we obtained better performance than some state-of-the-art approaches on the MSCOCO benchmark.</p>
<p>At a high level, our work utilizes multi-level features and position information to increase performance. This suggests several directions for future research: (1) The image-feature attention picks features from particular levels for fusion. However, which levels are useful can depend on the specific image. For some images, all the objects may be large, so fusing low-level features may introduce noise into the prediction process because of the weak semantics of low-level features; (2) The position-aware attention uses the relative positions between features to infer the words with abstract concepts in descriptions, but not all such words are related to spatial relationships. Further research will address these issues, and we will also apply this approach to text-based image retrieval.</p>
</sec>
</body>
<back>
<ack>
<p>The authors extend their appreciation to the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia for funding this work through Research Group No. RG-1438-070. This work was also supported by NSFC (61977018) and by the Research Foundation of Education Bureau of Hunan Province of China (16B006).</p>
</ack>
<fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> This work was supported in part by the National Natural Science Foundation of China under Grant No. 61977018, in part by the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia, through Research Group No. RG-1438-070, and in part by the Research Foundation of Education Bureau of Hunan Province of China under Grant 16B006.</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Toshev</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name></person-group>, &#x201C;<article-title>Show and tell: A neural image caption generator</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Boston, MA, USA</publisher-loc>, pp. <fpage>3156</fpage>&#x2013;<lpage>3164</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Socher</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora</article-title>,&#x201D; in <conf-name>Proc. the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>San Francisco, CA, USA</publisher-loc>, pp. <fpage>966</fpage>&#x2013;<lpage>973</lpage>, <year>2010</year>. </mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B. Z.</given-names> <surname>Yao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>S. C.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>I2t: Image parsing to text description</article-title>,&#x201D; <source>Proceedings of the IEEE</source>, vol. <volume>98</volume>, no. <issue>8</issue>, pp. <fpage>1485</fpage>&#x2013;<lpage>1508</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>ImageNet classification with deep convolutional neural networks</article-title>,&#x201D; <source>Communications of the ACM</source>, vol. <volume>60</volume>, no. <issue>6</issue>, pp. <fpage>84</fpage>&#x2013;<lpage>90</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Chen</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>An improved deep fusion CNN for image recognition</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>65</volume>, no. <issue>2</issue>, pp. <fpage>1691</fpage>&#x2013;<lpage>1706</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Ahn</surname></string-name> and <string-name><given-names>H. Y.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Predicting concrete compressive strength using deep convolutional neural network based on image characteristics</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>65</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>17</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Chi</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zhan</surname></string-name></person-group>, &#x201C;<article-title>Corpus augmentation for improving neural machine translation</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>64</volume>, no. <issue>1</issue>, pp. <fpage>637</fpage>&#x2013;<lpage>650</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Qiu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chai</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Si</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Su</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Dependency-based local attention approach to neural machine translation</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>59</volume>, no. <issue>2</issue>, pp. <fpage>547</fpage>&#x2013;<lpage>562</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Toshev</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name></person-group>, &#x201C;<article-title>Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>39</volume>, no. <issue>4</issue>, pp. <fpage>652</fpage>&#x2013;<lpage>663</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Batra</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name></person-group>, &#x201C;<article-title>Neural baby talk</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, pp. <fpage>7219</fpage>&#x2013;<lpage>7228</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>M. B.</given-names> <surname>Van</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Bougares</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Schwenk</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>,&#x201D; in <conf-name>Proc. the Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Doha, Qatar</publisher-loc>, pp. <fpage>1724</fpage>&#x2013;<lpage>1734</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ba</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Kiros</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Courville</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Machine Learning</conf-name>, <publisher-loc>Lille, France</publisher-loc>, pp. <fpage>2048</fpage>&#x2013;<lpage>2057</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Nie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shao</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>5659</fpage>&#x2013;<lpage>5667</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Buehler</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Teney</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Johnson</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Salt Lake City, UT, USA</publisher-loc>, pp. <fpage>6077</fpage>&#x2013;<lpage>6086</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xiong</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name></person-group>, &#x201C;<article-title>Knowing when to look: Adaptive attention via a visual sentinel for image captioning</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>375</fpage>&#x2013;<lpage>383</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <conf-name>Proc. Advances in Neural Information Processing Systems</conf-name>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, pp. <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Yao</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Normalized and geometry-aware self-attention network for image captioning</article-title>,&#x201D; in <conf-name>Proc. the IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Washington, USA</publisher-loc>, pp. <fpage>10327</fpage>&#x2013;<lpage>10336</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Entangled transformer for image captioning</article-title>,&#x201D; in <conf-name>Proc. the IEEE/CVF Int. Conf. on Computer Vision</conf-name>, <publisher-loc>Seoul, Korea</publisher-loc>, pp. <fpage>8928</fpage>&#x2013;<lpage>8937</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Doll&#x00E1;r</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Hariharan</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Feature pyramid networks for object detection</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>2117</fpage>&#x2013;<lpage>2125</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>Bleu: A method for automatic evaluation of machine translation</article-title>,&#x201D; in <conf-name>Proc. the 40th Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Morristown, NJ, USA</publisher-loc>, pp. <fpage>311</fpage>&#x2013;<lpage>318</lpage>, <year>2002</year>. </mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Vedantam</surname></string-name>, <string-name><given-names>C. L.</given-names> <surname>Zitnick</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name></person-group>, &#x201C;<article-title>Cider: Consensus-based image description evaluation</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Boston, MA, USA</publisher-loc>, pp. <fpage>4566</fpage>&#x2013;<lpage>4575</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Denkowski</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name></person-group>, &#x201C;<article-title>Meteor universal: Language specific translation evaluation for any target language</article-title>,&#x201D; in <conf-name>Proc. the Ninth Workshop on Statistical Machine Translation</conf-name>, <publisher-loc>Baltimore, Maryland, USA</publisher-loc>, pp. <fpage>376</fpage>&#x2013;<lpage>380</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Fernando</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Johnson</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Gould</surname></string-name></person-group>, &#x201C;<article-title>Spice: Semantic propositional image caption evaluation</article-title>,&#x201D; in <conf-name>Proc. European Conf. on Computer Vision</conf-name>, <publisher-loc>Amsterdam, Netherlands</publisher-loc>, pp. <fpage>382</fpage>&#x2013;<lpage>398</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. J.</given-names> <surname>Rennie</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Marcheret</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mroueh</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ross</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Goel</surname></string-name></person-group>, &#x201C;<article-title>Self-critical sequence training for image captioning</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, pp. <fpage>7008</fpage>&#x2013;<lpage>7024</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Maire</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Belongie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hays</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Perona</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Microsoft coco: Common objects in context</article-title>,&#x201D; in <conf-name>Proc. European Conf. on Computer Vision</conf-name>, <publisher-loc>Zurich, Switzerland</publisher-loc>, pp. <fpage>740</fpage>&#x2013;<lpage>755</lpage>, <year>2014</year>. </mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Karpathy</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Deep visual-semantic alignments for generating image descriptions</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Boston, MA, USA</publisher-loc>, pp. <fpage>3128</fpage>&#x2013;<lpage>3137</lpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kong</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Jiang</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Solo: Segmenting objects by locations</article-title>,&#x201D; in <conf-name>Proc. European Conf. on Computer Vision</conf-name>, <publisher-loc>Glasgow, UK</publisher-loc>, pp. <fpage>649</fpage>&#x2013;<lpage>665</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhong</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Qin</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>V. W.</given-names> <surname>Zheng</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Adam revisited: A weighted past gradients perspective</article-title>,&#x201D; <source>Frontiers of Computer Science</source>, vol. <volume>14</volume>, no. <issue>5</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>16</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Vanhoucke</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shlens</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Wojna</surname></string-name></person-group>, &#x201C;<article-title>Rethinking the inception architecture for computer vision</article-title>,&#x201D; in <conf-name>Proc. the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Las Vegas, NV, USA</publisher-loc>, pp. <fpage>2818</fpage>&#x2013;<lpage>2826</lpage>, <year>2016</year>. </mixed-citation></ref>
</ref-list>
</back>
</article>
