<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">38220</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.038220</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Improving Targeted Multimodal Sentiment Classification with Semantic Description of Images</article-title>
<alt-title alt-title-type="left-running-head">Improving Targeted Multimodal Sentiment Classification with Semantic Description of Images</alt-title>
<alt-title alt-title-type="right-running-head">Improving Targeted Multimodal Sentiment Classification with Semantic Description of Images</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>An</surname><given-names>Jieyu</given-names></name><email>anjieyu@student.usm.my</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zainon</surname><given-names>Wan Mohd Nazmee Wan</given-names></name></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Hao</surname><given-names>Zhang</given-names></name></contrib>
<aff><institution>School of Computer Sciences, Universiti Sains Malaysia</institution>, <addr-line>Penang, 11800</addr-line>, <country>Malaysia</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jieyu An. Email: <email>anjieyu@student.usm.my</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>1</day><month>5</month><year>2023</year></pub-date>
<volume>75</volume>
<issue>3</issue>
<fpage>5801</fpage>
<lpage>5815</lpage>
<history>
<date date-type="received"><day>02</day><month>12</month><year>2022</year>
</date>
<date date-type="accepted"><day>16</day><month>3</month><year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 An, Zainon and Hao</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>An, Zainon and Hao</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_38220.pdf"></self-uri>
<abstract>
<p>Targeted multimodal sentiment classification (TMSC) aims to identify the sentiment polarity of a target mentioned in a multimodal post. Most current studies on this task map the image and the text to a high-dimensional space in order to obtain and fuse implicit representations. In doing so, they ignore the rich semantic information contained in the images and do not account for the contribution of the visual modality to the multimodal fusion representation, which can influence the results of TMSC tasks. This paper proposes a general model for Improving Targeted Multimodal Sentiment Classification with Semantic Description of Images (ITMSC) to tackle these issues and improve the accuracy of multimodal sentiment analysis. Specifically, the ITMSC model automatically adjusts the contribution of images in the fusion representation by exploiting semantic descriptions of images and text similarity relations. Further, we propose a target-based attention module to capture the target-text relevance, an image-based attention module to capture the image-text relevance, and a target-image matching module based on the former two modules to properly align the target with the image so that fine-grained semantic information can be extracted. Our experimental results demonstrate that our model achieves performance comparable to several state-of-the-art approaches on two multimodal sentiment datasets. Our findings indicate that incorporating semantic descriptions of images can enhance the understanding of multimodal content and lead to improved sentiment analysis performance.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Targeted sentiment analysis</kwd>
<kwd>multimodal sentiment classification</kwd>
<kwd>visual sentiment</kwd>
<kwd>textual sentiment</kwd>
<kwd>social media</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>With the rise in popularity of social media, an increasing number of users use multimodal posts to express their emotions or opinions (i.e., many posts contain both text and related images). Effective sentiment analysis of massive and multimodal social media data can aid in comprehending public sentiment and opinion trends, providing a scientific foundation for government and corporate decision-making [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>]. In comparison to traditional textual sentiment analysis [<xref ref-type="bibr" rid="ref-4">4</xref>], performing sentiment analysis utilizing data from different modalities presents a number of opportunities and challenges.</p>
<p>Targeted multimodal sentiment classification (TMSC) is a fine-grained natural language processing task that extracts sentiment polarity (e.g., positive, negative, or neutral) and has attracted increasing research interest over the past few years. Its goal is to automatically identify the underlying attitude toward targeted entities (i.e., aspects) in a sentence and image pair. As shown in <xref ref-type="table" rid="table-1">Table 1</xref>, the sentiment of the targeted entity is expected to be identified from the multimodal post (i.e., <italic>SamHunt</italic>: <italic>positive</italic>).</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>An example of the TMSC task in the Twitter dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Textual modality</th>
<th># <bold>SamHunt</bold> performs at Stagecoach # MusicFestival 2016!</th>
</tr>
</thead>
<tbody>
<tr>
<td>Visual modality</td>
<td><inline-graphic xlink:href="CMC_38220-inline-1.tif"/></td>
</tr>
<tr>
<td>Target polarity</td>
<td>SamHunt: <italic>positive</italic></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Many approaches have been proposed in recent years to perform sentiment classification for TMSC and have gained attention [<xref ref-type="bibr" rid="ref-5">5</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>]. Although these studies have shown that combining textual and visual modality information can improve performance on the sentiment classification task, they have the following limitations:
<list list-type="simple">
<list-item><label>(1)</label><p>These studies perform sentiment classification by fusing only text and image representations, without taking into account the visual modality&#x2019;s contribution to the fusion. Images and text are not always consistent in their sentiment tendencies: when they are consistent, the image can serve sentiment classification more effectively, but when they are inconsistent, it can compromise accuracy.</p></list-item>
<list-item><label>(2)</label><p>These methods exploit only the representational information of the image, ignoring the supplementary information that comes from the semantic description of the image. By using semantic descriptions of images, we can recognize information such as objects, positions, and actions in images. For instance, based on the image in <xref ref-type="table" rid="table-1">Table 1</xref>, we can generate the following image description: <italic>A man is holding up a microphone to take a picture</italic>. In this description, &#x201C;<italic>a man</italic>&#x201D; is aligned with the visual modality, which is indicated by the blue rectangle, and it is also aligned with <italic>SamHunt</italic>, which is indicated by the red underline in the textual modality. Our hypothesis is that the semantic description of an image conveys what the image is about and may guide the model to focus on the parts of the image that match the given target while reducing noise from other parts.</p>
</list-item>
</list></p>
<p>In order to address the above limitations, we propose an Improving Targeted Multimodal Sentiment Classification (ITMSC) model based on the semantic description of images for the TMSC task. The following is a condensed summary of the most important contributions made by this paper:
<list list-type="bullet">
<list-item>
<p>To the best of our knowledge, this is the first time that semantic descriptions of images have been used to establish the information interaction with images and text to obtain more semantic information for the TMSC task.</p></list-item>
<list-item>
<p>To adjust the contribution of the visual modality in the fusion representation of different modalities, we propose a method to automatically and dynamically adjust the input of the image based on the similarity between the image description and the text.</p></list-item>
<list-item>
<p>To obtain the finer-grained semantic alignment information between different modalities, we develop three matching modules that effectively reduce redundant data and extract meaningful information.</p></list-item>
</list></p>
<p>Experiments conducted on two multimodal sentiment datasets show that ITMSC achieves superior performance compared with the most advanced models currently available. Furthermore, our model generates insightful and interpretable visualizations that highlight the importance of semantic descriptions of images for the TMSC task.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Multimodal Sentiment Analysis</title>
<p>Sentiment analysis attempts to determine the emotional orientation (e.g., negative, neutral, or positive) of user-generated content [<xref ref-type="bibr" rid="ref-9">9</xref>]. It enables machines to comprehend human emotions and react appropriately. As the need to automate the evaluation of customers&#x2019; feelings about products or services rises, sentiment analysis has made significant progress within areas such as natural language processing and artificial intelligence.</p>
<p>Sentiment analysis based on textual content alone [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-13">13</xref>] is no longer sufficient in today&#x2019;s social media environment, as users often share and discuss content that is presented in a multimodal form. Multimodal sentiment analysis, a new part of the field of multimodal machine learning, is therefore attracting growing attention from researchers.</p>
<p>Early multimodal works relied predominantly on handcrafted features. However, handcrafted features are usually created with limited human knowledge and cannot fully describe the highly abstract nature of emotions, leading to suboptimal results [<xref ref-type="bibr" rid="ref-14">14</xref>]. In the last few years, there have been significant advances in the field of multimodal sentiment analysis due to the emergence of deep learning models [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>]. Most of these studies reached a similar conclusion: correlating and supplementing the information contained in data from various modalities yields more accurate sentiment classification than analysis based on a single modality. In general, a close association exists between the text and the image in users&#x2019; posts on social media platforms [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-21">21</xref>]. However, these coarse-grained multimodal sentiment classification methods cannot be applied directly to targeted sentiment classification tasks. Consequently, fine-grained multimodal sentiment analysis is the primary focus of our work.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Targeted Multimodal Sentiment Analysis</title>
<p>Targeted multimodal sentiment analysis is a fine-grained sentiment analysis task. It is a relatively new field that aims to understand the sentiment expressed towards a particular entity or topic by leveraging information from multiple modalities, which allows for a more comprehensive view of that sentiment.</p>
<p>In the literature, this task has attracted considerable interest from researchers. In 2019, Xu et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] proposed a multi-interactive memory network to independently model text and image data and learn the interactive influences of cross-modality data. In 2020, Yu et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] proposed an entity-sensitive attention and fusion network (ESAFN) to study intra-modality and inter-modality interactions in a sentence and image pair for targeted sentiment classification. In addition, with the widespread use of pre-trained models, Yu et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed a target-oriented multimodal bidirectional encoder representation from Transformers (TomBERT) architecture to capture the relationship between target, text, and image for the TMSC task. More recently, to overcome the problem of short texts with little information, Khan et al. [<xref ref-type="bibr" rid="ref-8">8</xref>] introduced a two-stream model in 2021 that first translates images into auxiliary sentences and then fuses the input sentence and auxiliary sentences for the TMSC task. In 2022, Ye et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] introduced a sentiment-aware multimodal pre-training (SMP) framework to address the lack of attention to sentiment signals in most existing multimodal pre-trained models, which mainly focus on general lexical and/or visual information. In addition, Yu et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] proposed a multi-task learning architecture named coarse-to-fine-grained Image-Target Matching network (ITM) in order to capture both coarse-grained and fine-grained image-target matching.</p>
<p>Our research is similar to the approach of Khan et al. [<xref ref-type="bibr" rid="ref-8">8</xref>], as we also employ their image descriptions as auxiliary sentences for the TMSC task. However, we argue that merely combining the input sentence and image description, without incorporating the image feature, may lead to a loss of crucial sentiment-related information. To address this issue, our research incorporates all three elements (input sentence, image description, and image feature) into the sentiment analysis process. To extract in-depth semantic information from the feature data, we further propose an attention-based fusion mechanism, which we validate experimentally. Our results demonstrate a significant improvement over the approach of Khan et al.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>In contrast to existing approaches, we utilize not only text and image information for targeted multimodal sentiment classification but also the image description as supplementary information to investigate image and text interactions. In this section, we define the TMSC task and present the overall architecture of the proposed ITMSC model, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. We then elaborate on the details of each module in ITMSC. Our research mainly concerns social media applications in which user-generated content consists of a piece of text accompanied by an image.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The overall architecture of the proposed ITMSC</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_38220-fig-1.tif"/>
</fig>
<p><bold>Task Formulation:</bold> A multimodal sentiment sample <italic>M</italic> is a sentence and image pair that consists of an associated image <italic>I</italic>, an opinion target with <italic>m</italic> words <italic>T</italic> &#x003D; (<italic>t</italic><sub>1</sub>, <italic>t</italic><sub>2</sub>, &#x2026;, <italic>t</italic><sub>m</sub>), and a sentence with <italic>n</italic> words <italic>S</italic> &#x003D; (<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>, &#x2026;, <italic>s</italic><sub>n</sub>). The goal is to predict the polarity label <italic>y</italic>, which can be <italic>neutral</italic>, <italic>negative</italic>, or <italic>positive</italic>, of each opinion target mentioned in <italic>M</italic>.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Input Representations Extraction</title>
<p><bold>Textual Representation Extraction</bold>. When dealing with textual representation extraction, Bidirectional Encoder Representations from Transformers (BERT) [<xref ref-type="bibr" rid="ref-24">24</xref>] is widely used as a language representation model that can capture the full semantic information of a sentence and also discover association features between words through the context [<xref ref-type="bibr" rid="ref-25">25</xref>]. Following previous research [<xref ref-type="bibr" rid="ref-7">7</xref>], given a sentence <italic>S</italic> as input, we first extract the input target <italic>T</italic> from the sentence and replace it with a special token <italic>$T$</italic>. To enable BERT to process these text sequences, we add a special classification token [CLS] at the beginning of the sentence and a special segmentation token [SEP] between different text sequences. Then, we utilize the fine-tuned BERT to obtain the hidden representation <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as illustrated in <xref ref-type="disp-formula" rid="eqn-1">Eqs. (1)</xref> and <xref ref-type="disp-formula" rid="eqn-2">(2)</xref>. 
Similarly, using the same fine-tuned BERT, we obtain the representation <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> of the image description <italic>D</italic> and the concatenated textual representation <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as illustrated in <xref ref-type="disp-formula" rid="eqn-3">Eqs. (3)</xref> and <xref ref-type="disp-formula" rid="eqn-4">(4)</xref>.</p>
<p><disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>T</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>D</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p><disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mo>]</mml:mo></mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mi>T</mml:mi><mml:mo>+</mml:mo><mml:mi>D</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mi>P</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>+</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula>where <italic>n</italic>, <italic>t</italic>, and <italic>m</italic> are the lengths of <italic>S</italic>, <italic>T</italic>, and <italic>D</italic>, respectively, and <italic>d</italic> &#x003D; 768 is the dimension of the hidden state.</p>
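<p>The input sequences of Eqs. (1)&#x2013;(4) can be sketched as plain strings. This is a minimal illustration rather than the authors&#x2019; implementation: the function name <italic>build_inputs</italic> is hypothetical, the &#x201C;T + D&#x201D; term of Eq. (4) is assumed to denote simple concatenation of the target and the description, and in practice a BERT tokenizer would normally insert the special tokens itself.</p>
<preformat preformat-type="code"><![CDATA[
```python
def build_inputs(sentence, target, description, placeholder="$T$"):
    """Assemble the four text sequences of Eqs. (1)-(4) as plain strings.

    Hypothetical helper for illustration only; "T + D" in Eq. (4) is
    assumed to mean the target followed by the image description.
    """
    # Replace the first mention of the target with the placeholder token.
    masked = sentence.replace(target, placeholder, 1)
    return {
        "S": f"[CLS] {masked} [SEP] {target} [SEP]",                  # Eq. (1)
        "T": f"[CLS] {target} [SEP]",                                 # Eq. (2)
        "D": f"[CLS] {description} [SEP]",                            # Eq. (3)
        "SD": f"[CLS] {masked} [SEP] {target} {description} [SEP]",   # Eq. (4)
    }


# Example taken from Table 1 and the image description quoted in the Introduction.
seqs = build_inputs(
    "# SamHunt performs at Stagecoach # MusicFestival 2016!",
    "SamHunt",
    "A man is holding up a microphone to take a picture",
)
for name, text in seqs.items():
    print(name, "->", text)
```
]]></preformat>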
<p><bold>Visual Representation Extraction</bold>. When it comes to extracting features of the images, we adopt one of the most advanced image recognition models, Residual Network 152 (ResNet) [<xref ref-type="bibr" rid="ref-26">26</xref>], pre-trained on ImageNet [<xref ref-type="bibr" rid="ref-27">27</xref>] classification, to obtain the image representation. Before feeding the image into the model, we first rescale it to 224 &#x00D7; 224 pixels. The visual representation is then obtained from the last convolutional layer of ResNet:</p>
<p><disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>I</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mn>2048</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mn>49</mml:mn><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where 49 is the number of equally sized regions into which the image is split (a 7 &#x00D7; 7 grid), and <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a 2048-dimensional vector representing each region, as depicted in the upper left-hand corner of <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. Further, we use a linear function to map each region of the image into the same space as the text representation:</p>

<p><disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>I</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2048</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is a weight parameter.</p>
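<p>The region extraction and projection of Eqs. (5) and (6) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the feature map is random rather than produced by ResNet, the helper names <italic>flatten_regions</italic> and <italic>project_regions</italic> are hypothetical, and the channel and hidden sizes are reduced from 2048 and 768 so that the example runs quickly. Each region is stored as a row, so the result has 49 rows of length <italic>d</italic>.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def flatten_regions(feature_map):
    """Turn a rows x cols grid of channel vectors into a list of region vectors."""
    return [vec for row in feature_map for vec in row]


def project_regions(regions, weight):
    """Apply the linear map W_v to every region; weight has shape dim x channels."""
    return [
        [sum(wt * x for wt, x in zip(w_row, region)) for w_row in weight]
        for region in regions
    ]


random.seed(0)
# Toy sizes (assumption): the paper uses channels = 2048 and dim = 768.
rows, cols, channels, dim = 7, 7, 8, 4

# Stand-in for the last convolutional feature map of ResNet on a 224 x 224 image.
feature_map = [
    [[random.random() for _ in range(channels)] for _ in range(cols)]
    for _ in range(rows)
]
# Stand-in for the learned weight parameter W_v.
W_v = [[random.gauss(0.0, 0.1) for _ in range(channels)] for _ in range(dim)]

regions = flatten_regions(feature_map)   # Eq. (5): 49 region vectors r_j
H_V = project_regions(regions, W_v)      # Eq. (6): one dim-sized vector per region

print(len(regions), len(regions[0]))  # 49 8
print(len(H_V), len(H_V[0]))          # 49 4
```
]]></preformat>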
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Multimodal Interaction</title>
<p>While image descriptions can provide supplementary semantic information, they are not always advantageous when attempting to classify sentiment. An excess of semantic information can increase background noise and reduce classification accuracy. To address this problem, we propose three main modules: (1) a Target-based Attention Module to capture the target-text relevance; (2) an Image-based Attention Module to capture the image-text relevance; and (3) a Target-Image Matching Module to align the target-text with the image-text to obtain fine-grained semantic information.</p>
<p><bold>Target-based Attention Module</bold>. Since the input target is extracted from a sentence, as previously mentioned, we argue that it is unsuitable for direct use in sentiment analysis tasks because it lacks contextually relevant semantic information. Consequently, we apply the cross-modal Transformer layer [<xref ref-type="bibr" rid="ref-28">28</xref>] to model the interaction between the input target and the concatenated text, where the representations of the input target <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> serve as queries, and the representations of the contextualized text <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> serve as keys and values:</p>
<p><disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:msub><mml:msup><mml:mi>H</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mi>T</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mtext>Cross</mml:mtext><mml:mo>&#x2212;</mml:mo><mml:mtext>ATT</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:msubsup><mml:mi>H</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the generated target-based attention representation.</p>
<p><bold>Image-based Attention Module.</bold> To obtain the semantic information jointly presented by the image and the concatenated text, we use another cross-modal Transformer layer to model the interaction between the image and the concatenated text, where the representations of the image <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> serve as queries and the representations of the contextualized text <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> again serve as keys and values:</p>
<p><disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Cross</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>ATT</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>49</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> is the generated image-based attention representation.</p>
<p><bold>Target-Image Matching Module.</bold> Building on the Target-based Attention Module and the Image-based Attention Module, which both extract key information and reduce the impact of irrelevant information, the Target-Image Matching Module aims to align the target-based attention representation with the image-based attention representation. Specifically, we use the target-based attention representation <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> as queries and the image-based attention representation <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> as keys and values:</p>
<p><disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Cross</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>ATT</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the matched representation.</p>
<p>Although the representation fused in the above way reduces data redundancy, it does not consider the contribution of the image modality to the fusion. Incorrect correlations between the modalities could lead to the combination of unrelated information, thereby reducing the accuracy of the final classification results. Consequently, our model first measures the relationship between the text and the image by computing the similarity between the text and the image description. Specifically, given that BERT captures a wealth of semantic information [<xref ref-type="bibr" rid="ref-29">29</xref>], we directly create two sentence vectors, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula>, by averaging all of the word vectors in the final hidden representation layer of <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, respectively, and then compute the cosine similarity of the two sentence vectors:</p>
<p><disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mo symmetric="true">&#x2016;</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo symmetric="true">&#x2016;</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula></p>
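<p>As an illustration of the mean pooling and Eq. (10), the following sketch averages token vectors into sentence vectors and computes their cosine similarity. The helper names and the toy hidden states are assumptions chosen for readability; they are not real BERT outputs.</p>
<preformat>
```python
import math

def mean_pool(token_vectors):
    """Average a list of token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity as in Eq. (10): dot(u, v) / (||u|| * ||v||).

    Both vectors are assumed to be non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy hidden states (2 tokens each, d = 3).
H_S = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]   # sentence tokens
H_D = [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]   # image-description tokens
sim = cosine_similarity(mean_pool(H_S), mean_pool(H_D))
```
</preformat>
<p>Both toy inputs pool to the same sentence vector, so the similarity is 1 up to floating-point rounding; orthogonal sentence vectors would give 0.</p>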
<p>We then dynamically regulate the contribution of the image modality to the fusion process according to the degree of similarity. To this end, we construct a visual filter matrix <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>G</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>49</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> based on the similarity score in <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>, which indicates the relevance between the text and the image. The filtered image representations are then obtained with the visual filter matrix:</p>
<p><disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x2299;</mml:mo><mml:mi>G</mml:mi><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes element-by-element multiplication. Consequently, <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref> is replaced by the filtered representation obtained from <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>.</p>
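<p>The filtering step of Eq. (11) can be sketched as follows. The text does not spell out how the entries of the filter matrix are derived from the similarity score, so this sketch makes one possible assumption: every entry of the matrix equals the scalar similarity. The function names and toy sizes are likewise illustrative only.</p>
<preformat>
```python
def build_visual_filter(similarity, d, regions):
    """Assumed construction: a d x regions matrix filled with the similarity score."""
    return [[similarity for _ in range(regions)] for _ in range(d)]

def hadamard(a, b):
    """Element-by-element product of two equally shaped matrices, as in Eq. (11)."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Toy sizes for readability (assumed); the paper uses a d x 49 matrix.
d, regions = 2, 3
H_V_prime = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
G = build_visual_filter(0.5, d, regions)
H_V_filtered = hadamard(H_V_prime, G)
```
</preformat>
<p>Under this assumption, a low text-image similarity scales all image features down uniformly, which reduces their influence when they serve as keys and values in Eq. (9).</p>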
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Multimodal Sentiment Classification</title>
<p>Given the representation <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> generated by the Target-Image Matching Module, we perform late feature fusion by concatenating it with the sentence representation <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and feeding the result to a self-attention layer for multimodal fusion:</p>
<p><disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Self</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext>ATT</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Finally, the representation of the first token <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>0</mml:mn> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is passed through a linear transformation and layer normalization (<italic>LN</italic>), and the softmax layer produces the prediction result <italic>y</italic>:</p>
<p><disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>W</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mrow><mml:mo>[</mml:mo> <mml:mn>0</mml:mn> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
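<p>A minimal sketch of the classification head in Eq. (13) is given below. It assumes layer normalization without learned gain and bias, and it uses small hypothetical values for the weights and the first-token representation; the helper names are illustrative.</p>
<preformat>
```python
import math

def linear(W, h, b):
    """Compute W^T h + b, with W stored as a d x c nested list."""
    c = len(b)
    return [sum(W[i][k] * h[i] for i in range(len(h))) + b[k] for k in range(c)]

def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: d = 2 hidden units, c = 3 sentiment classes.
h0 = [1.0, -1.0]
W = [[0.5, 0.0, -0.5], [0.0, 0.5, 0.5]]
b = [0.0, 0.0, 0.0]
y = softmax(layer_norm(linear(W, h0, b)))
predicted_class = max(range(len(y)), key=lambda k: y[k])
```
</preformat>
<p>Because layer normalization and softmax both preserve the ordering of the logits, the predicted class is the one with the largest value of the linear output.</p>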
<p>Following the majority of multimodal sentiment analysis methods, we use the cross-entropy loss as the classification loss for model training:</p>
<p><disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msup><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>I</mml:mi><mml:mi>T</mml:mi><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></disp-formula>where <italic>N</italic> is the number of samples for the classification task and <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the probability distribution over the target sentiment classes predicted by our model for the <italic>i</italic>-th sample.</p>
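<p>Eq. (14) can be sketched as the average negative log-probability that the model assigns to the gold class. This reading of the probability term, the function name, and the toy predictions are assumptions made for illustration.</p>
<preformat>
```python
import math

def cross_entropy(probabilities, labels):
    """Average negative log-probability of the gold class, following Eq. (14).

    probabilities: one predicted distribution per sample
    labels: index of the gold class for each sample
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / n

# Hypothetical predictions over (negative, neutral, positive) for two samples.
probs = [[0.1, 0.2, 0.7], [0.25, 0.5, 0.25]]
gold = [2, 1]
loss = cross_entropy(probs, gold)
```
</preformat>
<p>The loss is zero only when every gold class receives probability 1, and it grows as the probability assigned to the gold class shrinks.</p>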
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>We evaluate our ITMSC model on the publicly available Twitter-2015 and Twitter-2017 multimodal sentiment datasets presented by Yu et al. [<xref ref-type="bibr" rid="ref-7">7</xref>], which consist of multimodal tweets posted to Twitter. Each multimodal tweet contains a sentence, an accompanying image, the targets in the sentence, and the sentiment polarity of each target, which Yu et al. labeled as negative, neutral, or positive. To ensure a fair comparison, we follow the same dataset partitioning as several recent publications [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>]. The characteristics of the datasets are summarized in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>The statistics of the multimodal sentiment databases for Twitter-2015 and Twitter-2017</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th rowspan="2">Label</th>
<th align="center" colspan="3">Twitter-2015</th>
<th align="center" colspan="3">Twitter-2017</th>
</tr>
<tr>
<th>Train</th>
<th>Validation</th>
<th>Test</th>
<th>Train</th>
<th>Validation</th>
<th>Test</th>
</tr>
</thead>
<tbody>
<tr>
<td>Positive</td>
<td>928</td>
<td>303</td>
<td>317</td>
<td>1508</td>
<td>515</td>
<td>493</td>
</tr>
<tr>
<td>Neutral</td>
<td>1883</td>
<td>670</td>
<td>607</td>
<td>1638</td>
<td>517</td>
<td>573</td>
</tr>
<tr>
<td>Negative</td>
<td>368</td>
<td>149</td>
<td>113</td>
<td>416</td>
<td>144</td>
<td>168</td>
</tr>
<tr>
<td>Total</td>
<td>3179</td>
<td>1122</td>
<td>1037</td>
<td>3562</td>
<td>1176</td>
<td>1234</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental Settings</title>
<p>We obtained the text representation by fine-tuning a BERT-base model and the image representation using a pre-trained ResNet152. The experiments were conducted on Google Colaboratory with a Tesla Graphics Processing Unit (GPU) with 16 GB of Random Access Memory (RAM), and the model was implemented in PyTorch. <xref ref-type="table" rid="table-3">Table 3</xref> lists the hyper-parameters used in our model, such as the maximum text length, description length, target length, training batch size, and learning rate.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Settings of important parameters</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Hyperparameters</th>
<th>Twitter-2015</th>
<th>Twitter-2017</th>
</tr>
</thead>
<tbody>
<tr>
<td>Maximum text length</td>
<td>64</td>
<td>64</td>
</tr>
<tr>
<td>Maximum description length</td>
<td>64</td>
<td>64</td>
</tr>
<tr>
<td>Maximum target length</td>
<td>16</td>
<td>16</td>
</tr>
<tr>
<td>Attention heads</td>
<td>12</td>
<td>12</td>
</tr>
<tr>
<td>Hidden dimension</td>
<td>768</td>
<td>768</td>
</tr>
<tr>
<td>Learning rate</td>
<td>2e-5</td>
<td>4e-5</td>
</tr>
<tr>
<td>Training batch size</td>
<td>32</td>
<td>16</td>
</tr>
<tr>
<td>Training epochs</td>
<td>8</td>
<td>8</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Experimental Results and Analysis</title>
<p>To validate the efficacy of the proposed ITMSC model for targeted multimodal sentiment classification, we compare our approach with various competitive existing methods in terms of accuracy and Macro-F1 score. The results provide evidence that the accuracy of multimodal sentiment classification can be improved using semantic descriptions of images.</p>
<p>We choose three kinds of baselines. The first category is the visual-based ResNet-Aspect model. The second category consists of traditional text-based models, including BERT, the long short-term memory with aspect embedding (AE-LSTM) [<xref ref-type="bibr" rid="ref-30">30</xref>], the deep memory network (MemNet) [<xref ref-type="bibr" rid="ref-31">31</xref>], and the recurrent attention network on memory (RAM) [<xref ref-type="bibr" rid="ref-32">32</xref>]. The third category consists of multimodal models: the multi-interactive memory network (MIMN) [<xref ref-type="bibr" rid="ref-5">5</xref>], the entity-sensitive attention and fusion network (ESAFN) [<xref ref-type="bibr" rid="ref-6">6</xref>], the target-oriented multimodal bidirectional encoder representation from Transformers (TomBERT) [<xref ref-type="bibr" rid="ref-7">7</xref>], the model that exploits BERT for multimodal target sentiment classification through input space translation (EF-CapTrBERT) [<xref ref-type="bibr" rid="ref-8">8</xref>], the sentiment-aware multimodal pre-training framework (SMP) [<xref ref-type="bibr" rid="ref-22">22</xref>], the coarse-to-fine grained Image-Target Matching network (ITM) [<xref ref-type="bibr" rid="ref-23">23</xref>], and the vision-and-language BERT (ViLBERT) [<xref ref-type="bibr" rid="ref-33">33</xref>].</p>
<p><xref ref-type="table" rid="table-4">Table 4</xref> compares the accuracy and Macro-F1 score of the proposed ITMSC method with those of the baseline models on both the Twitter-2015 and Twitter-2017 datasets. The baseline results were obtained in different ways: TomBERT&#x2019;s results were produced using the model provided in [<xref ref-type="bibr" rid="ref-7">7</xref>], while the results of EF-CapTrBERT and ITM were generated by running their provided code. The results of the remaining baselines were taken directly from the original papers.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comparison of different methods on two Twitter datasets. Results marked with &#x002A; are averages over five runs with random seeds ranging from 42 to 46. The marker &#x00B1; denotes the standard deviation of the results, and the marker &#x2020; indicates that the significance test p-value is less than 0.05</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>Methods</th>
<th align="center" colspan="2">Twitter-2015</th>
<th align="center" colspan="2">Twitter-2017</th>
</tr>
<tr>
<th/>
<th/>
<th>Accuracy</th>
<th>Macro-F1</th>
<th>Accuracy</th>
<th>Macro-F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Image only</td>
<td>ResNet</td>
<td>59.88</td>
<td>46.48</td>
<td>58.59</td>
<td>53.98</td>
</tr>
<tr>
<td>Text only</td>
<td>AE-LSTM</td>
<td>70.30</td>
<td>63.43</td>
<td>61.67</td>
<td>57.97</td>
</tr>
<tr>
<td/>
<td>MemNet</td>
<td>70.11</td>
<td>61.76</td>
<td>64.18</td>
<td>60.80</td>
</tr>
<tr>
<td/>
<td>RAM</td>
<td>70.68</td>
<td>63.05</td>
<td>64.42</td>
<td>61.01</td>
</tr>
<tr>
<td/>
<td>BERT</td>
<td>74.15</td>
<td>68.86</td>
<td>68.15</td>
<td>65.23</td>
</tr>
<tr>
<td>Image&#x002B;Text</td>
<td>MIMN (2019)</td>
<td>71.84</td>
<td>65.69</td>
<td>65.88</td>
<td>62.99</td>
</tr>
<tr>
<td/>
<td>ViLBERT (2019)</td>
<td>73.76</td>
<td>69.85</td>
<td>67.42</td>
<td>64.87</td>
</tr>
<tr>
<td/>
<td>TomBERT (2019)<sup>&#x002A;</sup></td>
<td>76.82 <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.08</td>
<td>71.04 <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.12</td>
<td>70.02 <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.17</td>
<td>67.67 <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.12</td>
</tr>
<tr>
<td/>
<td>ESAFN (2020)</td>
<td>73.38</td>
<td>67.37</td>
<td>67.83</td>
<td>64.22</td>
</tr>
<tr>
<td/>
<td>EF-CapTrBERT (2021)<sup>&#x002A;</sup></td>
<td>76.32 <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.85</td>
<td>71.54 <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.73</td>
<td>67.96 <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 1.16</td>
<td>65.61 <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.94</td>
</tr>
<tr>
<td/>
<td>SMP (2022)</td>
<td>77.53</td>
<td>72.24</td>
<td>71.15</td>
<td>69.47</td>
</tr>
<tr>
<td/>
<td>ITM (2022)<sup>&#x002A;</sup></td>
<td>77.38 &#x00B1; 0.56</td>
<td>72.43 &#x00B1; 1.15</td>
<td><bold>71.79 &#x00B1; 0.32</bold></td>
<td><bold>70.38 &#x00B1; 0.32</bold></td>
</tr>
<tr>
<td/>
<td>ITMSC (Ours)</td>
<td><bold>78.59 </bold><inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> <bold>0.11</bold><sup>&#x2020;</sup></td>
<td><bold>74.28 </bold><inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> <bold>0.09</bold><sup>&#x2020;</sup></td>
<td>70.28 <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.06</td>
<td>68.40 <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mo>&#x00B1;</mml:mo></mml:math></inline-formula> 0.07</td>
</tr>
</tbody>
</table>
</table-wrap>
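<p>For reference, the two metrics reported in Table 4 can be computed as in the following sketch. The toy labels are hypothetical and not drawn from the datasets, and assigning an F1 of zero to a class with no true positives, false positives, or false negatives is an assumed convention.</p>
<preformat>
```python
def accuracy(gold, pred):
    """Fraction of targets whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); assumed to be 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for six targets.
gold = ["pos", "pos", "neu", "neu", "neu", "neg"]
pred = ["pos", "neu", "neu", "neu", "neu", "neu"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, ["neg", "neu", "pos"])
```
</preformat>
<p>In this toy example the minority negative class is never predicted, which lowers Macro-F1 well below accuracy. This is why Macro-F1 is informative on datasets with imbalanced labels such as those in Table 2.</p>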
<p><xref ref-type="table" rid="table-4">Table 4</xref> compares our method quantitatively with other state-of-the-art approaches. On Twitter-2015, our method achieves the best accuracy (78.59) and Macro-F1 (74.28) among all compared methods. On Twitter-2017, it remains competitive (70.28 accuracy and 68.40 Macro-F1), although SMP and ITM score higher. These results support the effectiveness of our proposed improvement strategies for the TMSC task. From a careful comparison of the experimental results, we draw the following key findings:</p>

<p>Firstly, the method that relies solely on images for sentiment analysis performs the worst. This is mostly because images lack the contextual information required for accurate analysis, and emotions cannot be expressed as explicitly and directly through images as through words. Therefore, it is desirable to use additional modality information, such as the corresponding text, to obtain a more accurate and reliable sentiment analysis.</p>
<p>Secondly, the text-based methods outperform the image-based method. This is because emotional cues in textual content are typically more powerful and informative than those in visual data. Additionally, textual data is generally more accessible and easier to process, which makes emotional cues easier to detect and interpret. Furthermore, BERT consistently outperforms the other text-based baselines. We attribute this to the ability of BERT models to learn from massive datasets, which in turn improves their performance at extracting task-relevant features.</p>
<p>Thirdly, the majority of multimodal approaches outperform their corresponding unimodal baselines by a significant margin. This demonstrates that relying solely on text or images is typically insufficient for sentiment classification. Different modalities provide richer information and support each other during fusion, which helps capture implied semantic features that are otherwise difficult to access and leads to more accurate predictions.</p>
<p>Fourthly, MIMN and ESAFN obtain the worst results among the multimodal methods. We attribute this to the absence of pre-trained models that can extract relevant features from both text and images. ViLBERT, although pre-trained, also performs relatively poorly, which may be due to its failure to explicitly model the interaction between text and images at the aspect level. Unlike TomBERT, SMP, and ITM, which rely on extracted image and text representations for sentiment classification, EF-CapTrBERT takes a different approach by combining text with image descriptions. In contrast, our ITMSC model not only combines text and image but also incorporates image descriptions, which yields the best performance on Twitter-2015 among the compared models. This supports our hypothesis that image descriptions contain rich semantic information and help comprehend the content of images, thereby enabling the model to align a given target with relevant image regions while reducing noise from irrelevant regions.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>In-depth Analysis</title>
<p><bold>Ablation Study</bold>. We performed ablation experiments to assess the individual contributions of the various modules to the performance of the overall ITMSC model. Specifically, we removed the Target-based Attention Module, the Image-based Attention Module, and the Target-Image Matching Module from the full ITMSC model, one at a time. The results of the ablation analysis were used to identify the best combination of modules for the ITMSC model.</p>
<p>The results are presented in <xref ref-type="table" rid="table-5">Table 5</xref>, from which we make the following observations:</p>
<p><list list-type="simple">
<list-item><label>(1)</label><p>The ITMSC model with all modules performs best. Removing any single module lowers both the accuracy and the Macro-F1 score on both datasets, which indicates that the model relies on all of its components to achieve the best possible performance.</p></list-item>
<list-item><label>(2)</label><p>Removing the Target-based Attention Module results in worse performance, with the largest drop among the three ablations on both datasets, demonstrating the importance of extracting target information in the TMSC task. When the Target-based Attention Module is removed, the model loses the ability to extract relevant contextual information for the target. This in turn affects the alignment in the Target-Image Matching Module and leads to a decrease in sentiment accuracy.</p></list-item>
<list-item><label>(3)</label><p>The removal of the Image-based Attention Module also leads to a decrease in performance, although the drop in accuracy is relatively small. This further emphasizes that images do not express emotions as directly as the text does. Nevertheless, the Image-based Attention Module is able to capture cross-modal interactions to improve our model&#x2019;s understanding of sentiment across text and visual modalities.</p></list-item>
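<list-item><label>(4)</label><p>Without the Target-Image Matching Module, performance drops markedly on Twitter-2015 (accuracy from 78.59 to 75.88 and Macro-F1 from 74.28 to 68.69), while the drop on Twitter-2017 is small (accuracy from 70.28 to 70.13 and Macro-F1 from 68.40 to 67.91). By adjusting the image contribution in the fused representation and performing a fine-grained alignment between the target and the image, this module helps the model discover information that is crucial for sentiment prediction.</p></list-item>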
</list></p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Ablation study results on two Twitter datasets</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Methods</th>
<th align="center" colspan="2">Twitter-2015</th>
<th align="center" colspan="2">Twitter-2017</th>
</tr>
<tr>
<th/>
<th>Accuracy</th>
<th>Macro-F1</th>
<th>Accuracy</th>
<th>Macro-F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>w/o Target-based attention module</td>
<td>74.41</td>
<td>68.57</td>
<td>68.34</td>
<td>65.45</td>
</tr>
<tr>
<td>w/o Image-based attention module</td>
<td>77.25</td>
<td>72.96</td>
<td>69.32</td>
<td>66.89</td>
</tr>
<tr>
<td>w/o Target-image matching module</td>
<td>75.88</td>
<td>68.69</td>
<td>70.13</td>
<td>67.91</td>
</tr>
<tr>
<td>ITMSC</td>
<td>78.59</td>
<td>74.28</td>
<td>70.28</td>
<td>68.40</td>
</tr>
</tbody>
</table>
</table-wrap>
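<p>The size of each ablation effect can be read off Table 5 by subtracting the ablated scores from those of the full model. The following sketch does this with the values copied from Table 5; the variable and function names are illustrative.</p>
<preformat>
```python
# Scores copied from Table 5: (accuracy, Macro-F1) per dataset.
full = {"Twitter-2015": (78.59, 74.28), "Twitter-2017": (70.28, 68.40)}
ablations = {
    "w/o Target-based attention module": {
        "Twitter-2015": (74.41, 68.57), "Twitter-2017": (68.34, 65.45)},
    "w/o Image-based attention module": {
        "Twitter-2015": (77.25, 72.96), "Twitter-2017": (69.32, 66.89)},
    "w/o Target-image matching module": {
        "Twitter-2015": (75.88, 68.69), "Twitter-2017": (70.13, 67.91)},
}

def drops(full_scores, ablated_scores):
    """Return the (accuracy, Macro-F1) drop per dataset, rounded to 2 decimals."""
    return {
        name: tuple(round(f - a, 2) for f, a in zip(full_scores[name], ablated_scores[name]))
        for name in full_scores
    }

ablation_drops = {variant: drops(full, scores) for variant, scores in ablations.items()}
for variant, per_dataset in ablation_drops.items():
    print(variant, per_dataset)
```
</preformat>
<p>Every difference is positive, so each ablated variant scores below the full model on both metrics and both datasets.</p>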
<p><bold>Case Study</bold>. We evaluated the performance of our ITMSC model in targeted multimodal sentiment analysis and compared it with two other models, TomBERT and ITM, which also use images and text for sentiment analysis. Specifically, we examined four cases, as presented in <xref ref-type="table" rid="table-6">Table 6</xref>. In case (a), the image and text are unrelated, so there is no apparent alignment between the target <italic>LeBron James</italic> and the image. Misled by the unrelated image, TomBERT and ITM made incorrect predictions, whereas our ITMSC model predicted correctly by obtaining semantic information from the image description. In case (b), the target <italic>Stagecoach</italic> is labeled with <italic>neutral</italic> sentiment. TomBERT made an incorrect prediction, probably because it only noticed the facial expression of the person in the image, while our model and the ITM model predicted correctly by paying attention to additional image-related information such as lights, microphones, and smoke through the image description. In case (c), all three models correctly predicted the <italic>negative</italic> sentiment of the target <italic>SteveScalise</italic> based on the sentiment words and the image representation. Our model was observed to concentrate its attention on the vehicles in the image, which are semantically related to the sentiment expressed in the text. In case (d), TomBERT made an inaccurate prediction because of the irrelevant information in the image. ITM also predicted incorrectly; in its case, the contribution of the visual modality was likely restrained during feature fusion by the discriminative mechanism that suppresses the influence of irrelevant image features. Our proposed model, however, leveraged the semantic information provided by the image description to filter out visual noise and concentrate on the relevant features of the hand posture and mouth region, leading to an accurate prediction. It is worth noting that although the image description incorrectly labels the cigarette as a toothbrush, our model still managed to focus on the salient region of the image, demonstrating the efficacy of the Target-Image Matching Module.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>A case study on some multimodal sentiment samples. The correct and incorrect predictions are denoted by &#x221A; and &#x00D7;, respectively</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Visual<break/> modality</th>
<th><inline-graphic xlink:href="CMC_38220-inline-2.tif"/></th>
<th><inline-graphic xlink:href="CMC_38220-inline-3.tif"/></th>
<th><inline-graphic xlink:href="CMC_38220-inline-4.tif"/></th>
<th><inline-graphic xlink:href="CMC_38220-inline-5.tif"/></th>
</tr>
</thead>
<tbody>
<tr>
<td>Target<break/> attention</td>
<td><inline-graphic xlink:href="CMC_38220-inline-6.tif"/></td>
<td><inline-graphic xlink:href="CMC_38220-inline-7.tif"/></td>
<td><inline-graphic xlink:href="CMC_38220-inline-8.tif"/></td>
<td><inline-graphic xlink:href="CMC_38220-inline-9.tif"/></td>
</tr>
<tr>
<td>Textual modality</td>
<td><bold>LeBron James</bold> to Produce NBA Documentary.</td>
<td># SamHunt Performs at <bold>Stagecoach</bold> # MusicFestival 2016.</td>
<td><bold>SteveScalise</bold> remains in critical condition after shooting at baseball practice.</td>
<td>Petition to have <bold>Jessica Lange</bold> come back for American Horror Story season 6.</td>
</tr>
<tr>
<td>Image description</td>
<td>A group of people standing outside of a building.</td>
<td>A man is holding up a microphone to take a picture.</td>
<td>A freeway with a lot of trucks and cars.</td>
<td>A woman is holding a toothbrush in her mouth.</td>
</tr>
<tr>
<td>TomBERT</td>
<td>Positive <bold>&#x00D7;</bold></td>
<td>Positive <bold>&#x00D7;</bold></td>
<td>Negative <bold>&#x221A;</bold></td>
<td>Positive <bold>&#x00D7;</bold></td>
</tr>
<tr>
<td>ITM</td>
<td>Positive <bold>&#x00D7;</bold></td>
<td>Neutral <bold>&#x221A;</bold></td>
<td>Negative <bold>&#x221A;</bold></td>
<td>Positive <bold>&#x00D7;</bold></td>
</tr>
<tr>
<td>ITMSC</td>
<td>Neutral <bold>&#x221A;</bold></td>
<td>Neutral <bold>&#x221A;</bold></td>
<td>Negative <bold>&#x221A;</bold></td>
<td>Neutral <bold>&#x221A;</bold></td>
</tr>
<tr>
<td></td>
<td>(a)</td>
<td>(b)</td>
<td>(c)</td>
<td>(d)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we first examined the shortcomings of previously proposed strategies for targeted multimodal sentiment classification. To address these limitations, we proposed the ITMSC model, which improves targeted multimodal sentiment classification by using the semantic description of the image. The ITMSC model consists of a target-based attention module that captures target-text relevance, an image-based attention module that captures image-text relevance, and a target-image matching module that builds on the former two modules to align the target with the image so that fine-grained semantic information can be extracted. Experimental results and in-depth analysis on two datasets demonstrate the effectiveness and superiority of our model. They also show that the semantic description of the image provides supplementary information for multimodal sentiment classification, leading to more accurate predictions.</p>
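<p>To make the notion of target-text relevance concrete, the following minimal sketch computes scaled dot-product attention of a single target vector over a sequence of text-token vectors. It is an illustration under stated assumptions, not the ITMSC implementation: the function names <monospace>softmax</monospace> and <monospace>cross_attention</monospace>, the two-dimensional toy embeddings, and the single-head, projection-free form are all hypothetical simplifications.</p>
<preformat><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over a sequence.

    Returns the attention weights and the weighted sum of the values.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Hypothetical toy embeddings: one target vector and three text tokens.
target = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The target acts as the query; the text tokens act as keys and values.
weights, context = cross_attention(target, text_tokens, text_tokens)
```
]]></preformat>
<p>In this toy setting, the text token most similar to the target receives the largest weight, which is the behavior a target-based attention module is intended to exhibit.</p>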
<p>Despite the promising performance, our proposed approach still has limitations. First, our research has shown that images play an essential role in multimodal sentiment analysis and that image descriptions can provide valuable semantic information to support the analysis of the expressed sentiment. However, some image descriptions do not precisely correspond to the image&#x2019;s content, which can introduce semantic interference and reduce the accuracy of sentiment analysis. Second, such inaccurate descriptions also distort the semantic similarity computed between the text and the image description, which further degrades the overall performance of the model.</p>
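<p>The effect of an inaccurate description on text-description similarity can be illustrated with a small sketch. It uses bag-of-words cosine similarity as a deliberately simple stand-in for whatever similarity measure a model employs; this is an assumption made for illustration, not the measure used in ITMSC. The example sentence and the &#x201C;accurate&#x201D; caption are hypothetical, while the noisy caption is the one shown for case (d) in <xref ref-type="table" rid="table-6">Table 6</xref>.</p>
<preformat><![CDATA[
```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lower-case the sentence and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two sentences."""
    vec_a, vec_b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical text that mentions the object the captioner got wrong.
text = "She lights a cigarette in the new season."
# Hypothetical accurate caption versus the noisy caption from Table 6.
accurate_caption = "A woman is holding a cigarette in her mouth."
noisy_caption = "A woman is holding a toothbrush in her mouth."

sim_accurate = cosine_similarity(text, accurate_caption)
sim_noisy = cosine_similarity(text, noisy_caption)
```
]]></preformat>
<p>Here a single mislabeled object lowers the similarity score, so any component that relies on this score inherits the captioning error.</p>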
<p>In future work, we plan to construct a model that exploits the advantages of Vision-Language Pre-Trained Models to analyze sentiment more accurately than existing models. Furthermore, we need to develop an algorithm that can consistently and accurately describe the content of an image. Such an algorithm should consider both the visual elements of the image and any associated contextual information, so that the generated description faithfully reflects the image content. This should ultimately make the model&#x2019;s output more reliable and accurate.</p>
</sec>
</body>
<back>
<sec><title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Deep learning for sentiment analysis: A survey</article-title>,&#x201D; <source>WIREs Data Mining and Knowledge Discovery</source>, vol. <volume>8</volume>, no. <issue>4</issue>, pp. <fpage>e1253</fpage>, <year>2018</year>. <pub-id pub-id-type="doi">10.1002/widm.1253</pub-id></mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Imran</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Ofli</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Caragea</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Torralba</surname></string-name></person-group>, &#x201C;<article-title>Using AI and social media multimodal content for disaster response and management: Opportunities, challenges, and future directions</article-title>,&#x201D; <source>Information Processing &#x0026; Management</source>, vol. <volume>57</volume>, no. <issue>5</issue>, pp. <fpage>102261</fpage>, <year>2020</year>. <pub-id pub-id-type="doi">10.1016/j.ipm.2020.102261</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Gandhi</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Adhvaryu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Poria</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Cambria</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Hussain</surname></string-name></person-group>, &#x201C;<article-title>Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions</article-title>,&#x201D; <source>Information Fusion</source>, vol. <volume>91</volume>, no. <issue>3</issue>, pp. <fpage>424</fpage>&#x2013;<lpage>444</lpage>, <year>2023</year>. <pub-id pub-id-type="doi">10.1016/j.inffus.2022.09.025</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis and opinion mining</article-title>,&#x201D; <source>Synthesis Lectures on Human Language Technologies</source>, vol. <volume>5</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>167</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>W. J.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>G. D.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Multi-interactive memory network for aspect based multimodal sentiment analysis</article-title>,&#x201D; in <source>Proc. of the AAAI Conf. on Artificial Intelligence</source>, vol. <volume>33</volume>, no. <issue>1</issue>, <publisher-loc>Honolulu, Hawaii, USA</publisher-loc>, pp. <fpage>371</fpage>&#x2013;<lpage>378</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. F.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Xia</surname></string-name></person-group>, &#x201C;<article-title>Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>28</volume>, pp. <fpage>429</fpage>&#x2013;<lpage>439</lpage>, <year>2020</year>. <pub-id pub-id-type="doi">10.1109/TASLP.2019.2957872</pub-id></mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J. F.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name></person-group>, &#x201C;<article-title>Adapting BERT for target-oriented multimodal sentiment classification</article-title>,&#x201D; in <conf-name>Electronic Proc. of IJCAI 2019</conf-name>, <publisher-loc>Macao, China</publisher-loc>, pp. <fpage>5408</fpage>&#x2013;<lpage>5414</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Khan</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Fu</surname></string-name></person-group>, &#x201C;<article-title>Exploiting BERT for multimodal target sentiment classification through input space translation</article-title>,&#x201D; in <conf-name>Proc. of the 29th ACM Int. Conf. on Multimedia</conf-name>, <conf-loc>Virtual Event, China</conf-loc>, pp. <fpage>3034</fpage>&#x2013;<lpage>3042</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Pang</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>Opinion mining and sentiment analysis</article-title>,&#x201D; <source>Foundations and Trends&#x00AE; in Information Retrieval</source>, vol. <volume>2</volume>, no. <issue>1&#x2013;2</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>135</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Agarwal</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Xie</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Vovsha</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Rambow</surname></string-name> and <string-name><given-names>R. J.</given-names> <surname>Passonneau</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis of Twitter data</article-title>,&#x201D; in <conf-name>Proc. of the Workshop on Language in Social Media (LSM 2011)</conf-name>, <publisher-loc>Portland, Oregon</publisher-loc>, pp. <fpage>30</fpage>&#x2013;<lpage>38</lpage>, <year>2011</year>. </mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Fang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Zhan</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis using product review data</article-title>,&#x201D; <source>Journal of Big Data</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>14</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G. X.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y. T.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>X. Y.</given-names> <surname>Qiu</surname></string-name>, <string-name><given-names>Z. H.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Wu</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis of comment texts based on BiLSTM</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>51522</fpage>&#x2013;<lpage>51532</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Singh</surname></string-name>, <string-name><given-names>A. K.</given-names> <surname>Jakhar</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Pandey</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis on the impact of coronavirus in social life using the BERT model</article-title>,&#x201D; <source>Social Network Analysis and Mining</source>, vol. <volume>11</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>11</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>L. D.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>J. F.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>S. C.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>H. T.</given-names> <surname>Liu</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Multimodal sentiment analysis with image-text interaction network</article-title>,&#x201D; <source>IEEE Transactions on Multimedia</source>, pp. <fpage>1</fpage>, <year>2022</year>. <pub-id pub-id-type="doi">10.1109/TMM.2022.3160060</pub-id></mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. F.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>L. M.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Hu</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Fang</surname></string-name></person-group>, &#x201C;<article-title>Recognizing semantic correlation in image-text weibo via feature space mapping</article-title>,&#x201D; <source>Computer Vision and Image Understanding</source>, vol. <volume>163</volume>, no. <issue>5</issue>, pp. <fpage>58</fpage>&#x2013;<lpage>66</lpage>, <year>2017</year>. <pub-id pub-id-type="doi">10.1016/j.cviu.2017.04.012</pub-id></mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>W. J.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>G. D.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>A co-memory network for multimodal sentiment analysis</article-title>,&#x201D; in <conf-name>The 41st Int. ACM SIGIR Conf. on Research &#x0026; Development in Information Retrieval</conf-name>, <publisher-loc>New York, NY, United States</publisher-loc>, pp. <fpage>929</fpage>&#x2013;<lpage>932</lpage>, <year>2018</year>. </mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. Y.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>H. Y.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Z. H.</given-names> <surname>Xue</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Tian</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>An image-text consistency driven multimodal sentiment analysis approach for social media</article-title>,&#x201D; <source>Information Processing &#x0026; Management</source>, vol. <volume>56</volume>, no. <issue>6</issue>, pp. <fpage>102097</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Z. X.</given-names> <surname>Zeng</surname></string-name> and <string-name><given-names>W. J.</given-names> <surname>Mao</surname></string-name></person-group>, &#x201C;<article-title>Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association</article-title>,&#x201D; in <conf-name>Proc. of the 58th Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Online</publisher-loc>, pp. <fpage>3777</fpage>&#x2013;<lpage>3786</lpage>, <year>2020</year>. </mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D. Y.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>M. Y.</given-names> <surname>Kan</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Cui</surname></string-name></person-group>, &#x201C;<article-title>Understanding and classifying image tweets</article-title>,&#x201D; in <conf-name>Proc. of the 21st ACM Int. Conf. on Multimedia</conf-name>, <publisher-loc>Barcelona, Spain</publisher-loc>, pp. <fpage>781</fpage>&#x2013;<lpage>784</lpage>, <year>2013</year>. </mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>H.</given-names> <surname>SalahEldeen</surname></string-name></person-group>, &#x201C;<article-title>Velda: Relating an image tweet&#x2019;s text and images</article-title>,&#x201D; in <conf-name>Twenty-Ninth AAAI Conf. on Artificial Intelligence</conf-name>, <publisher-loc>Austin, USA</publisher-loc>, vol. <volume>29</volume>, pp. <fpage>1</fpage>, <year>2015</year>. </mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vempala</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Preo&#x0163;iuc-Pietro</surname></string-name></person-group>, &#x201C;<article-title>Categorizing and inferring the relationship between the text and image of Twitter posts</article-title>,&#x201D; in <conf-name>Proc. of the 57th Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Florence, Italy</publisher-loc>, pp. <fpage>2830</fpage>&#x2013;<lpage>2840</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. J.</given-names> <surname>Ye</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>J. F.</given-names> <surname>Tian</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J. Y.</given-names> <surname>Zhou</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Sentiment-aware multimodal pre-training for multimodal sentiment analysis</article-title>,&#x201D; <source>Knowledge-Based Systems</source>, vol. <volume>258</volume>, pp. <fpage>110021</fpage>, <year>2022</year>. <pub-id pub-id-type="doi">10.1016/j.knosys.2022.110021</pub-id></mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J. F.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J. M.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Xia</surname></string-name> and <string-name><given-names>J. J.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Targeted multimodal sentiment classification based on coarse-to-fine grained image-target matching</article-title>,&#x201D; in <conf-name>Proc. of the Thirty-First Int. Joint Conf. on Artificial Intelligence, IJCAI 2022</conf-name>, <publisher-loc>Vienna, Austria</publisher-loc>, pp. <fpage>4482</fpage>&#x2013;<lpage>4488</lpage>, <year>2022</year>. </mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name></person-group>, &#x201C;<article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,&#x201D; <comment>arXiv preprint arXiv:1810.04805</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>J. H.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>A BERT-based aspect-level sentiment analysis algorithm for cross-domain text</article-title>,&#x201D; <source>Computational Intelligence and Neuroscience</source>, vol. <volume>2022</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>11</lpage>, <year>2022</year>. <pub-id pub-id-type="doi">10.1155/2022/8726621</pub-id>; <pub-id pub-id-type="pmid">35795761</pub-id></mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K. M.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X. Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S. Q.</given-names> <surname>Ren</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Deep residual learning for image recognition</article-title>,&#x201D; in <conf-name>Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <publisher-loc>Las Vegas, USA</publisher-loc>, pp. <fpage>770</fpage>&#x2013;<lpage>778</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Russakovsky</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Su</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Krause</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Satheesh</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>ImageNet large scale visual recognition challenge</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>115</volume>, no. <issue>3</issue>, pp. <fpage>211</fpage>&#x2013;<lpage>252</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. H. H.</given-names> <surname>Tsai</surname></string-name>, <string-name><given-names>S. J.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>P. P.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>J. Z.</given-names> <surname>Kolter</surname></string-name>, <string-name><given-names>L. P.</given-names> <surname>Morency</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Multimodal transformer for unaligned multimodal language sequences</article-title>,&#x201D; in <conf-name>Proc. of the 57th Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Florence, Italy</publisher-loc>, pp. <fpage>6558</fpage>&#x2013;<lpage>6569</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Jawahar</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Sagot</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Seddah</surname></string-name></person-group>, &#x201C;<article-title>What does BERT learn about the structure of language?</article-title>,&#x201D; in <conf-name>ACL 2019-57th Annual Meeting of the Association for Computational Linguistics</conf-name>, <publisher-loc>Florence, Italy</publisher-loc>, pp. <fpage>3651</fpage>&#x2013;<lpage>3657</lpage>, <year>2019</year>. </mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. Q.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>M. L.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>X. Y.</given-names> <surname>Zhu</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>Attention-based LSTM for aspect-level sentiment classification</article-title>,&#x201D; in <conf-name>Proc. of the 2016 Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Austin, USA</publisher-loc>, pp. <fpage>606</fpage>&#x2013;<lpage>615</lpage>, <year>2016</year>. </mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Qin</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Aspect level sentiment classification with deep memory network</article-title>,&#x201D; <comment>arXiv preprint arXiv:1605.08900</comment>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Z. Q.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>L. D.</given-names> <surname>Bing</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Recurrent attention network on memory for aspect sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. of the 2017 Conf. on Empirical Methods in Natural Language Processing</conf-name>, <publisher-loc>Copenhagen, Denmark</publisher-loc>, pp. <fpage>452</fpage>&#x2013;<lpage>461</lpage>, <year>2017</year>. </mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. S.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Batra</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>32</volume>, pp. <fpage>13</fpage>&#x2013;<lpage>23</lpage>, <year>2019</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>