<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">55943</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.055943</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Text-Image Feature Fine-Grained Learning for Joint Multimodal Aspect-Based Sentiment Analysis</article-title>
<alt-title alt-title-type="left-running-head">Text-Image Feature Fine-Grained Learning for Joint Multimodal Aspect-Based Sentiment Analysis</alt-title>
<alt-title alt-title-type="right-running-head">Text-Image Feature Fine-Grained Learning for Joint Multimodal Aspect-Based Sentiment Analysis</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Tianzhi</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhou</surname><given-names>Gang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>gzhougzhou@126.com</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Shuang</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Shunhang</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Sun</surname><given-names>Yepeng</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Pi</surname><given-names>Qiankun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-7" contrib-type="author">
<name name-style="western"><surname>Liu</surname><given-names>Shuo</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Data and Target Engineering, Information Engineering University</institution>, <addr-line>Zhengzhou, 450001</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Information Engineering Department, Liaoning Provincial College of Communications</institution>, <addr-line>Shenyang, 110122</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Computer and Artificial Intelligence, Zhengzhou University</institution>, <addr-line>Zhengzhou, 450000</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Gang Zhou. Email: <email>gzhougzhou@126.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>03</day><month>1</month><year>2025</year>
</pub-date>
<volume>82</volume>
<issue>1</issue>
<fpage>279</fpage>
<lpage>305</lpage>
<history>
<date date-type="received">
<day>10</day>
<month>7</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>10</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_55943.pdf"></self-uri>
<abstract>
<p>Joint Multimodal Aspect-based Sentiment Analysis (JMASA) is a significant task in multimodal fine-grained sentiment analysis that combines two subtasks: Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect-oriented Sentiment Classification (MASC). Most existing JMASA models encode text and image features only at a basic level and neglect in-depth analysis of unimodal intrinsic features; this insufficient learning of intra-modal features may lead to low accuracy in aspect term extraction and poor sentiment prediction. To address this problem, we propose a Text-Image Feature Fine-grained Learning (TIFFL) model for JMASA. First, we construct an enhanced adjacency matrix of word dependencies and adopt a graph convolutional network to learn the syntactic structure features of text, which addresses the context interference problem in identifying different aspect terms. Then, adjective-noun pairs extracted from the image are introduced to make the semantic representation of visual features more intuitive, which addresses the problem of ambiguous semantic extraction during image feature learning. Together, these steps improve model performance on aspect term extraction and sentiment polarity prediction. Experiments on two Twitter benchmark datasets demonstrate that TIFFL achieves competitive results on JMASA, MATE and MASC, validating the effectiveness of the proposed methods.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Multimodal sentiment analysis</kwd>
<kwd>aspect-based sentiment analysis</kwd>
<kwd>feature fine-grained learning</kwd>
<kwd>graph convolutional network</kwd>
<kwd>adjective-noun pairs</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Science and Technology Project of Henan Province</funding-source>
<award-id>222102210081</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>With the wide adoption of intelligent mobile terminals around the world, more and more people publish information that includes multiple modalities, such as text and image, to express their opinions and sentiments on various events and topics [<xref ref-type="bibr" rid="ref-1">1</xref>]. This trend has made sentiment analysis one of the most popular research tasks at present. However, capturing and fusing information from different modalities poses new challenges for sentiment analysis, which has given rise to the emerging research area of Multimodal Sentiment Analysis (MSA) [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>]. In recent years, MSA has attracted wide attention from academia, businesses, governmental organizations, and public services because it improves the accuracy of sentiment analysis and the comprehensiveness of sentiment understanding. Multimodal Aspect-Based Sentiment Analysis (MABSA), in turn, aims to analyze the sentiment of the aspect terms in each sample, where aspect terms are specific noun phrases in the text and each text may contain a variable number of them.</p>
<p>Building on the MABSA task, researchers have further proposed Joint Multimodal Aspect-based Sentiment Analysis (JMASA), which can be divided into two subtasks: Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect-oriented Sentiment Classification (MASC). <xref ref-type="table" rid="table-1">Table 1</xref> shows two representative examples of JMASA. <xref ref-type="table" rid="table-1">Table 1</xref>(a) extracts the aspect terms &#x201C;Ashford Town Ladies&#x201D; and &#x201C;Newquay&#x201D; by combining text and image semantics, and predicts their positive and neutral sentiments, respectively, from the context in the text and the scene in the image. <xref ref-type="table" rid="table-1">Table 1</xref>(b) likewise extracts the aspect terms &#x201C;Stephen Curry&#x201D; and &#x201C;NBA&#x201D; from text and image semantics, and infers their positive and neutral sentiments, respectively, by combining the words in the text with the facial expression and the NBA scene in the image.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Two representative examples of JMASA</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<tbody>
<tr>
<td>Image</td>
<td><inline-graphic xlink:href="CMC_55943-inline-1.tif"/></td>
<td><inline-graphic xlink:href="CMC_55943-inline-2.tif"/></td>
</tr>
<tr>
<td>Text</td>
<td>(a) Congratulations to Ashford town ladies winners of the 2016 Newquay tournament # footballtour # newquaysixes</td>
<td>(b) Stephen Curry just played the best overtime in # NBA history - SB Nation</td>
</tr>
<tr>
<td>Output</td>
<td>(Ashford town ladies, Positive)</td>
<td>(Stephen Curry, Positive)</td>
</tr>
<tr>
<td/>
<td>(Newquay, Neutral)</td>
<td>(NBA, Neutral)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As a significant multimodal fine-grained sentiment analysis task, JMASA has received extensive academic attention, and numerous methods have been proposed in recent years. For example, Ju et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] controlled the rational utilization of visual information by employing a text-image relation detection method, Ling et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] simplified all pretraining and downstream tasks by designing a unified multimodal encoder-decoder architecture, Yang et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] enhanced model performance on the target tasks by setting auxiliary supervision for text and image, respectively, and Wang et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] bridged the semantic gap between text and image representations by optimizing the scalar weight that balances their features. However, most current models perform only basic text and image feature encoding and neglect further analysis of unimodal intrinsic features; this insufficient learning of unimodal features may lead to low accuracy in aspect term extraction and poor sentiment prediction. Furthermore, JMASA treats text as the dominant modality and image as an auxiliary modality. The image can provide important clues that assist the semantic expression of the text, but image information unrelated to the text semantics may also introduce extra noise, so the inter-modal feature fusion strategy is another key factor affecting overall model performance.</p>
<p>To address the above problems, we propose a JMASA model named Text-Image Feature Fine-Grained Learning (TIFFL). First, we adopt pretrained text and image encoders to obtain unimodal feature representations of each text-image multimodal sample. Second, a gating mechanism is constructed to prevent visual features unrelated to the text semantics from interfering with the model. Then, we introduce a Graph Convolutional Network (GCN) [<xref ref-type="bibr" rid="ref-8">8</xref>] and Adjective-Noun Pairs (ANPs) [<xref ref-type="bibr" rid="ref-9">9</xref>] to better learn and represent the intrinsic features of text and image, respectively. Finally, an inter-modal fusion strategy is designed to generate the final representations of text and image features, which are used for aspect term extraction and sentiment polarity prediction. The main contributions of this work are as follows:
<list list-type="bullet">
<list-item>
<p>To promote the effective fusion of text and image information, a multimodal feature correlation discrimination module is proposed. It constructs a gating mechanism for the dynamic input of visual features by calculating the degree of correlation between text and image semantics, thereby preventing image information unrelated to the text semantics from introducing extra noise.</p></list-item>
<list-item>
<p>To further enhance the learning and representation of text and image intrinsic features, we adopt a GCN to obtain the syntactic structure features of text, which addresses the context interference problem in identifying different aspect terms by raising the attention paid to noun phrases and calculating the sentiment scores between dependent words. We also introduce image ANPs to make the visual semantic representation more intuitive, thus addressing the problem of ambiguous image semantic extraction.</p></list-item>
<list-item>
<p>Experimental results on two Twitter benchmark datasets show that our model outperforms most related unimodal and multimodal methods, with competitive results on the JMASA task as well as on the two subtasks MATE and MASC.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Early sentiment analysis studies were mostly conducted on unimodal forms such as text and image [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-12">12</xref>]. Over the years, MSA has emerged as a crucial research area in sentiment analysis, while JMASA has been further advanced and refined based on the MABSA task.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Graph Neural Network (GNN)</title>
<p>Graph Neural Networks (GNNs) have achieved excellent results in many Natural Language Processing (NLP) tasks, including aspect-based sentiment analysis. Zhang et al. [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed a GCN based on the syntactic dependency tree to obtain the contextual syntactic information and word dependencies of aspect terms. Huang et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a target-dependent Graph Attention Network (GAT) to learn the sentiment information of aspect terms by exploring contextual word dependencies. Sun et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] stacked a GCN layer on a Long Short-Term Memory (LSTM) network [<xref ref-type="bibr" rid="ref-16">16</xref>]: a Bidirectional LSTM (BiLSTM) network learns the contextual features of text, and convolutions over a dependency tree then extract richer representations. Tang et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] jointly considered flat and graph-based representations in an iterative interaction manner through a dependency graph enhanced dual-transformer network. Wang et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed an aspect-oriented tree network that focuses on aspect terms by reshaping and pruning ordinary dependency trees. However, the above methods ignore the sentiment information between context words and aspect terms, which can directly reflect the sentiment expressed toward a specific aspect term in the text.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Multimodal Sentiment Analysis (MSA)</title>
<p>MSA, which builds models by combining text with non-text information, has received considerable academic attention over the last few years [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>] and is typically divided into two subtasks: conversation MSA and social media MSA. In conversation MSA, current studies mainly model the interactions among modalities with deep learning methods such as LSTM, Gated Recurrent Unit (GRU) [<xref ref-type="bibr" rid="ref-21">21</xref>], Convolutional Neural Network (CNN) [<xref ref-type="bibr" rid="ref-22">22</xref>] and Transformer [<xref ref-type="bibr" rid="ref-23">23</xref>], which have demonstrated superior performance in multiple sentiment-related tasks such as sentiment analysis [<xref ref-type="bibr" rid="ref-24">24</xref>&#x2013;<xref ref-type="bibr" rid="ref-26">26</xref>], emotion analysis [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>] and sarcasm detection [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>]. In social media MSA, major studies include social media image sentiment analysis [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-31">31</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>] and text-image integrated sentiment analysis [<xref ref-type="bibr" rid="ref-33">33</xref>&#x2013;<xref ref-type="bibr" rid="ref-35">35</xref>]. While these studies are applicable to coarse-grained global sentiment analysis, they cannot be directly applied to fine-grained tasks.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Multimodal Aspect-Based Sentiment Analysis (MABSA)</title>
<p>To effectively exploit information from different modalities for aspect-based sentiment analysis, researchers have proposed numerous MABSA models in the past few years, drawing on methods from various text and image tasks. Xu et al. [<xref ref-type="bibr" rid="ref-36">36</xref>] first discussed the multimodal fine-grained sentiment analysis task and proposed MIMN, a text-image interaction model based on a multi-interactive BiLSTM network that can also be extended to the MABSA task; they also built an e-commerce comment dataset, ZOL, for the experiments. Yu et al. [<xref ref-type="bibr" rid="ref-37">37</xref>] proposed ESAFN, an entity-sensitive attention and fusion network that captures the aspect-text and aspect-image relations, and constructed two Twitter benchmark datasets, Twitter-15 and Twitter-17. Yu et al. [<xref ref-type="bibr" rid="ref-38">38</xref>] proposed TomBERT, an architecturally improved model based on the pretrained language model BERT [<xref ref-type="bibr" rid="ref-39">39</xref>], which achieved significant performance gains and has been cited and refined by multiple subsequent studies. Khan et al. [<xref ref-type="bibr" rid="ref-40">40</xref>] proposed CapBERT, a cross-modal translation model that converts image semantics into captions and predicts the sentiment polarity from text information only. Zhao et al. [<xref ref-type="bibr" rid="ref-41">41</xref>] proposed KEF, an ANPs-based knowledge enhancement framework, and incorporated it into various models to improve their visual attention and sentiment prediction capabilities. Although these methods are applicable to the sentiment analysis of given aspect terms, aspect terms are generally not given in practice, so multimodal aspect term extraction has become a prerequisite for the corresponding sentiment analysis.</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Joint Multimodal Aspect-Based Sentiment Analysis (JMASA)</title>
<p>JMASA is a significant research task proposed in recent years, which is formed by integrating MATE into MABSA. For the MATE task, Wu et al. [<xref ref-type="bibr" rid="ref-42">42</xref>] proposed a region-aware alignment network model RAN that extracts aspect terms by aligning text-image entity regions. Yu et al. [<xref ref-type="bibr" rid="ref-43">43</xref>] proposed a Multimodal Named Entity Recognition (MNER) model UMT based on entity span detection, which dynamically captures the text-image information association through cross-modal feature interaction. Wu et al. [<xref ref-type="bibr" rid="ref-44">44</xref>] proposed an MNER model OSCGA based on text-image entity alignment, which achieves entity prediction by designing a neural network that combines image object and text character information. Jia et al. [<xref ref-type="bibr" rid="ref-45">45</xref>] proposed an MNER model MNER-QG based on an end-to-end machine reading comprehension framework, which provides prior knowledge of entity types and visual regions to enhance the text and image representations. For the JMASA task, Ju et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] first proposed an auxiliary cross-modal relation detection model JML, which controls the rational utilization of visual information by designing a text-image relation detection method. Ling et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] proposed a task-specific vision-language pretraining model VLP-MABSA, which simplifies all pretraining and downstream tasks by introducing a unified multimodal encoder-decoder architecture. Yang et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] proposed a multi-task learning cross-modal Transformer model CMMT, which enhances model performance by constructing auxiliary supervision modules for text and image, respectively. Wang et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed a self-adaptive attention fusion model SAAF, which bridges the semantic gap between text and image representations by adjusting the scalar weight that balances their features. Although the above studies have been shown to be effective on JMASA, they typically employ pretrained language and vision encoders to encode text and image features only at a basic level and neglect further analysis of unimodal intrinsic features. Therefore, we propose a novel model that performs fine-grained intra-modal feature analysis to address the low accuracy of aspect term extraction and the poor sentiment prediction caused by insufficient unimodal feature learning.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>In this section, we define the JMASA task, provide an overview of our proposed Text-Image Feature Fine-Grained Learning (TIFFL) model, and then introduce the specific workflow of each component in TIFFL.</p>
<p><bold>Task Definition:</bold> Following previous studies on joint aspect-based sentiment analysis [<xref ref-type="bibr" rid="ref-46">46</xref>&#x2013;<xref ref-type="bibr" rid="ref-48">48</xref>], we formulate the JMASA task as a sequence labeling problem over the text, adopting the <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mtext>BIO2</mml:mtext></mml:mrow></mml:math></inline-formula> tagging schema [<xref ref-type="bibr" rid="ref-49">49</xref>] so that each token is assigned one of seven labels. Specifically, given a set of input multimodal samples <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, each sample <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mi>D</mml:mi></mml:math></inline-formula> contains an <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>m</mml:mi></mml:math></inline-formula>-word text <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and a corresponding image <inline-formula 
id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>I</mml:mi></mml:math></inline-formula>. Our target task is to obtain the text label sequence <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> for each sample, where <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo></mml:math></inline-formula> {O, B-POS, I-POS, B-NEU, I-NEU, B-NEG, I-NEG}, <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:mtext>O</mml:mtext></mml:mrow></mml:math></inline-formula> denotes the non-aspect term token label, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mrow><mml:mtext>B</mml:mtext></mml:mrow></mml:math></inline-formula> denotes the beginning token label of aspect term, <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:math></inline-formula> denotes the remaining token label, and <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mrow><mml:mtext>POS</mml:mtext></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mrow><mml:mtext>NEU</mml:mtext></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mrow><mml:mtext>NEG</mml:mtext></mml:mrow></mml:math></inline-formula> denote the positive, neutral and negative sentiment labels, respectively.</p>
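<p>As an illustration of the unified tagging schema above, the following minimal sketch decodes a label sequence into (aspect term, sentiment) pairs, using the sentence and expected output of <xref ref-type="table" rid="table-1">Table 1</xref>(b). The function name <monospace>decode_bio2</monospace>, the whitespace tokenization, and the handling of an I- tag that does not continue a matching span (it starts a new span) are assumptions made for this example and are not taken from the TIFFL implementation.</p>
<code language="python"><![CDATA[
```python
from typing import List, Tuple

# The seven unified labels from the task definition.
LABELS = ["O", "B-POS", "I-POS", "B-NEU", "I-NEU", "B-NEG", "I-NEG"]
SENTIMENT = {"POS": "Positive", "NEU": "Neutral", "NEG": "Negative"}


def decode_bio2(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn a unified BIO2 tag sequence into (aspect term, sentiment) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    pairs: List[Tuple[str, str]] = []
    span: List[str] = []
    polarity = None

    def flush() -> None:
        # Close the currently open aspect span, if there is one.
        nonlocal span, polarity
        if span:
            pairs.append((" ".join(span), SENTIMENT[polarity]))
        span, polarity = [], None

    for token, tag in zip(tokens, tags):
        if tag not in LABELS:
            raise ValueError(f"unknown tag: {tag}")
        if tag == "O":
            flush()
            continue
        prefix, pol = tag.split("-")
        if prefix == "B" or not span or pol != polarity:
            # Assumed policy: an I- tag that does not continue a span
            # with the same polarity starts a new span.
            flush()
            span, polarity = [token], pol
        else:
            span.append(token)
    flush()
    return pairs


tokens = "Stephen Curry just played the best overtime in # NBA history - SB Nation".split()
tags = ["B-POS", "I-POS"] + ["O"] * 7 + ["B-NEU"] + ["O"] * 4
print(decode_bio2(tokens, tags))
```
]]></code>
<p>Because extraction and sentiment are encoded in a single label per token, one decoding pass of this kind yields the outputs of both MATE (the spans) and MASC (their polarities).</p>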
<sec id="s3_1">
<label>3.1</label>
<title>Overview</title>
<p><xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows the overall architecture of TIFFL, which is divided into four components: (1) unimodal feature encoding, (2) multimodal feature correlation discrimination, (3) intra-modal feature fine-grained analysis, and (4) inter-modal feature fusion and output.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The overall architecture of the TIFFL model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-1.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Unimodal Feature Encoding</title>
<p>Unimodal feature encoding is the basis of all subsequent tasks in multimodal work, and the extracted feature vectors enable further cross-modal interaction and fusion between them. In this section, we employ pretrained language and vision models to obtain the input text and image feature representations, respectively.</p>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Text Feature Encoding</title>
<p>To extract text features, we adopt the pretrained language model RoBERTa [<xref ref-type="bibr" rid="ref-50">50</xref>] as the text encoder; as an extension and enhancement of BERT, it has produced more favorable results in multiple NLP tasks. Specifically, we insert the two special tokens &#x201C;&#x003C;s&#x003E;&#x201D; and &#x201C;&#x003C;/s&#x003E;&#x201D; at the beginning and end of the input text <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>S</mml:mi></mml:math></inline-formula> to form <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, then feed <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> into RoBERTa to obtain context-aware token representations:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>o</mml:mi><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>d</mml:mi></mml:math></inline-formula> is the hidden dimension of text representation, and <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>m</mml:mi></mml:math></inline-formula> is the length of <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Image Feature Encoding</title>
<p>To extract image features, we adopt the pretrained vision model Residual Network (ResNet) [<xref ref-type="bibr" rid="ref-51">51</xref>] as the image encoder, whose residual connections mitigate the vanishing gradient problem as the number of layers increases. Compared with the VGG [<xref ref-type="bibr" rid="ref-52">52</xref>] network widely adopted in earlier related studies, ResNet enables a deeper extraction of image semantic information. Specifically, we resize the input image <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>I</mml:mi></mml:math></inline-formula> to 224 &#x00D7; 224 pixels as <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msup><mml:mi>I</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, then obtain the visual representation from the last convolution layer of the pretrained 152-layer ResNet:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>I</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2048</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>49</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, 49 is the count of 7 &#x00D7; 7 equal-size vision blocks, and 2048 is the dimension of a vision block. Considering the subsequent cross-modal interactions, we map text and image representations into the same semantic space and perform a linear transformation on <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to generate the final image representation:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2048</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the learnable linear transformation parameters.</p>
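<p>As a minimal sketch of the linear map in Eq. (3), the following plain-Python function applies a weight matrix and a bias to a channel-by-block feature matrix, broadcasting the bias over the block axis. The function name and the toy sizes (3 channels, 4 blocks, d = 2 instead of 2048, 49, and d) are illustrative assumptions, not part of the original method description.</p>
<preformat preformat-type="code"><![CDATA[
```python
def project_image_features(h_i, w_v, b_v):
    """Compute H_V = W_V H_I + b_V, broadcasting b_V over the block axis.

    h_i: C x P matrix (C channels, P vision blocks), given as a list of rows.
    w_v: d x C weight matrix; b_v: bias vector of length d.
    Returns a d x P matrix.
    """
    num_channels = len(h_i)
    num_blocks = len(h_i[0])
    return [
        [sum(w_v[r][c] * h_i[c][p] for c in range(num_channels)) + b_v[r]
         for p in range(num_blocks)]
        for r in range(len(w_v))
    ]


# Toy input: 3 channels and 4 vision blocks (hypothetical values).
h_i = [[1.0, 2.0, 3.0, 4.0],
       [0.0, 1.0, 0.0, 1.0],
       [2.0, 2.0, 2.0, 2.0]]
w_v = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]
b_v = [0.5, -1.0]

h_v = project_image_features(h_i, w_v, b_v)
```
]]></preformat>
<p>With the sizes stated in the text, the same computation would map a 2048 &#x00D7; 49 matrix to a d &#x00D7; 49 matrix, so every vision block ends up in the same d-dimensional space as the text tokens.</p>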
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Multimodal Feature Correlation Discrimination</title>
<p>Although images can provide the model with information beyond the text, our aim is to extract aspect terms and predict sentiment polarities with the assistance of the image. Image information unrelated to the text semantics not only fails to help with this task but may also introduce extra noise. In this component, we design a Multimodal Feature Correlation Discrimination (MFCD) module that improves the fusion of text and image information by constructing an Image Gating Mechanism (IGM) for visual features. The internal architecture of MFCD is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref> and consists of two layers: (1) cross-modal feature interaction and (2) image gating mechanism construction. Compared with most existing methods, which directly integrate inter-modal information, MFCD promotes effective fusion by calculating the semantic correlation between text and image and filtering the image features accordingly.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The workflow of the MFCD module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-2.tif"/>
</fig>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Cross-Modal Feature Interaction</title>
<p>To learn how text features are represented in the image, we employ the Multi-head Cross-modal Attention (MCATT) [<xref ref-type="bibr" rid="ref-53">53</xref>] mechanism, whose multiple attention heads focus on different features to capture complex multimodal associations. We treat the image representation <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as the Query (Q) and the text representation <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as the Key (K) and Value (V), and then obtain the image-aware text representation <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> through two Layer Normalization (LN) [<xref ref-type="bibr" rid="ref-54">54</xref>] operations and one Feed-Forward Network (FFN) [<xref ref-type="bibr" rid="ref-17">17</xref>]:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>=</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>M</mml:mi><mml:mi>C</mml:mi><mml:mi>A</mml:mi><mml:mi>T</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>=</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>49</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>. However, because <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is treated as Q in this MCATT, each vector of <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is represented in the form of a vision block. For the subsequent construction of the gating mechanism, it is necessary to convert each vector into a token-level representation.
Therefore, we employ another MCATT by treating <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as Q, <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as K and V to generate the final image-aware text representation <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
<p>For learning the image feature representation of each token in text, we employ the same cross-modal interaction method as the MCATT described above, and treat <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as Q, <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as K and V to generate the text-aware image representation <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
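<p>The three attention passes described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the MCATT used in the paper: it uses a single head, omits the learned Q/K/V projections, replaces the FFN with a parameter-free ReLU stand-in, and stores each representation as one row per token or vision block (the transpose of the layout in the text). All function names and toy values are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def cross_attention(query, key_value):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(query[0])
    key_transposed = [list(col) for col in zip(*key_value)]
    scores = matmul(query, key_transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, key_value)


def toy_ffn(row):
    """Stand-in for the FFN: a plain ReLU (the real FFN has learned weights)."""
    return [max(0.0, v) for v in row]


def interaction_block(query, key_value):
    """Residual attention plus LN as in Eq. (4), then residual FFN plus LN as in Eq. (5)."""
    attended = cross_attention(query, key_value)
    z = [layer_norm([q + a for q, a in zip(q_row, a_row)])
         for q_row, a_row in zip(query, attended)]
    return [layer_norm([v + f for v, f in zip(row, toy_ffn(row))]) for row in z]


# Toy inputs: 49 vision blocks and m = 6 tokens, both with d = 4.
h_v = [[0.1 * (i + j) for j in range(4)] for i in range(49)]
h_s = [[0.2 * (i - j) for j in range(4)] for i in range(6)]

h_v_to_s = interaction_block(h_v, h_s)             # image as Q: one row per vision block
h_v_to_s_prime = interaction_block(h_s, h_v_to_s)  # text as Q: one row per token
h_s_to_v = interaction_block(h_s, h_v)             # text as Q, image as K and V
```
]]></preformat>
<p>The output always has one row per query element, which is why the second pass with the text as Q is needed to turn the 49 block-level vectors into m token-level vectors.</p>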
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Image Gating Mechanism Construction</title>
<p>In a previous MNER study, Yu et al. [<xref ref-type="bibr" rid="ref-43">43</xref>] controlled the contribution of visual features to each text token by constructing an image gate and achieved effective results in a series of experiments. Motivated by this work, we introduce an image gating mechanism that dynamically controls the contribution of image information by assigning each image feature a weight in <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> that reflects its correlation with the corresponding text. Specifically, we concatenate <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> obtained above, and then construct the gate through a linear transformation followed by a nonlinear activation:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:msubsup><mml:mi>H</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>2</mml:mn><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the learnable linear transformation parameters, and <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the element-wise nonlinear activation that controls the gating mechanism output in <inline-formula id="ieqn-48"><mml:math 
id="mml-ieqn-48"><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. The generated <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>g</mml:mi></mml:math></inline-formula> is a weight vector where all element values are between 0 and 1, with element values close to 1 in the regions of high text-image correlation and close to 0 in the regions of low correlation, thus its subsequent multiplication with image associated representations can filter out the image information unrelated to text semantics.</p>
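<p>A minimal sketch of the gate in Eq. (6) and of the subsequent element-wise filtering is given below. It assumes that the activation is a sigmoid, which is one common choice that maps to the required range but is not named explicitly in the text, and it stores one d-dimensional vector per token. The function names and the toy weights are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def sigmoid(x):
    """Logistic function; assumed here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def image_gate(text_side, image_side, w_g, b_g):
    """Compute one gate vector per token, following the form of Eq. (6).

    text_side and image_side are lists of m token vectors of size d;
    w_g has shape d x 2d and b_g has length d.
    """
    gates = []
    for t_vec, v_vec in zip(text_side, image_side):
        concat = t_vec + v_vec  # list concatenation gives a 2d-dimensional vector
        gates.append([sigmoid(sum(w * x for w, x in zip(w_row, concat)) + b)
                      for w_row, b in zip(w_g, b_g)])
    return gates


def apply_gate(gates, features):
    """Element-wise product that scales each feature by its gate value."""
    return [[g * f for g, f in zip(g_row, f_row)]
            for g_row, f_row in zip(gates, features)]


# Toy example with m = 2 tokens and d = 2 (hypothetical values).
w_g = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]
b_g = [0.0, 0.0]
h_text_side = [[4.0, -4.0], [0.0, 0.0]]
h_image_side = [[4.0, -4.0], [0.0, 0.0]]

g = image_gate(h_text_side, h_image_side, w_g, b_g)
filtered = apply_gate(g, h_image_side)
```
]]></preformat>
<p>In this toy case the first gate entry of the first token is close to 1 and the second is close to 0, so the second feature of that token is almost entirely suppressed, while a zero pre-activation yields a neutral weight of 0.5.</p>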
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Intra-Modal Feature Fine-Grained Analysis</title>
<p>Most previous JMASA methods encode text and image features only at a basic level, using pretrained language and vision encoders. Such encoders guarantee coarse-grained learning and representation of each modality's features but are insufficient for a deeper understanding of the internal structure of the text and the semantic information of the image, which may limit model performance. To address this problem, we analyze the intrinsic unimodal features in more depth by: (1) constructing Text Auxiliary Information (TAI) based on a sentiment-enhanced GCN to learn the syntactic structure features of the text; and (2) constructing Image Auxiliary Information (IAI) from ANPs to support the image semantic representation at the text level.</p>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Text Auxiliary Information Based on Sentiment-Enhanced Graph Convolutional Network</title>
<p><xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates the generation process of TAI based on the sentiment-enhanced GCN. The text in a sample may contain one or more aspect terms, and different aspect terms depend on different parts of the context. For example, given the input text &#x201C;Hosted the @MLBPDP event today with Mother Nature on our side! Dayton Moore was in the house. #TBones #FunWellDone&#x201D;, the valid context of the aspect term &#x201C;Dayton Moore&#x201D; is &#x201C;was in the house&#x201D; rather than &#x201C;on our side&#x201D; or other preceding context, so coarse-grained learning of the text alone may introduce unrelated context for aspect terms. To address this problem, we adopt a GCN to learn the syntactic structure features of the text and filter out unrelated context that may interfere with aspect term extraction and sentiment polarity prediction. Compared with other existing GNN methods, GCN processes graph data more efficiently, extracting the spatial features of the graph structure through convolution to learn the complex relationships between nodes.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The generation process of TAI based on the sentiment-enhanced GCN</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-3.tif"/>
</fig>
<p>Since aspect terms are typically noun phrases, we use the NLP tool spaCy (<ext-link ext-link-type="uri" xlink:href="https://spacy.io">https://spacy.io</ext-link>, accessed on 20 August 2024) to extract noun phrases from the text as candidate aspect terms. We then use spaCy to construct the syntactic dependency tree of the text, in which each word is a node, and obtain the corresponding adjacency matrix <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>D</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> based on the dependency relations between the words:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>=</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:mspace width="thinmathspace" /></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>j</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mspace width="thinmathspace" 
/></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>j</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mspace width="thinmathspace" /></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mrow><mml:mi mathvariant="italic">w</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>i</mml:mi></mml:math></inline-formula>th and <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>j</mml:mi></mml:math></inline-formula>th words. We then introduce the sentiment dictionary SenticNet to generate a sentiment score <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>S</mml:mi></mml:math></inline-formula> for each pair of adjacent nodes, which enhances the adjacency matrix representation:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mrow><mml:mi mathvariant="italic">S</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">n</mml:mi><mml:mi mathvariant="italic">t</mml:mi><mml:mi mathvariant="italic">i</mml:mi><mml:mi mathvariant="italic">c</mml:mi><mml:mi mathvariant="italic">N</mml:mi><mml:mi mathvariant="italic">e</mml:mi><mml:mi mathvariant="italic">t</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="italic">w</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is the sentiment score of the word <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in SenticNet, &#x2013;1 means the sentiment is negative, 1 is positive, and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mi>S</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> indicates that the sentiment polarity of <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is neutral or the word does not exist in SenticNet. 
Furthermore, because the noun phrases that may be aspect terms should receive more attention in the adjacency matrix, we additionally construct an enhancement matrix <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mi>T</mml:mi></mml:math></inline-formula> for each noun phrase:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mspace width="thinmathspace" /></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>h</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mspace width="thinmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mspace width="thinmathspace" /><mml:mrow><mml:mspace width="thinmathspace" 
/></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>h</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mrow><mml:mi mathvariant="italic">k</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> denotes the <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>k</mml:mi></mml:math></inline-formula>th noun phrase and <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>n</mml:mi></mml:math></inline-formula> is the number of noun phrases in the text. We then construct the sentiment-enhanced text adjacency matrix <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>A</mml:mi></mml:math></inline-formula>:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>&#x00D7;</mml:mo><mml:mspace width="thinmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
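<p>Eqs. (7) to (10) can be traced on a small example with the plain-Python sketch below. The dependency edges, the lexicon values, and the noun-phrase span are hand-written stand-ins for what spaCy and SenticNet would provide, and treating dependency edges as undirected is an assumption of this sketch; none of the names below come from either tool.</p>
<preformat preformat-type="code"><![CDATA[
```python
def build_adjacency(num_tokens, edges):
    """Eq. (7): D[i][j] = 1 when tokens i and j are linked in the dependency tree."""
    d = [[0] * num_tokens for _ in range(num_tokens)]
    for i, j in edges:
        d[i][j] = 1
        d[j][i] = 1  # assumption: dependency edges are treated as undirected
    return d


def sentiment_enhanced_adjacency(tokens, edges, lexicon, noun_phrases):
    """Eqs. (8)-(10): average D * (S + T^k + 1) over the n noun phrases.

    lexicon maps a word to a score in [-1, 1]; missing words score 0.
    noun_phrases is a list of sets of token indices, one set per phrase.
    """
    n = len(noun_phrases)
    if n == 0:
        raise ValueError("Eq. (10) needs at least one noun phrase")
    m = len(tokens)
    d = build_adjacency(m, edges)
    a = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if d[i][j] == 0:
                continue  # non-adjacent pairs stay at zero
            s_ij = lexicon.get(tokens[i], 0.0) + lexicon.get(tokens[j], 0.0)  # Eq. (8)
            total = 0.0
            for phrase in noun_phrases:
                t_ij = 1 if (i in phrase or j in phrase) else 0  # Eq. (9)
                total += d[i][j] * (s_ij + t_ij + 1)
            a[i][j] = total / n  # Eq. (10)
    return a


# Toy sentence with hand-written annotations (all values are hypothetical).
tokens = ["the", "food", "was", "great"]
edges = [(0, 1), (1, 2), (3, 2)]   # the-food, food-was, great-was
lexicon = {"great": 0.8}           # illustrative score, not taken from SenticNet
noun_phrases = [{0, 1}]            # "the food"

adjacency = sentiment_enhanced_adjacency(tokens, edges, lexicon, noun_phrases)
```
]]></preformat>
<p>In this example the edge between &#x201C;food&#x201D; and &#x201C;was&#x201D; receives weight 2 because it touches the noun phrase, while the edge between &#x201C;was&#x201D; and &#x201C;great&#x201D; receives 1.8 from the sentiment score alone.</p>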
<p>After obtaining this adjacency matrix, we apply a GCN to learn the syntactic structure features and sentiment dependencies of the above noun phrases based on the syntactic dependency tree:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>L</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>where <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mrow><mml:mi mathvariant="italic">i</mml:mi></mml:mrow></mml:math></inline-formula> is the current node, <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>j</mml:mi></mml:math></inline-formula> is a node adjacent to <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mi>i</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the representation of <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>j</mml:mi></mml:math></inline-formula> that is generated by the previous GCN layer and serves as the input to layer <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:mi>l</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the initial node representation in the GCN, which is the representation of the <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>j</mml:mi></mml:math></inline-formula>th token in <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the degree of <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>i</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the learnable linear transformation parameters that map the current node features to adjacent nodes. Finally, we take the output of the last GCN layer as the sentiment-enhanced text representation, which serves as the TAI:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
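<p>One GCN layer of the form in Eq. (11) can be sketched in plain Python as follows. The function name, the identity weight matrix, and the two-node graph are illustrative assumptions; a trained model would learn the weights and stack several such layers.</p>
<preformat preformat-type="code"><![CDATA[
```python
def gcn_layer(a, h_prev, w, b):
    """One GCN layer following the form of Eq. (11).

    a: m x m adjacency matrix; h_prev: m node vectors of size d;
    w: d x d weight matrix; b: bias vector of length d.
    """
    m = len(a)
    # Apply the linear map W to every node representation once.
    projected = [[sum(w_rc * x for w_rc, x in zip(w_row, h)) for w_row in w]
                 for h in h_prev]
    h_next = []
    for i in range(m):
        degree = sum(a[i])  # d_i, the sum of row i of A
        row = []
        for c in range(len(b)):
            aggregated = sum(a[i][j] * projected[j][c] for j in range(m))
            row.append(max(0.0, aggregated / (degree + 1) + b[c]))  # ReLU
        h_next.append(row)
    return h_next


# Toy graph with two connected nodes and d = 2 (hypothetical values).
a = [[0.0, 2.0],
     [2.0, 0.0]]
h0 = [[1.0, -1.0],
      [3.0, 1.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]

h1 = gcn_layer(a, h0, w, b)
```
]]></preformat>
<p>Because the sentiment score in Eq. (8) can be negative, a row sum of the adjacency matrix may in principle approach &#x2212;1, which would make the denominator close to zero; this sketch does not guard against that case.</p>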
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Image Auxiliary Information Based on Adjective-Noun Pairs</title>
<p>To enhance the sentiment information expressed in visual features, we adopt ANPs extracted from the image to enrich the image semantic representation at another level. Unlike the image representation described above, ANPs capture nouns such as people or objects appearing in the image together with the adjectives modifying these nouns, which enables image semantics to be understood at the text level. Specifically, we employ an existing visual concept detector library, DeepSentiBank [<xref ref-type="bibr" rid="ref-55">55</xref>], which can detect 2089 ANPs and their corresponding confidence scores for each image. We then choose the <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs with the highest confidence scores (Top-<inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs), concatenate them, and feed them into RoBERTa to obtain the ANPs representation:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mi>o</mml:mi><mml:mi>B</mml:mi><mml:mi>E</mml:mi><mml:mi>R</mml:mi><mml:mi>T</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>A</mml:mi><mml:mi>N</mml:mi><mml:mi>P</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>A</mml:mi><mml:mi>N</mml:mi><mml:mi>P</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x22EF;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>A</mml:mi><mml:mi>N</mml:mi><mml:mi>P</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
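<p>The Top-K selection that precedes Eq. (13) can be sketched as follows. This is a minimal illustration, not the authors&#x2019; implementation: the function names, the example ANPs and their scores, and the single-space separator used for concatenation are all assumptions, and the call to DeepSentiBank itself is not shown.</p>
<preformat>
```python
def top_k_anps(anp_scores, k):
    """Return the k ANPs with the highest confidence scores, best first."""
    ranked = sorted(anp_scores.items(), key=lambda item: item[1], reverse=True)
    return [anp for anp, _score in ranked[:k]]


def build_anp_input(anp_scores, k, separator=" "):
    """Concatenate the Top-K ANPs into one string for the text encoder.

    The separator is an assumption; the source only states that the
    selected ANPs are concatenated before being fed into RoBERTa.
    """
    return separator.join(top_k_anps(anp_scores, k))


# Hypothetical detector output for one image (names and scores are made up).
detected = {"cute dog": 0.91, "old car": 0.12, "happy child": 0.67, "dark sky": 0.34}
anp_text = build_anp_input(detected, k=2)
```
</preformat>
<p>In this sketch, <monospace>anp_text</monospace> would then be tokenized and encoded in the same way as the input sentence.</p>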
<p>However, ANPs may describe image regions unrelated to text semantics, and directly utilizing these ANPs to assist image representation could introduce considerable noise. To address this problem, we employ MCATT to perform an interaction between <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to generate the text-aware ANPs representation <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> as IAI, thus filtering out image content unrelated to text semantics as far as possible:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>M</mml:mi><mml:mi>C</mml:mi><mml:mi>A</mml:mi><mml:mi>T</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>.</p>
</sec>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Inter-Modal Feature Fusion and Output</title>
<p>In this component, we employ the image gating mechanism constructed in <xref ref-type="sec" rid="s3_3">Section 3.3</xref> to perform feature fusion on the text representation <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the text-aware image representation <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the sentiment-enhanced text representation <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the text-aware ANPs representation <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> through an effective strategy to generate the inter-modal fusion representation of text and image features.</p>
<p>Since the image is introduced as an additional modality to assist text semantic expression in this study, we multiply the gate values element-wise with the image-associated representations, thus dynamically controlling the incoming image information at the text word level and filtering out image information unrelated to text semantics so that it does not interfere with subsequent processing. Furthermore, <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as auxiliary text and image enhancement information, are not of the same magnitude as <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, so we set weights <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> for <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, respectively, to control the contributions of TAI and IAI to the inter-modal fusion representation. Finally, we fuse all corresponding text and image representations through the gating mechanism <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>g</mml:mi></mml:math></inline-formula> to obtain the inter-modal fusion representation of text and image features:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>g</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>g</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
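<p>To make the shapes in Eq. (16) concrete, the fusion can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the authors&#x2019; code: the semicolon in Eq. (16) is read as feature concatenation, the gate is taken to be one scalar per token, and the function and argument names are hypothetical.</p>
<preformat>
```python
def fuse(h_s, h_g, h_sv, h_sa, gate, alpha, beta):
    """Gated fusion of text-side and image-side token representations.

    Each of h_s, h_g, h_sv, h_sa is a list with one feature vector per token.
    gate holds one value in [0, 1] per token (an assumption of this sketch).
    """
    fused = []
    for s, g_vec, v, a, g in zip(h_s, h_g, h_sv, h_sa, gate):
        # Text side: (H_S ; alpha * H_G)
        text_part = list(s) + [alpha * x for x in g_vec]
        # Image side: (H_{S->V} ; beta * H_{S->A})
        image_part = list(v) + [beta * x for x in a]
        # Convex combination controlled by the gate value of this token.
        fused.append([(1.0 - g) * t + g * i for t, i in zip(text_part, image_part)])
    return fused


# Toy example with two tokens and two features per representation.
h_s = [[1.0, 2.0], [1.0, 2.0]]
h_g = [[4.0, 8.0], [4.0, 8.0]]
h_sv = [[3.0, 5.0], [3.0, 5.0]]
h_sa = [[8.0, 4.0], [8.0, 4.0]]
fused = fuse(h_s, h_g, h_sv, h_sa, gate=[0.0, 1.0], alpha=0.5, beta=0.25)
```
</preformat>
<p>With a gate value of 0 a token keeps only its text-side features, and with a gate value of 1 it keeps only its image-side features; intermediate values blend the two.</p>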
<p>A Conditional Random Field (CRF) [<xref ref-type="bibr" rid="ref-56">56</xref>] is a discriminative undirected graphical model that can effectively model the constraints between sequence labels, so we adopt a CRF to accomplish the aspect term extraction and sentiment polarity prediction tasks in our study. Specifically, we feed the above inter-modal fusion representation <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>H</mml:mi></mml:math></inline-formula> into the CRF to predict the text label sequence:
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mi>P</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mi>Y</mml:mi></mml:mrow></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mrow><mml:mtext>score</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the transition score from label <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to label <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> (i.e., an unnormalized measure of how likely <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is to be followed by <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>), and <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is the emission score of <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> (i.e., an unnormalized measure of how likely the output label is <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> when the input is <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>).</p>
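<p>The following sketch evaluates Eqs. (17) and (18) by brute force on a toy example. It is only meant to make the definitions concrete: the START and END boundary labels, the dictionary layout of the scores, the label names and all numeric values are assumptions of this sketch, and the emission scores are taken as already computed from the fused representation.</p>
<preformat>
```python
import itertools
import math


def path_score(emissions, transitions, labels, start="START", end="END"):
    """Eq. (18): transition scores along the padded path plus emission scores.

    Positions are 0-based here, whereas the equation counts tokens from 1.
    """
    padded = [start] + list(labels) + [end]
    transition_total = sum(transitions[(a, b)] for a, b in zip(padded, padded[1:]))
    emission_total = sum(emissions[i][label] for i, label in enumerate(labels))
    return transition_total + emission_total


def sequence_probability(emissions, transitions, labels, label_set):
    """Eq. (17): exponentiated path score normalized over all label sequences."""
    numerator = math.exp(path_score(emissions, transitions, labels))
    denominator = sum(
        math.exp(path_score(emissions, transitions, candidate))
        for candidate in itertools.product(label_set, repeat=len(labels))
    )
    return numerator / denominator


LABELS = ["O", "B-POS"]
# Toy scores (made up): every transition scores 0 except O -> B-POS.
transitions = {(a, b): 0.0 for a in ["START"] + LABELS for b in LABELS + ["END"]}
transitions[("O", "B-POS")] = 1.0
# One dictionary of emission scores per token.
emissions = [{"O": 2.0, "B-POS": 0.0}, {"O": 0.0, "B-POS": 2.0}]

best = max(
    itertools.product(LABELS, repeat=2),
    key=lambda seq: sequence_probability(emissions, transitions, seq, LABELS),
)
```
</preformat>
<p>Enumerating every candidate sequence is exponential in the sentence length, so it is only suitable for illustration; practical CRF layers compute the same normalizer with the forward algorithm and decode with the Viterbi algorithm.</p>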
<p>To optimize all model parameters, we employ the cross-entropy loss between the predicted and gold text label sequences as the training loss function on JMASA:
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>D</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>D</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:munderover><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>In this section, we perform a series of experiments on two Twitter benchmark datasets to demonstrate the effectiveness of our Text-Image Feature Fine-Grained Learning (TIFFL) model, and compare its performance with representative methods from recent years.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Setup</title>
<p><bold>Datasets:</bold> With the increasing use of social platforms such as Twitter and Facebook in people&#x2019;s daily lives, analyzing social media data has become a major research trend in academia. Twitter-15 and Twitter-17 are two benchmark datasets for JMASA built by Yu et al. [<xref ref-type="bibr" rid="ref-37">37</xref>]. They are sampled from tweets containing text and image posted on the Twitter social media platform in 2014&#x2013;2015 and 2016&#x2013;2017, respectively, and are labeled with the <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi>B</mml:mi><mml:mi>I</mml:mi><mml:mi>O</mml:mi><mml:mn>2</mml:mn></mml:math></inline-formula> tagging schema described in Task Definition. There are 3502 and 2910 total texts as well as 8288 and 4819 total images for Twitter-15 and Twitter-17, respectively. The detailed statistics for the Twitter datasets are shown in <xref ref-type="table" rid="table-2">Table 2</xref> (where Pos, Neu and Neg are the counts of aspect terms labeled positive, neutral and negative, Total aspects is the total count of aspect terms, and Sentence is the count of texts).</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>The basic statistics for two Twitter benchmark datasets</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
</colgroup>
<thead>
<tr>
<th rowspan="2"></th>
<th colspan="3" align="center">Twitter-15</th>
<th colspan="3" align="center">Twitter-17</th>
</tr>
<tr>
<th>Train</th>
<th>Dev</th>
<th>Test</th>
<th>Train</th>
<th>Dev</th>
<th>Test</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pos</td>
<td>928</td>
<td>303</td>
<td>317</td>
<td>1508</td>
<td>515</td>
<td>493</td>
</tr>
<tr>
<td>Neu</td>
<td>1883</td>
<td>670</td>
<td>607</td>
<td>1638</td>
<td>517</td>
<td>573</td>
</tr>
<tr>
<td>Neg</td>
<td>368</td>
<td>149</td>
<td>113</td>
<td>416</td>
<td>144</td>
<td>168</td>
</tr>
<tr>
<td>Total aspects</td>
<td>3179</td>
<td>1122</td>
<td>1037</td>
<td>3562</td>
<td>1176</td>
<td>1234</td>
</tr>
<tr>
<td>Sentence</td>
<td>2101</td>
<td>727</td>
<td>674</td>
<td>1746</td>
<td>577</td>
<td>587</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Implementation Details:</bold> For TIFFL, pretrained RoBERTa-base [<xref ref-type="bibr" rid="ref-50">50</xref>] and ResNet-152 [<xref ref-type="bibr" rid="ref-51">51</xref>] are employed as the text and image encoders. For parameter optimization, we adopt the AdamW optimizer with a weight decay of 0.01. Specifically, we set the batch size to 32 during the training phase and 16 during the development and testing phases, the number of training epochs to 25, the weight values <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> to 0.6 and 0.5 on Twitter-15 and to 0.7 and 0.4 on Twitter-17, the <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mi>K</mml:mi></mml:math></inline-formula> value to 5, the number of GCN layers to 2, and the learning rate to 3e-5. The final experimental results are reported as the average scores of three independent training runs for all models. Our experiments are implemented in PyTorch and run on an NVIDIA Tesla V100 GPU.</p>
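<p>For readers who wish to reproduce this setup, the hyperparameters above can be collected in a small configuration object. The sketch below is only one possible layout: the class name, field names and dataset keys are hypothetical, and only the values are taken from the text.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in the implementation details (names are hypothetical)."""

    dataset: str
    alpha: float  # weight on H_G (TAI)
    beta: float  # weight on H_{S->A} (IAI)
    train_batch_size: int = 32
    eval_batch_size: int = 16  # development and testing phases
    epochs: int = 25
    top_k_anps: int = 5
    gcn_layers: int = 2
    learning_rate: float = 3e-5
    weight_decay: float = 0.01  # AdamW
    num_runs: int = 3  # results are averaged over independent runs


CONFIGS = {
    "twitter15": TrainingConfig(dataset="twitter15", alpha=0.6, beta=0.5),
    "twitter17": TrainingConfig(dataset="twitter17", alpha=0.7, beta=0.4),
}
```
</preformat>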
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Compared Baselines</title>
<p>Since JMASA consists of two subtasks, MATE and MASC, we compare TIFFL with various existing methods on the JMASA, MATE and MASC tasks to evaluate the performance of our model. <xref ref-type="table" rid="table-3">Tables 3</xref>&#x2013;<xref ref-type="table" rid="table-5">5</xref> list the unimodal and multimodal baselines selected for the three tasks.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Compared baselines for the JMASA task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>SPAN [<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>SPAN is a method for extracting aspect terms and predicting sentiment polarities through an LSTM-based multi-span decoding algorithm for the text modality only.</td>
</tr>
<tr>
<td>D-GCN [<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>D-GCN is a method for extracting aspect terms and predicting sentiment polarities through dependencies between words for the text modality only.</td>
</tr>
<tr>
<td>RoBERTa [<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>RoBERTa is a pretrained language model that improves on BERT by employing better training strategies and a larger corpus, for the text modality only.</td>
</tr>
<tr>
<td>UMT [<xref ref-type="bibr" rid="ref-43">43</xref>] &#x002B; TomBERT [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>UMT is an MNER model based on span detection, and TomBERT is an MABSA model based on the BERT architecture; UMT &#x002B; TomBERT combines the two models to accomplish the JMASA task for the text and image modalities.</td>
</tr>
<tr>
<td>OSCGA [<xref ref-type="bibr" rid="ref-44">44</xref>] &#x002B; TomBERT</td>
<td>OSCGA is an MNER model based on entity alignment; OSCGA &#x002B; TomBERT combines OSCGA with TomBERT to accomplish the JMASA task for the text and image modalities.</td>
</tr>
<tr>
<td>UMT-collapse</td>
<td>UMT-collapse applies UMT to the JMASA task for the text and image modalities.</td>
</tr>
<tr>
<td>OSCGA-collapse</td>
<td>OSCGA-collapse applies OSCGA to the JMASA task for the text and image modalities.</td>
</tr>
<tr>
<td>UMT-RoBERTa</td>
<td>UMT-RoBERTa replaces BERT in UMT-collapse with RoBERTa for the text and image modalities.</td>
</tr>
<tr>
<td>JML [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>JML is a JMASA model based on auxiliary cross-modal relation detection for the text and image modalities.</td>
</tr>
<tr>
<td>VLP-MABSA [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>VLP-MABSA is a JMASA model based on a unified multimodal encoder-decoder architecture for the text and image modalities.</td>
</tr>
<tr>
<td>CMMT [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>CMMT is a JMASA model based on text-guided cross-modal interaction for the text and image modalities.</td>
</tr>
<tr>
<td>SAAF [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>SAAF is a JMASA model based on a text-image selective fusion mechanism for the text and image modalities.</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Compared baselines for the MATE task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAN [<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>RAN is an MNER model based on region-aware alignment for the text and image modalities.</td>
</tr>
<tr>
<td>UMT</td>
<td>UMT is an MNER model based on span detection for the text and image modalities.</td>
</tr>
<tr>
<td>OSCGA</td>
<td>OSCGA is an MNER model based on entity alignment for the text and image modalities.</td>
</tr>
<tr>
<td>MNER-QG [<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>MNER-QG is an MNER model based on an end-to-end machine reading comprehension framework for the text and image modalities.</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Compared baselines for the MASC task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>MIMN [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>MIMN is an MABSA model based on multi-interactive Bi-LSTM for the text and image modalities.</td>
</tr>
<tr>
<td>ESAFN [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>ESAFN is an MABSA model based on an entity-aware attention fusion network for the text and image modalities.</td>
</tr>
<tr>
<td>TomBERT</td>
<td>TomBERT is an MABSA model based on the BERT architecture for the text and image modalities.</td>
</tr>
<tr>
<td>CapBERT [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>CapBERT is an MABSA model that converts image semantics into a caption and encodes it together with the input text for the text and image modalities.</td>
</tr>
<tr>
<td>KEF-TomBERT [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>KEF-TomBERT is an MABSA model that applies the knowledge enhancement framework KEF to TomBERT for the text and image modalities.</td>
</tr>
<tr>
<td>TomRoBERTa</td>
<td>TomRoBERTa replaces BERT in TomBERT with RoBERTa for the text and image modalities.</td>
</tr>
<tr>
<td>CapRoBERTa</td>
<td>CapRoBERTa replaces BERT in CapBERT with RoBERTa for the text and image modalities.</td>
</tr>
<tr>
<td>KEF-TomRoBERTa</td>
<td>KEF-TomRoBERTa replaces BERT in KEF-TomBERT with RoBERTa for the text and image modalities.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Experimental Results and Analysis</title>
<p>In this section, we perform experiments with TIFFL and corresponding compared baselines on the JMASA, MATE and MASC tasks, then analyze the generated results.</p>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Experiments for the JMASA Task</title>
<p><xref ref-type="table" rid="table-6">Table 6</xref> shows the experimental results of TIFFL and the compared baselines for JMASA on the Twitter-15 and Twitter-17 datasets. We choose Precision (P), Recall (R) and Macro-F1 (F1) as evaluation metrics and mark the best score for each metric in bold. Results marked with * are produced with our implementation. Compared with the strongest baselines, CMMT and SAAF, TIFFL achieves competitive results on the Twitter datasets through multimodal feature correlation discrimination and intra-modal feature fine-grained analysis: it maintains comparable performance on Twitter-15, while on Twitter-17 it improves precision, recall and Macro-F1 by about 0.9, 0.6 and 0.8 percentage points over CMMT and by about 0.3, 1.0 and 0.7 percentage points over SAAF, respectively.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Experimental results of TIFFL and compared baselines for the JMASA task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
</colgroup>
<thead>
<tr>
<th>Modality</th>
<th>Method</th>
<th colspan="3" align="center">Twitter-15</th>
<th colspan="3" align="center">Twitter-17</th>
</tr>
<tr>
<th/>
<th/>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Text</td>
<td>SPAN</td>
<td>53.7</td>
<td>53.9</td>
<td>53.8</td>
<td>59.6</td>
<td>61.7</td>
<td>60.6</td>
</tr>
<tr>
<td>D-GCN</td>
<td>58.3</td>
<td>58.8</td>
<td>59.4</td>
<td>64.1</td>
<td>64.2</td>
<td>64.1</td>
</tr>
<tr>
<td>RoBERTa</td>
<td>61.8</td>
<td>65.3</td>
<td>63.5</td>
<td>65.5</td>
<td>66.9</td>
<td>66.2</td>
</tr>
<tr>
<td rowspan="10">Text &#x002B; Image</td>
<td>UMT &#x002B; TomBERT</td>
<td>58.4</td>
<td>61.3</td>
<td>59.8</td>
<td>62.3</td>
<td>62.4</td>
<td>62.4</td>
</tr>
<tr>
<td>OSCGA &#x002B; TomBERT</td>
<td>61.7</td>
<td>63.4</td>
<td>62.5</td>
<td>63.4</td>
<td>64.0</td>
<td>63.7</td>
</tr>
<tr>
<td>UMT-collapse</td>
<td>60.4</td>
<td>61.6</td>
<td>61.0</td>
<td>60.0</td>
<td>61.7</td>
<td>60.8</td>
</tr>
<tr>
<td>OSCGA-collapse</td>
<td>63.1</td>
<td>63.7</td>
<td>63.2</td>
<td>63.5</td>
<td>63.5</td>
<td>63.5</td>
</tr>
<tr>
<td>UMT-RoBERTa</td>
<td>61.6</td>
<td>66.4</td>
<td>63.9</td>
<td>65.3</td>
<td>68.2</td>
<td>66.7</td>
</tr>
<tr>
<td>JML</td>
<td>65.0</td>
<td>63.2</td>
<td>64.1</td>
<td>66.5</td>
<td>65.5</td>
<td>66.0</td>
</tr>
<tr>
<td>VLP-MABSA*</td>
<td>64.8</td>
<td>68.3</td>
<td>66.3</td>
<td>66.4</td>
<td>69.0</td>
<td>67.9</td>
</tr>
<tr>
<td>CMMT</td>
<td>64.6</td>
<td><bold>68.7</bold></td>
<td><bold>66.5</bold></td>
<td>67.6</td>
<td>69.4</td>
<td>68.5</td>
</tr>
<tr>
<td>SAAF</td>
<td><bold>65.6</bold></td>
<td>67.3</td>
<td>66.4</td>
<td>68.2</td>
<td>69.0</td>
<td>68.6</td>
</tr>
<tr>
<td>TIFFL (Ours)*</td>
<td>65.0</td>
<td>68.3</td>
<td><bold>66.5</bold></td>
<td><bold>68.5</bold></td>
<td><bold>70.0</bold></td>
<td><bold>69.3</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From the experimental results, we draw the following conclusions: (1) Since RoBERTa improves on BERT by optimizing training strategies and adopting a larger corpus, its performance is much better than that of SPAN and D-GCN; (2) UMT &#x002B; TomBERT and OSCGA &#x002B; TomBERT are both pipeline methods that combine MATE and MASC, and on Twitter-15 their performance is worse than that of UMT-collapse and OSCGA-collapse, which may be attributed to error propagation between the two tasks; (3) UMT-RoBERTa performs better than UMT-collapse, which again indicates that RoBERTa is more powerful than BERT; (4) JML, VLP-MABSA, CMMT and SAAF generally perform better than the other compared baselines, which indicates that methods specifically designed for JMASA can better accomplish this target task; (5) Although JML employs auxiliary cross-modal relation detection to control the rational utilization of visual information, its lack of in-depth modeling of unimodal features leads to inferior performance compared to TIFFL; (6) VLP-MABSA simplifies model complexity with a unified multimodal encoder-decoder architecture, but its direct integration of text and image features may introduce extra interference, which is likely a major reason why its performance falls short of that of TIFFL.</p>
<p>However, relative to CMMT and SAAF, TIFFL does not perform as well on Twitter-15 as on Twitter-17. We speculate that the possible reasons are as follows:</p>
<p>CMMT adopts all 2089 ANPs extracted from the image as visual auxiliary supervision, which helps guarantee accurate recognition of image semantics through sheer coverage. Since our chosen Top-<inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs might contain errors such as misrecognized words, focusing only on these ANPs may introduce some noise. In addition, although CMMT does not discriminate the correlation between text and image features, Twitter-15 might contain more samples with high text-image correlation than Twitter-17, so this discrimination brings less benefit there and the performance of TIFFL on Twitter-15 is less favorable. This argument is supported by the ablation study in <xref ref-type="sec" rid="s4_4">Section 4.4</xref> and the parameter setting component in <xref ref-type="sec" rid="s4_5">Section 4.5</xref>. Moreover, CMMT introduces another label sequence as text auxiliary supervision, thus achieving performance enhancement by constructing auxiliary supervision modules for both text and image modalities, but this comes with substantial storage overhead and manual labeling cost. TIFFL avoids these problems by choosing a more cost-effective method, which is an additional advantage of our model.</p>
<p>SAAF adopts a Beta distribution to adjust the scalar weight that balances text and image features, which enhances the resilience of the image representation incorporating text features and bridges the inter-modal semantic gap, thus resulting in superior model performance. However, when computing the gate vector, SAAF employs only the text-aware image representation and ignores that the text may contain information unrelated to image semantics, which may introduce extra noise into the gate vector construction and cause certain defects in image filtering. Similarly, the ablation study in <xref ref-type="sec" rid="s4_4">Section 4.4</xref> validates that the proportion of samples with low text-image correlation is higher in Twitter-17 than in Twitter-15. Therefore, the gate vector of SAAF might not play a significant role on Twitter-15, whereas TIFFL employs both image-aware text and text-aware image representations to construct a more effective gating mechanism that works better on Twitter-17. Meanwhile, <xref ref-type="sec" rid="s4_3_2">Section 4.3.2</xref> shows that TIFFL achieves better results for the MATE task on Twitter-17, which is a key factor in the performance enhancement on this dataset.</p>
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Experiments for the MATE Task</title>
<p><xref ref-type="table" rid="table-7">Table 7</xref> shows the experimental results of TIFFL and the compared baselines for the MATE task on Twitter-15 and Twitter-17, again with P, R and F1 as evaluation metrics. TIFFL achieves the best results among the methods in the table on both Twitter datasets, improving Macro-F1 by about 0.9% and 1.9% over CMMT on Twitter-15 and Twitter-17, and by about 1.1% and 2.0% over SAAF, respectively. These results also support that our proposed multimodal feature correlation discrimination and intra-modal feature fine-grained analysis methods accomplish aspect term extraction effectively and thereby enhance the overall model performance on JMASA.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Experimental results of TIFFL and compared baselines for the MATE task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
</colgroup>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="3" align="center">Twitter-15</th>
<th colspan="3" align="center">Twitter-17</th>
</tr>
<tr>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>RoBERTa</td>
<td>84.0</td>
<td>87.1</td>
<td>85.5</td>
<td>92.1</td>
<td>93.4</td>
<td>92.7</td>
</tr>
<tr>
<td>RAN</td>
<td>80.5</td>
<td>81.5</td>
<td>81.0</td>
<td>90.7</td>
<td>90.0</td>
<td>90.3</td>
</tr>
<tr>
<td>UMT</td>
<td>77.8</td>
<td>81.7</td>
<td>79.7</td>
<td>86.7</td>
<td>86.8</td>
<td>86.7</td>
</tr>
<tr>
<td>OSCGA</td>
<td>81.7</td>
<td>82.1</td>
<td>81.9</td>
<td>90.2</td>
<td>90.7</td>
<td>90.4</td>
</tr>
<tr>
<td>MNER-QG</td>
<td>82.7</td>
<td>81.2</td>
<td>81.7</td>
<td>88.3</td>
<td>86.8</td>
<td>87.3</td>
</tr>
<tr>
<td>JML</td>
<td>83.6</td>
<td>81.2</td>
<td>82.4</td>
<td>92.0</td>
<td>90.7</td>
<td>91.4</td>
</tr>
<tr>
<td>VLP-MABSA*</td>
<td>83.1</td>
<td>88.2</td>
<td>85.5</td>
<td>90.2</td>
<td>92.5</td>
<td>91.3</td>
</tr>
<tr>
<td>CMMT</td>
<td>83.9</td>
<td>88.1</td>
<td>85.9</td>
<td>92.2</td>
<td>93.9</td>
<td>93.1</td>
</tr>
<tr>
<td>SAAF*</td>
<td>83.9</td>
<td>88.0</td>
<td>85.7</td>
<td>92.5</td>
<td>93.4</td>
<td>93.0</td>
</tr>
<tr>
<td>TIFFL (Ours)*</td>
<td><bold>84.7</bold></td>
<td><bold>89.0</bold></td>
<td><bold>86.8</bold></td>
<td><bold>94.5</bold></td>
<td><bold>95.5</bold></td>
<td><bold>95.0</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
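<p>As a minimal sketch of how P, R and F1 can be computed for aspect term extraction, the following Python snippet assumes exact-match scoring of predicted aspect terms against gold terms, pooled over all samples. The function name <monospace>prf1</monospace> and the set-based representation are illustrative assumptions and not necessarily the evaluation script used in the experiments; the toy input reuses sample (a) of <xref ref-type="table" rid="table-10">Table 10</xref>, where the baselines extract one extra term.</p>
<code language="python"><![CDATA[
```python
from typing import List, Set, Tuple


def prf1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1, pooled over all samples (assumed protocol)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly extracted terms
    n_pred = sum(len(p) for p in pred)                # all extracted terms
    n_gold = sum(len(g) for g in gold)                # all gold terms
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Sample (a) of Table 10: the baselines extract "# Cubs" in addition to the gold terms.
gold = [{"Joe Maddon", "Jason Motte", "Jon Lester", "Mesa"}]
pred = [{"Joe Maddon", "# Cubs", "Jason Motte", "Jon Lester", "Mesa"}]
p, r, f = prf1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.800 R=1.000 F1=0.889
```
]]></code>
<p>In this toy case the single spurious term lowers precision while recall stays at 1, which is the error pattern that the gating mechanism is intended to suppress.</p>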
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Experiments for the MASC Task</title>
<p><xref ref-type="table" rid="table-8">Table 8</xref> shows the experimental results of TIFFL and the compared baselines for the MASC task on Twitter-15 and Twitter-17, with Accuracy (Acc) and F1 as evaluation metrics. Although TIFFL obtains the highest F1 on both datasets (tied with SAAF on Twitter-17), its Accuracy is not particularly favorable compared to the other methods in the table. We speculate that the reason is the same as that described in <xref ref-type="sec" rid="s4_3_1">Section 4.3.1</xref>: our Top-<inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs may introduce extra noise due to erroneous information, which the results in this subsection suggest has a significant impact on sentiment prediction. KEF, an enhancement framework for MASC, also employs ANPs to enhance the image semantic representation, but it filters out unrelated ANPs by calculating the similarity between a specific aspect term and the noun in each ANP, and thereby achieves better Accuracy on Twitter-15. Nevertheless, TIFFL performs well on the MATE task, which compensates for its deficiencies on MASC and thus still enhances the overall model performance on JMASA.</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Experimental results of TIFFL and compared baselines for the MASC task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
</colgroup>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="2" align="center">Twitter-15</th>
<th colspan="2" align="center">Twitter-17</th>
</tr>
<tr>
<th>Acc</th>
<th>F1</th>
<th>Acc</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>RoBERTa</td>
<td>76.3</td>
<td>71.4</td>
<td>69.8</td>
<td>68.0</td>
</tr>
<tr>
<td>MIMN</td>
<td>71.8</td>
<td>65.7</td>
<td>65.9</td>
<td>63.0</td>
</tr>
<tr>
<td>ESAFN</td>
<td>73.4</td>
<td>67.4</td>
<td>67.8</td>
<td>64.2</td>
</tr>
<tr>
<td>TomBERT</td>
<td>77.2</td>
<td>71.8</td>
<td>70.3</td>
<td>68.0</td>
</tr>
<tr>
<td>CapBERT</td>
<td>78.0</td>
<td>73.3</td>
<td>69.8</td>
<td>68.4</td>
</tr>
<tr>
<td>KEF-TomBERT</td>
<td>78.7</td>
<td>73.8</td>
<td>72.1</td>
<td>70.0</td>
</tr>
<tr>
<td>TomRoBERTa</td>
<td>77.6</td>
<td>73.2</td>
<td>71.3</td>
<td>70.1</td>
</tr>
<tr>
<td>CapRoBERTa</td>
<td>77.8</td>
<td>73.4</td>
<td>71.1</td>
<td>68.6</td>
</tr>
<tr>
<td>KEF-TomRoBERTa*</td>
<td><bold>78.8</bold></td>
<td>74.0</td>
<td>72.2</td>
<td>70.2</td>
</tr>
<tr>
<td>JML</td>
<td>78.7</td>
<td>&#x2013;</td>
<td>72.7</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>VLP-MABSA*</td>
<td>78.5</td>
<td>73.8</td>
<td>73.2</td>
<td>71.4</td>
</tr>
<tr>
<td>CMMT</td>
<td>77.9</td>
<td>&#x2013;</td>
<td><bold>73.8</bold></td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SAAF*</td>
<td>78.6</td>
<td>73.7</td>
<td>73.1</td>
<td><bold>71.6</bold></td>
</tr>
<tr>
<td>TIFFL (Ours)*</td>
<td>78.4</td>
<td><bold>74.5</bold></td>
<td>73.0</td>
<td><bold>71.6</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Study</title>
<p>To further investigate the effectiveness of our proposed methods, we perform ablation analysis for three important units in TIFFL on Twitter-15 and Twitter-17: (1) Image Gating Mechanism (IGM). (2) Text Auxiliary Information (TAI). (3) Image Auxiliary Information (IAI). We first remove each unit individually and then remove all three units simultaneously to comprehensively demonstrate the contribution of each model unit. The ablation study results are shown in <xref ref-type="table" rid="table-9">Table 9</xref>.</p>
<table-wrap id="table-9">
<label>Table 9</label>
<caption>
<title>Ablation study of TIFFL</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
<col align="left" />
</colgroup>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="3" align="center">Twitter-15</th>
<th colspan="3" align="center">Twitter-17</th>
</tr>
<tr>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>TIFFL</td>
<td><bold>65.0</bold></td>
<td><bold>68.3</bold></td>
<td><bold>66.5</bold></td>
<td><bold>68.5</bold></td>
<td><bold>70.0</bold></td>
<td><bold>69.3</bold></td>
</tr>
<tr>
<td>TIFFL w/o IGM</td>
<td>62.9</td>
<td>66.5</td>
<td>64.7</td>
<td>66.4</td>
<td>67.6</td>
<td>67.0</td>
</tr>
<tr>
<td>TIFFL w/o TAI</td>
<td>62.4</td>
<td>65.9</td>
<td>64.3</td>
<td>66.8</td>
<td>67.6</td>
<td>67.2</td>
</tr>
<tr>
<td>TIFFL w/o IAI</td>
<td>63.2</td>
<td>66.9</td>
<td>65.0</td>
<td>67.2</td>
<td>68.5</td>
<td>68.0</td>
</tr>
<tr>
<td>TIFFL w/o IGM &#x0026; TAI &#x0026; IAI</td>
<td>61.9</td>
<td>65.2</td>
<td>63.5</td>
<td>65.1</td>
<td>67.5</td>
<td>66.3</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>First, removing IGM decreases Macro-F1 by about 1.8% and 2.3% on Twitter-15 and Twitter-17, which indicates that the inter-modal fusion strategy in our model can effectively filter the image information unrelated to text semantics. Next, removing TAI decreases Macro-F1 by about 2.2% and 2.1%, which indicates that the sentiment-enhanced GCN introduced in our model can deeply learn the syntactic structure features of text. Then, removing IAI decreases Macro-F1 by about 1.5% and 1.3%, which indicates that our adopted ANPs can better assist the image semantic expression from the text level. Finally, removing all three units decreases Macro-F1 by about 3.0% on both datasets, which also indicates that our methods contribute to the model performance on JMASA.</p>
<p>However, removing IAI decreases Macro-F1 less than removing IGM or TAI, and removing IGM decreases Macro-F1 less on Twitter-15 than on Twitter-17. These observations support the arguments in <xref ref-type="sec" rid="s4_3_1">Section 4.3.1</xref> that the Top-<inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs might be affected by erroneous information that limits model performance, and that Twitter-17 contains more samples with low text-image correlation than Twitter-15.</p>
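<p>The drops discussed above can be recomputed directly from <xref ref-type="table" rid="table-9">Table 9</xref>. The short Python sketch below copies the F1 columns of the table and subtracts each ablated variant from the full model; the dictionary layout and the helper name <monospace>f1_drops</monospace> are illustrative assumptions.</p>
<code language="python"><![CDATA[
```python
# F1 values copied from Table 9 as (Twitter-15, Twitter-17).
F1 = {
    "TIFFL": (66.5, 69.3),
    "w/o IGM": (64.7, 67.0),
    "w/o TAI": (64.3, 67.2),
    "w/o IAI": (65.0, 68.0),
    "w/o IGM & TAI & IAI": (63.5, 66.3),
}


def f1_drops(table):
    """Return the F1 drop of every ablated variant relative to the full model."""
    full = table["TIFFL"]
    return {
        name: tuple(round(f - a, 1) for f, a in zip(full, scores))
        for name, scores in table.items()
        if name != "TIFFL"
    }


for name, (d15, d17) in f1_drops(F1).items():
    print(f"{name:22s} Twitter-15: -{d15:.1f}  Twitter-17: -{d17:.1f}")
```
]]></code>
<p>Running the sketch reproduces the reported drops of 1.8/2.3 (IGM), 2.2/2.1 (TAI), 1.5/1.3 (IAI) and 3.0/3.0 (all three units) on Twitter-15/Twitter-17.</p>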
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Parameter Analysis</title>
<p>In this section, we detail how the optimal hyper-parameters were determined. All of the above experiments use TIFFL with the tuned hyper-parameters.</p>
<sec id="s4_5_1">
<label>4.5.1</label>
<title><inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> <italic>Value</italic></title>
<p>To test the effect of the TAI weight <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> on model performance during inter-modal feature fusion, we vary <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> in steps of 0.1 over <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. <xref ref-type="fig" rid="fig-4">Fig. 4a</xref>,<xref ref-type="fig" rid="fig-4">b</xref> show the model performance for different <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> values on Twitter-15 and Twitter-17, respectively.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Effect of <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> value on model performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-4.tif"/>
</fig>
<p>The test results show that our model performs worse without TAI, which indicates that the sentiment-enhanced GCN contributes to model performance to some extent. As <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> increases, the performance follows a fluctuating upward trend, with the best results when <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is 0.6 and 0.7 on Twitter-15 and Twitter-17, respectively. However, the performance starts to decrease as <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> increases further. A possible reason is as follows: when constructing TAI, the extracted noun phrases can only be treated as candidate aspect terms for reference because image information is not yet incorporated. The model assigns a larger proportion to TAI as <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> increases, so noun phrases that are not aspect terms might also receive too much attention and introduce extra noise.</p>
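<p>The sweep over a fusion weight described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: <monospace>sweep</monospace> is a hypothetical helper, and <monospace>toy_score</monospace> is a made-up stand-in for a full train-and-evaluate run of the model, chosen only so that the example is runnable; it does not reproduce any reported result.</p>
<code language="python"><![CDATA[
```python
from typing import Callable, List, Tuple


def sweep(evaluate: Callable[[float], float],
          step: float = 0.1) -> Tuple[float, List[Tuple[float, float]]]:
    """Evaluate every weight in [0, 1] at the given step and return the best one."""
    n = round(1 / step)
    # Rounding removes floating-point drift such as 0.30000000000000004.
    grid = [round(i * step, 10) for i in range(n + 1)]
    curve = [(w, evaluate(w)) for w in grid]
    best_w, _ = max(curve, key=lambda t: t[1])
    return best_w, curve


def toy_score(alpha: float) -> float:
    """Hypothetical stand-in for training and scoring the model; peaks at 0.6."""
    return 1.0 - (alpha - 0.6) ** 2


best_alpha, curve = sweep(toy_score)
print(best_alpha, len(curve))
```
]]></code>
<p>In practice <monospace>evaluate</monospace> would train the model with the given weight and return the score on held-out data, and the same helper could be reused for the other fusion weight.</p>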
</sec>
<sec id="s4_5_2">
<label>4.5.2</label>
<title><inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> <italic>Value</italic></title>
<p>To analyze the effect of the IAI weight <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> during inter-modal feature fusion, we likewise vary <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> in steps of 0.1 over <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. <xref ref-type="fig" rid="fig-5">Fig. 5a</xref>,<xref ref-type="fig" rid="fig-5">b</xref> show the model performance for different <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> values on Twitter-15 and Twitter-17, respectively.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Effect of <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> value on model performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-5.tif"/>
</fig>
<p>The test results show that model performance follows a fluctuating upward trend as <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> increases, with the best results when <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> is 0.5 and 0.4 on Twitter-15 and Twitter-17, respectively. However, the performance starts to decrease as <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> increases further. A possible reason is that IAI might contain some erroneous ANPs unrelated to the image semantics. When <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> exceeds a certain range, the model pays more attention to IAI and weakens the dominance of the original image features, so the erroneous information is amplified and the performance degrades continuously.</p>
</sec>
<sec id="s4_5_3">
<label>4.5.3</label>
<title><inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mi>K</mml:mi></mml:math></inline-formula> <italic>Value</italic></title>
<p>To explore the optimal number of ANPs <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mi>K</mml:mi></mml:math></inline-formula> in IAI, we set <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi>K</mml:mi></mml:math></inline-formula> to each integer in the range of <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:mrow><mml:mo>[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. <xref ref-type="fig" rid="fig-6">Fig. 6a</xref>,<xref ref-type="fig" rid="fig-6">b</xref> show the model performance for different <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mi>K</mml:mi></mml:math></inline-formula> values on Twitter-15 and Twitter-17, respectively.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Effect of <inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:mi>K</mml:mi></mml:math></inline-formula> value on model performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_55943-fig-6.tif"/>
</fig>
<p>The test results show that our model performs worse when <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mi>K</mml:mi></mml:math></inline-formula> is equal to 0, which indicates that adopting ANPs as IAI can further enhance the model performance. As <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mi>K</mml:mi></mml:math></inline-formula> increases, the performance follows a fluctuating upward trend, with the best results when <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:mi>K</mml:mi></mml:math></inline-formula> is equal to 5 on both Twitter datasets. However, the performance starts to decrease when <inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:mi>K</mml:mi></mml:math></inline-formula> exceeds 5. A possible reason is as follows: each text in the Twitter datasets involves up to five aspect terms, and unlike CMMT, which adopts all ANPs for vision representation supervision, we treat ANPs as candidates for the aspect terms in the image. When <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mi>K</mml:mi></mml:math></inline-formula> is greater than the number of aspect terms, IAI might add extra noise to the model by introducing too much misrecognized word information.</p>
<p>Moreover, by comparing the test results of the above three parts, we can see that tuning TAI improves model performance more than tuning IAI, which also supports the argument in <xref ref-type="sec" rid="s4_3_1">Section 4.3.1</xref>.</p>
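<p>To make the role of <monospace>K</monospace> concrete, the sketch below selects the Top-K ANPs from a scored candidate list. It assumes that each ANP comes with a confidence score from an ANP detector; the scores, the last two ANP strings and the helper name <monospace>top_k_anps</monospace> are hypothetical and serve only as an illustration, while the first five ANP strings are taken from sample (b) of <xref ref-type="table" rid="table-10">Table 10</xref>.</p>
<code language="python"><![CDATA[
```python
import heapq
from typing import Dict, List


def top_k_anps(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring ANPs, best first (k = 0 yields no ANPs)."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical detector scores; the last two ANP strings are also made up.
scores = {
    "holy child": 0.31,
    "excited crowd": 0.27,
    "proud student": 0.22,
    "talented kids": 0.18,
    "successful team": 0.15,
    "old car": 0.04,
    "cute dog": 0.02,
}

print(top_k_anps(scores, 5))
```
]]></code>
<p>With a larger <monospace>K</monospace>, low-scoring candidates such as the last two entries would also enter IAI, which matches the intuition that too many ANPs add misrecognized words.</p>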
</sec>
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Case Study</title>
<p>In this section, we choose four representative samples to compare TIFFL with RoBERTa, CMMT and SAAF on the MATE and JMASA tasks and thereby illustrate the advantages of our model. We also remove, one at a time, the three important units of TIFFL described in <xref ref-type="sec" rid="s4_4">Section 4.4</xref> to illustrate the effectiveness of our designed methods on these samples. The case information and prediction results are shown in <xref ref-type="table" rid="table-10">Tables 10</xref> and <xref ref-type="table" rid="table-11">11</xref>.</p>
<table-wrap id="table-10">
<label>Table 10</label>
<caption>
<title>Case study for the MATE task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<tbody>
<tr>
<td>Image</td>
<td><inline-graphic xlink:href="CMC_55943-inline-3.tif"/></td>
<td><inline-graphic xlink:href="CMC_55943-inline-4.tif"/></td>
</tr>
<tr>
<td>Text</td>
<td>(a) Joe Maddon talks to # Cubs pitcher Jason Motte (30) as Jon Lester stretches on first day of spring training in Mesa</td>
<td>(b) RT @ WizardGirlsNBA: Excited to have Cheerleaders Gdynia here the way from Poland for # PolishHeritageNight # WizKings</td>
</tr>
<tr>
<td>Noun phrase</td>
<td>Joe Maddon, # Cubs, pitcher, Jason Motte, Jon Lester, day, spring training, Mesa</td>
<td>WizardGirlsNBA, Cheerleaders Gdynia, way, Poland</td>
</tr>
<tr>
<td>Top-<inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs</td>
<td>holy cross, outdoor sports, mad dog, holy angels, classic race</td>
<td>holy child, excited crowd, proud student, talented kids, successful team</td>
</tr>
<tr>
<td>Label</td>
<td>(Joe Maddon, Neutral)</td>
<td>(Cheerleaders Gdynia, Positive)</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral)</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral)</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral)</td>
<td/>
</tr>
<tr>
<td>RoBERTa</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(# Cubs, Neutral) &#x2717;</td>
<td>(Poland, Neutral) &#x2717;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>CMMT</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(# Cubs, Neutral) &#x2717;</td>
<td>(Poland, Neutral) &#x2717;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>SAAF</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(# Cubs, Neutral) &#x2717;</td>
<td>(Poland, Neutral) &#x2717;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>TIFFL (Ours)</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>TIFFL w/o IGM</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(# Cubs, Neutral) &#x2717;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>TIFFL w/o TAI</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td>TIFFL w/o IAI</td>
<td>(Joe Maddon, Neutral) &#x2713;</td>
<td>(Cheerleaders Gdynia, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td>(Jason Motte, Neutral) &#x2713;</td>
<td>(Poland, Neutral) &#x2717;</td>
</tr>
<tr>
<td/>
<td>(Jon Lester, Neutral) &#x2713;</td>
<td/>
</tr>
<tr>
<td/>
<td>(Mesa, Neutral) &#x2713;</td>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-11">
<label>Table 11</label>
<caption>
<title>Case study for the JMASA task</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<tbody>
<tr>
<td>Image</td>
<td><inline-graphic xlink:href="CMC_55943-inline-5.tif"/></td>
<td><inline-graphic xlink:href="CMC_55943-inline-6.tif"/></td>
</tr>
<tr>
<td>Text</td>
<td>(a) Jean Marmoreo - ready to run ! # stwm</td>
<td>(b) I&#x2019;ve just witnessed Wes Morgan lift a premier league title. Football really is bonkers</td>
</tr>
<tr>
<td>Noun phrase</td>
<td>Jean Marmoreo, # stwm</td>
<td>Wes Morgan, premier league, Football, bonkers</td>
</tr>
<tr>
<td>Top-<inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs</td>
<td>young driver, young fan, happy christmas, bad girls, fat pig</td>
<td>sexy halloween, ill child, excited crowd, fancy dress, great party</td>
</tr>
<tr>
<td>Label</td>
<td>(Jean Marmoreo, Positive)</td>
<td>(Wes Morgan, Positive)</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral)</td>
</tr>
<tr>
<td>RoBERTa</td>
<td>(Jean Marmoreo, Neutral) &#x2717;</td>
<td>(Wes Morgan, Negative) &#x2717;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>CMMT</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Negative) &#x2717;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>SAAF</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Negative) &#x2717;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>TIFFL (Ours)</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>TIFFL w/o IGM</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>TIFFL w/o TAI</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Negative) &#x2717;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
<tr>
<td>TIFFL w/o IAI</td>
<td>(Jean Marmoreo, Positive) &#x2713;</td>
<td>(Wes Morgan, Positive) &#x2713;</td>
</tr>
<tr>
<td/>
<td/>
<td>(premier league, Neutral) &#x2713;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>First, <xref ref-type="table" rid="table-10">Table 10</xref> shows two samples where TIFFL is dominant on the MATE task. In the sample of <xref ref-type="table" rid="table-10">Table 10</xref>(a), RoBERTa, CMMT and SAAF extract the incorrect aspect term &#x201C;# Cubs&#x201D; in addition to the correct aspect terms, while TIFFL achieves accurate aspect term extraction and sentiment polarity prediction. We speculate that the possible reasons are as follows. The inter-modal fusion strategy in TIFFL can effectively combine text and image features to accomplish MATE. Although &#x201C;# Cubs&#x201D; describes the aspect term &#x201C;Jason Motte&#x201D;, it is not reflected in the image. CMMT, however, dynamically controls the intervention of image information by predicting word confidence, which gives text information the dominant role on the target task and underestimates the effect of image information, and SAAF does not consider the image-unrelated information in text when constructing the gate vector. Therefore, both CMMT, which preferentially extracts the aspect terms in text, and RoBERTa, which extracts aspect terms from text only, make incorrect predictions, while SAAF also makes incorrect predictions because its imperfect gate vector introduces extra noise. Furthermore, &#x201C;# Cubs&#x201D; is also extracted when IGM is removed from TIFFL. We hypothesize that, without the gating mechanism, the image information unrelated to text semantics cannot be filtered efficiently, which introduces extra noise and degrades aspect term extraction.</p>

<p>In the sample of <xref ref-type="table" rid="table-10">Table 10</xref>(b), RoBERTa, CMMT and SAAF extract the incorrect aspect term &#x201C;Poland&#x201D; in addition to the correct aspect term, and &#x201C;Poland&#x201D; is also extracted when IAI is removed from TIFFL. We speculate that the possible reasons are as follows. TIFFL constructs IAI and can thus recognize from the nouns of the ANPs that &#x201C;Poland&#x201D; is not reflected in the image, whereas SAAF does not conduct further analysis of image features. RoBERTa handles only the text modality and cannot draw on image information. Moreover, although CMMT also adopts ANPs as visual auxiliary supervision, its use of all 2089 ANPs may include unrelated interfering information that leads to the extraction of the incorrect aspect term. Therefore, TIFFL achieves accurate aspect term extraction and sentiment polarity prediction by effectively combining text and image features.</p>

<p>Then, <xref ref-type="table" rid="table-11">Table 11</xref> shows two samples where TIFFL has an advantage on the JMASA task. In the sample of <xref ref-type="table" rid="table-11">Table 11</xref>(a), RoBERTa, CMMT, SAAF and TIFFL all extract the correct aspect term, but only RoBERTa makes an incorrect sentiment prediction for the aspect term &#x201C;Jean Marmoreo&#x201D;. We speculate that the possible reasons are as follows. RoBERTa can only analyze the text modality and cannot identify the facial expression in the image, whereas SAAF can combine image features to make the correct sentiment prediction. Although the Top-<inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mi>K</mml:mi></mml:math></inline-formula> ANPs contain more misrecognized word information, CMMT and TIFFL can also predict the correct sentiment by incorporating the smiling facial expression in the image.</p>

<p>In the sample of <xref ref-type="table" rid="table-11">Table 11</xref>(b), the four models also extract the correct aspect terms, but RoBERTa, CMMT and SAAF give wrong sentiment predictions for the aspect term &#x201C;Wes Morgan&#x201D;, and the sentiment of &#x201C;Wes Morgan&#x201D; is also incorrectly predicted when TAI is removed from TIFFL. We speculate that the possible reasons are as follows. The noun phrases extracted by TIFFL contain the actual aspect terms &#x201C;Wes Morgan&#x201D; and &#x201C;premier league&#x201D;, which helps our model perform the MATE task better, and TIFFL avoids the effect of the word &#x201C;bonkers&#x201D; on the previous sentence by introducing TAI to learn the syntactic structure features of text, which explains why removing TAI affects the prediction. In contrast, RoBERTa, CMMT and SAAF cannot learn the intrinsic features of text at a deeper level during text feature encoding, so &#x201C;bonkers&#x201D; may interfere with the aspect term &#x201C;Wes Morgan&#x201D; to some extent and cause it to be predicted as negative.</p>

</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, we propose a text-image feature fine-grained learning model TIFFL for Joint Multimodal Aspect-based Sentiment Analysis (JMASA). For text feature learning, the model constructs an enhanced adjacency matrix of word dependencies and learns the syntactic structure features of text by employing Graph Convolutional Network (GCN), thus solving the context interference problem of identifying different aspect terms. For image feature learning, the model introduces image Adjective-Noun Pairs (ANPs) to represent visual feature semantics more intuitively, thus solving the problem of ambiguous image semantic extraction. Thereby, the model can further enhance the performance of aspect term extraction and sentiment polarity prediction. Experiments on two Twitter benchmark datasets demonstrate that TIFFL outperforms most advanced studies on Twitter-15 and all compared baselines on Twitter-17, thus validating the superiority of our adopted methods.</p>
<p>Since TIFFL performs less well on Twitter-15 than on Twitter-17, we plan to optimize the model and investigate the Twitter-15 dataset specifically to identify the cause of the unsatisfactory performance. Moreover, we currently determine the values of the weight hyper-parameters through experimental tests, which relies too much on manual operation. Therefore, in future work we plan to learn these hyper-parameters automatically to generate the proportions of text and image auxiliary information in the inter-modal fusion representation, so that the model can assign more reasonable fusion weights according to the specific details of the individual modalities in each sample and achieve further performance enhancement.</p>
</sec>
</body>
<back>
<ack>
<p>This work was supported by the Science and Technology Project of Henan Province.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This work was supported by the Science and Technology Project of Henan Province (No. 222102210081).</p>
</sec>
<sec><title>Author Contributions</title>
<p>Tianzhi Zhang wrote the main manuscript text; Gang Zhou and Shuang Zhang participated in the experiments; Shunhang Li analyzed the data; Yepeng Sun processed the data; Qiankun Pi set up the experimental environment; and Shuo Liu prepared the tables and figures. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are available from the corresponding author, Gang Zhou, upon reasonable request.</p>
</sec>
<sec><title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Fan</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Shi</surname></string-name></person-group>, &#x201C;<article-title>Cross-modal consistency with aesthetic similarity for multimodal false information detection</article-title>,&#x201D; <source>Comput. Mater. Contin.</source>, vol. <volume>79</volume>, no. <issue>2</issue>, pp. <fpage>2723</fpage>&#x2013;<lpage>2741</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2024.050344</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Cao</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis</article-title>,&#x201D; <source>J. King Saud Univ.-Comput. Inf. Sci.</source>, vol. <volume>36</volume>, no. <issue>2</issue>, <year>2024</year>, Art. no. 101943. doi: <pub-id pub-id-type="doi">10.1016/j.jksuci.2024.101943</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Multimodal sentiment analysis based on a cross-modal multihead attention mechanism</article-title>,&#x201D; <source>Comput. Mater. Contin.</source>, vol. <volume>78</volume>, no. <issue>1</issue>, pp. <fpage>1157</fpage>&#x2013;<lpage>1170</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2023.042150</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Ju</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection</article-title>,&#x201D; in <conf-name>Proc. 2021 Conf. Empir. Methods Nat. Lang. Process.</conf-name>, <publisher-loc>Punta Cana, Dominican Republic</publisher-loc>, <year>2021</year>, pp. <fpage>4395</fpage>&#x2013;<lpage>4405</lpage>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Ling</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yu</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Xia</surname></string-name></person-group>, &#x201C;<article-title>Vision language pre-training for multimodal aspect-based sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. 60th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Dublin, Ireland</publisher-loc>, <year>2022</year>, pp. <fpage>2149</fpage>&#x2013;<lpage>2159</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J. C.</given-names> <surname>Na</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Yu</surname></string-name></person-group>, &#x201C;<article-title>Cross-modal multitask transformer for end-to-end multimodal aspect-based sentiment analysis</article-title>,&#x201D; <source>Inf. Process. Manag.</source>, vol. <volume>59</volume>, no. <issue>5</issue>, <year>2022</year>, Art. no. 103038. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2022.103038</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Guo</surname></string-name></person-group>, &#x201C;<article-title>Self-adaptive attention fusion for multimodal aspect-based sentiment analysis</article-title>,&#x201D; <source>Math. Biosci. Eng.</source>, vol. <volume>21</volume>, no. <issue>1</issue>, pp. <fpage>1305</fpage>&#x2013;<lpage>1320</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.3934/mbe.2024056</pub-id>; <pub-id pub-id-type="pmid">38303466</pub-id></mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>T. N.</given-names> <surname>Kipf</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Welling</surname></string-name></person-group>, &#x201C;<article-title>Semi-supervised classification with graph convolutional networks</article-title>,&#x201D; <year>2016</year>, <italic>arXiv:1609.02907</italic>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Borth</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Breuel</surname></string-name>, and <string-name><given-names>S. F.</given-names> <surname>Chang</surname></string-name></person-group>, &#x201C;<article-title>Large-scale visual sentiment ontology and detectors using adjective noun pairs</article-title>,&#x201D; in <conf-name>Proc. 21st ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>Barcelona, Spain</publisher-loc>, <year>2013</year>, pp. <fpage>223</fpage>&#x2013;<lpage>232</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Convolutional neural network for sentence classification</article-title>,&#x201D; <comment>M.S. thesis</comment>, <publisher-name>Univ. of Waterloo</publisher-name>, <publisher-loc>Waterloo, ON, Canada</publisher-loc>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Shin</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Lee</surname></string-name>, and <string-name><given-names>J. D.</given-names> <surname>Choi</surname></string-name></person-group>, &#x201C;<article-title>Lexicon integrated CNN models with attention for sentiment analysis</article-title>,&#x201D; <year>2016</year>, <italic>arXiv:1610.06272</italic>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>You</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jin</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Luo</surname></string-name></person-group>, &#x201C;<article-title>Visual sentiment analysis by attending on local image regions</article-title>,&#x201D; in <source>Proc. AAAI Conf. Artif. Intell.</source>, <publisher-loc>San Francisco, CA, USA</publisher-loc>, <year>2017</year>, vol. <volume>31</volume>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v31i1.10501</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Song</surname></string-name></person-group>, &#x201C;<article-title>Aspect-based sentiment classification with aspect-specific graph convolutional networks</article-title>,&#x201D; in <conf-name>Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP)</conf-name>, <publisher-loc>Hong Kong, China</publisher-loc>, <year>2019</year>, pp. <fpage>4567</fpage>&#x2013;<lpage>4577</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Carley</surname></string-name></person-group>, &#x201C;<article-title>Syntax-aware aspect level sentiment classification with graph attention networks</article-title>,&#x201D; in <conf-name>Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP)</conf-name>, <publisher-loc>Hong Kong, China</publisher-loc>, <year>2019</year>, pp. <fpage>5469</fpage>&#x2013;<lpage>5477</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mensah</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mao</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Aspect-level sentiment analysis via convolution over dependency tree</article-title>,&#x201D; in <conf-name>Proc. 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Joint Conf. Nat. Lang. Process. (EMNLP-IJCNLP)</conf-name>, <publisher-loc>Hong Kong, China</publisher-loc>, <year>2019</year>, pp. <fpage>5679</fpage>&#x2013;<lpage>5688</lpage>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>Long short-term memory</article-title>,&#x201D; <source>Neural Comput.</source>, vol. <volume>9</volume>, no. <issue>8</issue>, pp. <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>, <year>1997</year>. doi: <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>; <pub-id pub-id-type="pmid">9377276</pub-id></mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>Q.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification</article-title>,&#x201D; in <conf-name>Proc. 58th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <year>2020</year>, pp. <fpage>3229</fpage>&#x2013;<lpage>3238</lpage>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Quan</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Relational graph attention network for aspect-based sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. 58th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <year>2020</year>, pp. <fpage>6578</fpage>&#x2013;<lpage>6588</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Zadeh</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Poria</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Cambria</surname></string-name>, and <string-name><given-names>L. P.</given-names> <surname>Morency</surname></string-name></person-group>, &#x201C;<article-title>Tensor fusion network for multimodal sentiment analysis</article-title>,&#x201D; <year>2017</year>, <italic>arXiv:1707.07250</italic>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Poria</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Hazarika</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Majumder</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Mihalcea</surname></string-name></person-group>, &#x201C;<article-title>Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research</article-title>,&#x201D; <source>IEEE Trans. Affect. Comput.</source>, vol. <volume>14</volume>, no. <issue>1</issue>, pp. <fpage>108</fpage>&#x2013;<lpage>132</lpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1109/TAFFC.2020.3038167</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>,&#x201D; <year>2014</year>, <italic>arXiv:1412.3555</italic>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Bertero</surname></string-name>, <string-name><given-names>F. B.</given-names> <surname>Siddique</surname></string-name>, <string-name><given-names>C. S.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wan</surname></string-name>, <string-name><given-names>R. H. Y.</given-names> <surname>Chan</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Fung</surname></string-name></person-group>, &#x201C;<article-title>Real-time speech emotion and sentiment recognition for interactive dialogue systems</article-title>,&#x201D; in <conf-name>Proc. 2016 Conf. Empir. Methods Nat. Lang. Process.</conf-name>, <publisher-loc>Austin, TX, USA</publisher-loc>, <year>2016</year>, pp. <fpage>1042</fpage>&#x2013;<lpage>1047</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <source>Proc. Adv. Neural Inf. Process. Syst.</source>, <publisher-loc>Long Beach, CA, USA</publisher-loc>, <year>2017</year>, pp. <fpage>6000</fpage>&#x2013;<lpage>6010</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Poria</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Cambria</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Gelbukh</surname></string-name></person-group>, &#x201C;<article-title>Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. 2015 Conf. Empir. Methods Nat. Lang. Process.</conf-name>, <publisher-loc>Lisbon, Portugal</publisher-loc>, <year>2015</year>, pp. <fpage>2539</fpage>&#x2013;<lpage>2544</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Poria</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Cambria</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Hazarika</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Majumder</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Zadeh</surname></string-name>, and <string-name><given-names>L. P.</given-names> <surname>Morency</surname></string-name></person-group>, &#x201C;<article-title>Context-dependent sentiment analysis in user-generated videos</article-title>,&#x201D; in <conf-name>Proc. 55th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Vancouver, BC, Canada</publisher-loc>, <year>2017</year>, pp. <fpage>873</fpage>&#x2013;<lpage>883</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>P. P.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Zadeh</surname></string-name>, and <string-name><given-names>L. P.</given-names> <surname>Morency</surname></string-name></person-group>, &#x201C;<article-title>Multimodal language analysis with recurrent multistage fusion</article-title>,&#x201D; <year>2018</year>, <italic>arXiv:1808.03920</italic>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Busso</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Analysis of emotion recognition using facial expressions, speech and multimodal information</article-title>,&#x201D; in <conf-name>Proc. 6th Int. Conf. Multimodal Interf.</conf-name>, <publisher-loc>State College, PA, USA</publisher-loc>, <year>2004</year>, pp. <fpage>205</fpage>&#x2013;<lpage>211</lpage>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C. C.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Mower</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Busso</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lee</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Narayanan</surname></string-name></person-group>, &#x201C;<article-title>Emotion recognition using a hierarchical binary decision tree approach</article-title>,&#x201D; <source>Speech Commun.</source>, vol. <volume>53</volume>, no. <issue>9&#x2013;10</issue>, pp. <fpage>1162</fpage>&#x2013;<lpage>1171</lpage>, <year>2011</year>. doi: <pub-id pub-id-type="doi">10.1016/j.specom.2011.06.004</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Castro</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Hazarika</surname></string-name>, <string-name><given-names>V.</given-names> <surname>P&#x00E9;rez-Rosas</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zimmermann</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Mihalcea</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Poria</surname></string-name></person-group>, &#x201C;<article-title>Towards multimodal sarcasm detection (an _obviously_ perfect paper)</article-title>,&#x201D; <year>2019</year>, <italic>arXiv:1906.01815</italic>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Cai</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Wan</surname></string-name></person-group>, &#x201C;<article-title>Multi-modal sarcasm detection in Twitter with hierarchical fusion model</article-title>,&#x201D; in <conf-name>Proc. 57th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Florence, Italy</publisher-loc>, <year>2019</year>, pp. <fpage>2506</fpage>&#x2013;<lpage>2515</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>F. X.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>Y. Y.</given-names> <surname>Chen</surname></string-name>, and <string-name><given-names>S. F.</given-names> <surname>Chang</surname></string-name></person-group>, &#x201C;<article-title>Object-based visual sentiment concept analysis and application</article-title>,&#x201D; in <conf-name>Proc. 22nd ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <year>2014</year>, pp. <fpage>367</fpage>&#x2013;<lpage>376</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>She</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>M. M.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>P. L.</given-names> <surname>Rosin</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Visual sentiment prediction based on automatic discovery of affective regions</article-title>,&#x201D; <source>IEEE Trans. Multimed.</source>, vol. <volume>20</volume>, no. <issue>9</issue>, pp. <fpage>2513</fpage>&#x2013;<lpage>2525</lpage>, <year>2018</year>. doi: <pub-id pub-id-type="doi">10.1109/TMM.2018.2803520</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>You</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jin</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Luo</surname></string-name></person-group>, &#x201C;<article-title>Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks</article-title>,&#x201D; in <conf-name>Proc. 24th ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <year>2016</year>, pp. <fpage>1008</fpage>&#x2013;<lpage>1017</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Mao</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>A co-memory network for multimodal sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. 41st Int. ACM SIGIR Conf. Res. Dev. Inf. Retriev.</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <year>2018</year>, pp. <fpage>929</fpage>&#x2013;<lpage>932</lpage>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Kumar</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Garg</surname></string-name></person-group>, &#x201C;<article-title>Sentiment analysis of multimodal Twitter data</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>78</volume>, no. <issue>17</issue>, pp. <fpage>24103</fpage>&#x2013;<lpage>24119</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1007/s11042-019-7390-1</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Mao</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Multi-interactive memory network for aspect based multimodal sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. AAAI Conf. Artif. Intell.</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, <year>2019</year>, vol. <volume>33</volume>, pp. <fpage>371</fpage>&#x2013;<lpage>378</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Xia</surname></string-name></person-group>, &#x201C;<article-title>Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification</article-title>,&#x201D; <source>IEEE/ACM Trans. Audio, Speech, Lang. Process.</source>, vol. <volume>28</volume>, pp. <fpage>429</fpage>&#x2013;<lpage>439</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/TASLP.2019.2957872</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name></person-group>, &#x201C;<article-title>Adapting BERT for target-oriented multimodal sentiment classification</article-title>,&#x201D; in <conf-name>Proc. 28th Int. Joint Conf. Artif. Intell.</conf-name>, <publisher-loc>Macao, China</publisher-loc>, <year>2019</year>, pp. <fpage>5408</fpage>&#x2013;<lpage>5414</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, and <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name></person-group>, &#x201C;<article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,&#x201D; <year>2018</year>, <italic>arXiv:1810.04805</italic>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Khan</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Fu</surname></string-name></person-group>, &#x201C;<article-title>Exploiting BERT for multimodal target sentiment classification through input space translation</article-title>,&#x201D; in <conf-name>Proc. 29th ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <year>2021</year>, pp. <fpage>3034</fpage>&#x2013;<lpage>3042</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Long</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Huang</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Learning from adjective-noun pairs: A knowledge-enhanced framework for target-oriented multimodal sentiment classification</article-title>,&#x201D; in <conf-name>Proc. 29th Int. Conf. Comput. Linguist.</conf-name>, <publisher-loc>Gyeongju, Republic of Korea</publisher-loc>, <year>2022</year>, pp. <fpage>6784</fpage>&#x2013;<lpage>6794</lpage>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Chi</surname></string-name></person-group>, &#x201C;<article-title>Multimodal aspect extraction with region-aware alignment network</article-title>,&#x201D; in <conf-name>Proc. Nat. Lang. Process. Chin. Comput.</conf-name>, <publisher-loc>Zhengzhou, China</publisher-loc>, <year>2020</year>, pp. <fpage>145</fpage>&#x2013;<lpage>156</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Yu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Xia</surname></string-name></person-group>, &#x201C;<article-title>Improving multimodal named entity recognition via entity span detection with unified multimodal transformer</article-title>,&#x201D; in <conf-name>Proc. 58th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Seattle, WA, USA</publisher-loc>, <year>2020</year>, pp. <fpage>3342</fpage>&#x2013;<lpage>3352</lpage>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>H. F.</given-names> <surname>Leung</surname></string-name>, and <string-name><given-names>Q.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts</article-title>,&#x201D; in <conf-name>Proc. 28th ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <year>2020</year>, pp. <fpage>1038</fpage>&#x2013;<lpage>1046</lpage>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Jia</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding</article-title>,&#x201D; in <conf-name>Proc. AAAI Conf. Artif. Intell.</conf-name>, <publisher-loc>Washington, DC, USA</publisher-loc>, <year>2023</year>, vol. <volume>37</volume>, no. <issue>7</issue>, pp. <fpage>8032</fpage>&#x2013;<lpage>8040</lpage>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v37i7.25971</pub-id>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, and <string-name><given-names>D. T.</given-names> <surname>Vo</surname></string-name></person-group>, &#x201C;<article-title>Neural networks for open domain targeted sentiment</article-title>,&#x201D; in <conf-name>Proc. 2015 Conf. Empir. Methods Nat. Lang. Process.</conf-name>, <publisher-loc>Lisbon, Portugal</publisher-loc>, <year>2015</year>, pp. <fpage>612</fpage>&#x2013;<lpage>621</lpage>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Bing</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>Lam</surname></string-name></person-group>, &#x201C;<article-title>A unified model for opinion target extraction and target sentiment prediction</article-title>,&#x201D; in <conf-name>Proc. AAAI Conf. Artif. Intell.</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, <year>2019</year>, vol. <volume>33</volume>, pp. <fpage>6714</fpage>&#x2013;<lpage>6721</lpage>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Qian</surname></string-name></person-group>, &#x201C;<article-title>Relation-aware collaborative learning for unified aspect-based sentiment analysis</article-title>,&#x201D; in <conf-name>Proc. 58th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Seattle, WA, USA</publisher-loc>, <year>2020</year>, pp. <fpage>3685</fpage>&#x2013;<lpage>3694</lpage>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>E. F.</given-names> <surname>Tjong Kim Sang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Veenstra</surname></string-name></person-group>, &#x201C;<article-title>Representing text chunks</article-title>,&#x201D; <year>1999</year>, <italic>arXiv:cs/9907006</italic>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>,&#x201D; <year>2019</year>, <italic>arXiv:1907.11692</italic>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Deep residual learning for image recognition</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recognit.</conf-name>, <publisher-loc>Las Vegas, NV, USA</publisher-loc>, <year>2016</year>, pp. <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Simonyan</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Zisserman</surname></string-name></person-group>, &#x201C;<article-title>Very deep convolutional networks for large-scale image recognition</article-title>,&#x201D; <year>2014</year>, <italic>arXiv:1409.1556</italic>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. H. H.</given-names> <surname>Tsai</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>P. P.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>J. Z.</given-names> <surname>Kolter</surname></string-name>, <string-name><given-names>L. P.</given-names> <surname>Morency</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name></person-group>, &#x201C;<article-title>Multimodal transformer for unaligned multimodal language sequences</article-title>,&#x201D; in <conf-name>Proc. 57th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Florence, Italy</publisher-loc>, <year>2019</year>, pp. <fpage>6558</fpage>&#x2013;<lpage>6569</lpage>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J. L.</given-names> <surname>Ba</surname></string-name>, <string-name><given-names>J. R.</given-names> <surname>Kiros</surname></string-name>, and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>Layer normalization</article-title>,&#x201D; <year>2016</year>, <italic>arXiv:1607.06450</italic>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Borth</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Darrell</surname></string-name>, and <string-name><given-names>S. F.</given-names> <surname>Chang</surname></string-name></person-group>, &#x201C;<article-title>DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks</article-title>,&#x201D; <year>2014</year>, <italic>arXiv:1410.8586</italic>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lafferty</surname></string-name>, <string-name><given-names>A.</given-names> <surname>McCallum</surname></string-name>, and <string-name><given-names>F. C.</given-names> <surname>Pereira</surname></string-name></person-group>, &#x201C;<article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>,&#x201D; in <conf-name>Proc. 18th Int. Conf. Mach. Learn.</conf-name>, <publisher-loc>Williamstown, MA, USA</publisher-loc>, <year>2001</year>, pp. <fpage>282</fpage>&#x2013;<lpage>289</lpage>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Lv</surname></string-name></person-group>, &#x201C;<article-title>Open-domain targeted sentiment analysis via span-based extraction and classification</article-title>,&#x201D; <year>2019</year>, <italic>arXiv:1906.03820</italic>.</mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tian</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Song</surname></string-name></person-group>, &#x201C;<article-title>Joint aspect extraction and sentiment analysis with directional graph convolutional networks</article-title>,&#x201D; in <conf-name>Proc. 28th Int. Conf. Comput. Linguist.</conf-name>, <publisher-loc>Barcelona, Spain</publisher-loc>, <year>2020</year>, pp. <fpage>272</fpage>&#x2013;<lpage>279</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>