<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">66532</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.066532</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>VRCL: A Discrimination Detection Method for Multilingual and Multimodal Information</article-title>
<alt-title alt-title-type="left-running-head">VRCL: A Discrimination Detection Method for Multilingual and Multimodal Information</alt-title>
<alt-title alt-title-type="right-running-head">VRCL: A Discrimination Detection Method for Multilingual and Multimodal Information</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zhang</surname><given-names>Kejun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Li</surname><given-names>Meijiao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>20232923@mail.besti.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Cheng</surname><given-names>Jiahao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Jun</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Yang</surname><given-names>Ying</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Cyberspace Security, Beijing Electronic Science and Technology Institute</institution>, <addr-line>Beijing, 100070</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Information and Cybersecurity, The State Information Center</institution>, <addr-line>Beijing, 100045</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Meijiao Li. Email: <email>20232923@mail.besti.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>29</day><month>08</month><year>2025</year>
</pub-date>
<volume>85</volume>
<issue>1</issue>
<fpage>1019</fpage>
<lpage>1035</lpage>
<history>
<date date-type="received">
<day>10</day>
<month>4</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>6</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_66532.pdf"></self-uri>
<abstract>
<p>With the rapid growth of the Internet and social media, information is widely disseminated in multimodal forms, such as text and images, where discriminatory content can manifest in various ways. Discrimination detection techniques for multilingual and multimodal data can identify potential discriminatory behavior and help foster a more equitable and inclusive cyberspace. However, existing methods often struggle in complex contexts and multilingual environments. To address these challenges, this paper proposes an innovative detection method, using image and multilingual text encoders to separately extract features from different modalities. It continuously updates a historical feature memory bank, aggregates the Top-K most similar samples, and utilizes a Gated Recurrent Unit (GRU) to integrate current and historical features, generating enhanced feature representations with stronger semantic expressiveness to improve the model&#x2019;s ability to capture discriminatory signals. Experimental results demonstrate that the proposed method exhibits superior discriminative power and detection accuracy in multilingual and multimodal contexts, offering a reliable and effective solution for identifying discriminatory content.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Multimodal</kwd>
<kwd>multilingual</kwd>
<kwd>discriminatory content</kwd>
<kwd>hate memes</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Open Foundation of Key Laboratory of Cyberspace Security, Ministry of Education</funding-source>
<award-id>KLCS20240210</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Discriminatory content refers to negative attacks or expressions of prejudice targeting an individual or group based on identity characteristics such as race, gender, religion, and language. With the rise of social media and the global internet, discriminatory content has evolved beyond textual forms, exhibiting complex multimodal and multilingual features, including the integration of images and text. This presents a significant challenge to the harmony and social equality of online communities.</p>
<p>Traditional detection methods primarily rely on keyword matching and user metadata analysis [<xref ref-type="bibr" rid="ref-1">1</xref>]. The former is susceptible to vocabulary variations and requires frequent updates, while the latter may introduce biases. In addition, rule-based detection methods capture potential discriminatory expressions by establishing syntactic or semantic rules [<xref ref-type="bibr" rid="ref-2">2</xref>]. However, this approach still has limitations in identifying implicit discrimination. Sentiment analysis, as a traditional method, evaluates the presence of negative or hostile emotions in text by analyzing emotional tendencies but struggles with sarcasm or indirect discriminatory content [<xref ref-type="bibr" rid="ref-3">3</xref>]. Meanwhile, some studies have explored emotion recognition in images; for example, Bhavana et al. [<xref ref-type="bibr" rid="ref-4">4</xref>] proposed a convolutional neural network (CNN)-based method that leverages CNNs&#x2019; hierarchical feature learning to extract emotional features from raw images, demonstrating CNNs&#x2019; strong ability to understand image content. Social network analysis identifies the spread of discriminatory content by monitoring user interactions and information propagation patterns, which is especially effective in capturing group-based discriminatory behavior. Behavioral analysis methods enhance detection accuracy by predicting potential discriminatory actions through the tracking of user interactions and historical behavior [<xref ref-type="bibr" rid="ref-5">5</xref>]. As technology progresses, machine learning-based approaches, such as Naive Bayes [<xref ref-type="bibr" rid="ref-6">6</xref>] and Support Vector Machines (SVM) [<xref ref-type="bibr" rid="ref-7">7</xref>], have shown promise in detecting discriminatory content.</p>
<p>Recently, the use of deep learning techniques has further propelled advancements in the field. For instance, Rodr&#x00ED;guez-S&#x00E1;nchez [<xref ref-type="bibr" rid="ref-8">8</xref>] applied Bidirectional Long Short-Term Memory (Bi-LSTM) and Multilingual BERT (mBERT) models to enhance classification performance on multilingual datasets. Despite these advancements, existing methods remain limited, primarily addressing text-image interactions within single-language environments and neglecting combined multimodal and multilingual scenarios. As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the same image content exhibits significant differences in detection results in five different countries. This variation indicates that identical multimodal content can be perceived as offensive and discriminatory in certain cultural contexts while being deemed acceptable in others. Such cross-cultural differences underscore the complexities and challenges inherent in detecting discriminatory content. Therefore, automated detection of complex multimodal and multilingual discriminatory content has become the focus of current research.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Detection results for five image contents vary significantly across five countries: the United States, Germany, Mexico, India, and China</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66532-fig-1.tif"/>
</fig>
<p>To address these challenges, this paper proposes a multimodal, multilingual method for discriminatory content detection, aimed at capturing cross-modal relationships between images and text while accounting for language-specific discrimination characteristics in multilingual contexts. The key contributions are as follows:
<list list-type="bullet">
<list-item>
<p>Integration of Cross-lingual Language Model RoBERTa (XLM-R) [<xref ref-type="bibr" rid="ref-9">9</xref>] and Vision Transformer (ViT) Models [<xref ref-type="bibr" rid="ref-10">10</xref>]: The XLM-R model extracts multilingual text features, while the ViT model extracts image features. These features are then aligned and fused using a cross-attention mechanism to capture the semantic relationships between image and text.</p></list-item>
<list-item>
<p>Dynamic Memory-based Discriminatory Signal Detection: This method dynamically updates a historical feature memory bank, aggregates the top-K similar samples, and generates a fused historical feature representation. A Gated Recurrent Unit (GRU) [<xref ref-type="bibr" rid="ref-11">11</xref>] module integrates both historical and current features, generating enhanced features with stronger semantic expressiveness.</p></list-item>
<list-item>
<p>Comprehensive Validation: Experiments conducted on multimodal multilingual datasets demonstrate that the proposed method surpasses existing models in precision, recall, and F1 score, highlighting its effectiveness in discriminatory content detection tasks.</p></list-item>
</list></p>
<p>The rest of this paper is organized into six sections. <xref ref-type="sec" rid="s2">Section 2</xref> briefly reviews related work, including multimodal discrimination detection and multilingual discrimination detection. <xref ref-type="sec" rid="s3">Section 3</xref> introduces the VRCL model, including the Multilingual and Multimodal Feature Extraction Module, the Multimodal Feature Alignment and Fusion Module, the Label-Guided Contrastive Learning Module, and the Dynamic Memory-based Discrimination Detection Module. <xref ref-type="sec" rid="s4">Section 4</xref> presents the experimental datasets and baseline models. <xref ref-type="sec" rid="s5">Section 5</xref> analyzes the experimental results. <xref ref-type="sec" rid="s6">Section 6</xref> provides visualization of the experimental results. Finally, <xref ref-type="sec" rid="s7">Section 7</xref> concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Multimodal Discrimination Detection</title>
<p>Multimodal discrimination detection integrates image and text information to comprehensively identify discriminatory content. It addresses the limitations of single-modality approaches, which often fail to capture the complexity of discriminatory information. For instance, Kiela [<xref ref-type="bibr" rid="ref-12">12</xref>] proposed an early fusion method combining image and text modalities to detect discriminatory internet memes. Ma [<xref ref-type="bibr" rid="ref-13">13</xref>] utilized a self-supervised label generation module alongside Bidirectional Encoder Representations from Transformers (BERT) and Residual Network (ResNet) models to enhance feature learning without requiring additional annotations. Similarly, Chen and Pan [<xref ref-type="bibr" rid="ref-14">14</xref>] implemented the OSCAR&#x002B; model with Optical Character Recognition (OCR) technology to improve detection performance.</p>
<p>Nie et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed MAGIC, a multimodal dialogue system that interprets user intent within multimodal contexts and dynamically determines the response type and modality. This context-aware and modality-adaptive approach provides useful inspiration for improving multimodal discrimination detection in complex semantic environments.</p>
<p>Furthermore, models such as HateCLIPper [<xref ref-type="bibr" rid="ref-16">16</xref>] and InterCLIP-MEP [<xref ref-type="bibr" rid="ref-17">17</xref>] explored various modality interactions in Contrastive Language-Image Pre-training (CLIP)&#x2019;s visual and linguistic representations to address the challenges of hate meme detection. Recent advances in multimodal pre-trained models, including Vision-and-Language BERT (ViLBERT) [<xref ref-type="bibr" rid="ref-18">18</xref>], VisualBERT [<xref ref-type="bibr" rid="ref-19">19</xref>] and Universal Image-Text Representation (UNITER) [<xref ref-type="bibr" rid="ref-20">20</xref>], have employed Transformer architectures to enhance cross-modal interactions, significantly boosting task performance. Nevertheless, challenges persist, such as modality heterogeneity, noise in training data, and the diverse manifestations of discriminatory content. Moreover, the lack of fine-grained annotations in existing datasets hinders models&#x2019; ability to process metaphorical and contextually complex discrimination.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Multilingual Discrimination Detection</title>
<p>Multilingual discrimination detection focuses on identifying discriminatory content across diverse languages and cultural contexts. Multilingual pre-trained models, such as mBERT, XLM-R, and multilingual Text-to-Text Transfer Transformer (mT5) [<xref ref-type="bibr" rid="ref-21">21</xref>], offer an effective solution for cross-lingual detection by leveraging large-scale corpora to capture both commonalities and differences between languages. However, these models still generalize poorly to low-resource languages and to implicit discrimination, because linguistic expression and cultural context vary across languages. Montariol [<xref ref-type="bibr" rid="ref-22">22</xref>] enhanced the models&#x2019; ability to transfer across languages by introducing auxiliary tasks such as sentiment analysis and named entity recognition, although semantic differences between languages remain a significant challenge. R&#x00F6;ttger [<xref ref-type="bibr" rid="ref-23">23</xref>] introduced the MULTILINGUAL HATECHECK framework, which covers 10 languages and highlights the complexity of cross-cultural applications through 36,582 test cases, offering a more comprehensive perspective on multilingual discrimination detection.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>This section provides a comprehensive overview of the proposed discrimination detection method. The structural framework of the proposed method is illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Structural diagram for multilingual and multimodal discrimination detection</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66532-fig-2.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>Multilingual and Multimodal Feature Extraction Module</title>
<p>To enhance the cross-lingual generalization of CLIP, we replace its original text encoder with XLM-R, a multilingual pre-trained model covering 100 languages. Built on RoBERTa, XLM-R effectively captures multilingual semantics and performs well in low-resource settings, making it suitable for this task. Given a batch of image-text pairs <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup></mml:math></inline-formula>, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denote the image and text of sample <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>i</mml:mi></mml:math></inline-formula>, each text <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is encoded by XLM-R into a multilingual high-dimensional representation <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, as defined in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>XLM-R</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mtext>FFN</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>SelfAttention</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the word embedding matrix of the input text. SelfAttention (<inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mo>&#x22C5;</mml:mo></mml:math></inline-formula>) is the multi-head self-attention mechanism for modeling contextual dependencies, and FFN (<inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mo>&#x22C5;</mml:mo></mml:math></inline-formula>) is the feedforward neural network. The output <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is the high-dimensional text feature.</p>
<p>For visual encoding, we adopt the original Vision Transformer (ViT) backbone from CLIP. ViT splits images into fixed-size patches and applies Transformer layers to capture global semantics; it has been reported to outperform CNN-based models such as ResNet and ConvNeXt on a range of vision and multimodal tasks. The resulting global image feature is denoted as <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, as shown in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>ViT</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mtext>FFN</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>SelfAttention</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>PatchEmbedding</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The above process separately extracts image and multilingual text features, laying the foundation for the subsequent feature alignment and fusion of text and image modalities.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Multimodal Feature Alignment and Fusion Module</title>
<p>To resolve the dimensional mismatch between image and text features, a linear transformation maps the features of both modalities into a unified feature space, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The transformation is given in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:mspace width="1em" /><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></disp-formula></p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The process of aligning and fusing image and text features. Linear projections map the image and text features into the same feature space, yielding <italic>V</italic><sup><italic>&#x2032;</italic></sup> and <italic>T</italic><sup><italic>&#x2032;</italic></sup>. The cross-attention mechanism computes the attention weights of the image guiding the text (<inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>A</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula>) and of the text guiding the image (<inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>A</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>), generating the fused features <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>F</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>F</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>. A weighted sum then integrates these features into the final multimodal representation <italic>F</italic></title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66532-fig-3.tif"/>
</fig>
<p>where <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>W</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>W</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> are learnable linear transformation matrices, and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>b</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>b</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> are bias terms. <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant="double-struck">R</mml:mi></mml:mrow><mml:mi>d</mml:mi></mml:msup></mml:math></inline-formula> represent the image and text features after dimensional alignment.</p>
<p>After the feature dimensions are aligned according to <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>, a cross-attention mechanism enables interaction between image and text features. Specifically, the image feature <italic>V</italic><sup><italic>&#x2032;</italic></sup> serves as the Query and the text feature <italic>T</italic><sup><italic>&#x2032;</italic></sup> serves as the Key and Value. The attention weight <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>A</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula> of the image feature over the text feature makes the text feature focus on the parts related to the semantics of the image; similarly, the attention weight <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>A</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> of the text feature over the image feature guides the image feature to focus on the parts related to the semantics of the text. The attention weights for text guided by images are computed as shown in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>A</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:mi>d</mml:mi></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The attention weights for images guided by text are computed as shown in <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref>:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>A</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x22A4;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:mi>d</mml:mi></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The fused image and text features are then computed as shown in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref>:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msup><mml:mi>V</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p>In order to further refine the fused representation, a weighted summation integrates the image and text features into the final multimodal representation <italic>F</italic>. Learnable weighting coefficients <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> (<inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mi>&#x03B1;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>) dynamically adjust the contributions of each modality during the fusion process, ensuring a balanced representation of image and text information. The final fusion formula is given in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mi>F</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></disp-formula></p>
<p>This feature alignment and fusion strategy effectively mitigates the challenges posed by dimensional differences and insufficient cross-modal interactions in multilingual multimodal discrimination detection tasks.</p>
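<p>The alignment and fusion steps in Eqs. (3) to (7) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the projected features are read as column vectors of dimension d, so that the product inside the softmax is a d-by-d outer product; the softmax is assumed to act row-wise; and the weighting coefficient is passed in as a fixed scalar rather than learned. The helper name <monospace>align_and_fuse</monospace>, the toy dimensions, and the random weights are hypothetical.</p>
<code language="python"><![CDATA[
```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def align_and_fuse(v, t, W_v, b_v, W_t, b_t, alpha=0.5):
    """Hypothetical helper following Eqs. (3)-(7) for one image-text pair."""
    d = W_v.shape[0]
    V = W_v @ v + b_v                           # Eq. (3): image feature in R^d
    T = W_t @ t + b_t                           # Eq. (3): text feature in R^d
    A_v = softmax(np.outer(V, T) / np.sqrt(d))  # Eq. (4), assumed row-wise softmax
    A_t = softmax(np.outer(T, V) / np.sqrt(d))  # Eq. (5), same assumption
    F_v = A_v @ T                               # Eq. (6): image-guided text feature
    F_t = A_t @ V                               # Eq. (6): text-guided image feature
    beta = 1.0 - alpha                          # enforces alpha + beta = 1
    return alpha * F_v + beta * F_t             # Eq. (7)

# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_v, d_t, d = 12, 16, 8
v, t = rng.normal(size=d_v), rng.normal(size=d_t)
W_v, W_t = rng.normal(size=(d, d_v)), rng.normal(size=(d, d_t))
b_v, b_t = np.zeros(d), np.zeros(d)
F = align_and_fuse(v, t, W_v, b_v, W_t, b_t, alpha=0.6)
print(F.shape)  # (8,)
```
]]></code>
<p>Under this reading, each row of the attention matrix sums to one, so every entry of the image-guided feature is a convex combination of the entries of the projected text feature. A token-level variant, with one row per image patch or text token, would need an additional pooling step before the weighted sum; the source does not specify one.</p>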
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Label-Guided Contrastive Learning Module</title>
<p>Consider a batch of image-text pairs <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup></mml:math></inline-formula>, where <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the image of sample <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the text of sample <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>i</mml:mi></mml:math></inline-formula>. Each sample carries a binary label indicating whether it is discriminatory or non-discriminatory. Unlike traditional contrastive learning methods, this study takes as input not the original image and text features but the fused representation <italic>F</italic> obtained in <xref ref-type="sec" rid="s3_2">Section 3.2</xref>. The label information then determines which sample pairs are positive and which are negative, guiding the contrastive learning.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Definition of Positive and Negative Sample Pairs</title>
<p>Positive and negative sample pairs in contrastive learning are defined based on their labels:
<list list-type="bullet">
<list-item>
<p><italic><bold>Positive sample pairs:</bold></italic> Two samples <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>F</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> are considered positive if they share the same label (i.e., both are discriminatory or both are non-discriminatory).</p></list-item>
<list-item>
<p><italic><bold>Negative sample pairs:</bold></italic> Two samples <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>F</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> are considered negative if their labels differ (i.e., one is discriminatory and the other is non-discriminatory).</p></list-item>
</list></p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Contrastive Learning Loss Function</title>
<p>In the contrastive learning framework, the model strengthens the expressiveness of the multimodal features by maximizing the similarity of positive sample pairs and minimizing the similarity of negative sample pairs, which helps it separate discriminatory from non-discriminatory samples. The method uses the label-guided contrastive loss defined in <xref ref-type="disp-formula" rid="eqn-8">Eq. (8)</xref>:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>sup</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle mathsize="0.7em"><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mo>[</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>sim</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>sim</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mstyle></mml:math></disp-formula>where <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the fused feature representation of sample <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>i</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>F</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:math></inline-formula> denotes the fused feature representation of a positive sample, i.e., a sample that shares the same label as <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>F</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>. The function <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mtext>sim</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> computes the cosine similarity between two feature representations. The temperature parameter <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi>&#x03C4;</mml:mi></mml:math></inline-formula> controls the smoothness of the similarity distribution. <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the set of all positive samples of sample <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:mi>i</mml:mi></mml:math></inline-formula>.</p>
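<p>The loss in Eq. (8) can be illustrated with a minimal pure-Python sketch. It is not the authors&#x2019; implementation: the function names, the toy features, and the temperature value 0.5 are hypothetical, and the sketch assumes that the positive set excludes the anchor itself and that anchors without positives are skipped, neither of which is stated in the text.</p>
<code language="python"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_guided_contrastive_loss(features, labels, tau=0.5):
    """Eq. (8) as written: the denominator sums over all k = 1..N.

    Assumptions (not stated in the text): P(i) excludes i itself,
    and anchors with an empty P(i) contribute nothing.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        denom = sum(math.exp(cosine(features[i], features[k]) / tau)
                    for k in range(n))
        per_anchor = 0.0
        for p in positives:
            numer = math.exp(cosine(features[i], features[p]) / tau)
            per_anchor += -math.log(numer / denom)
        total += per_anchor / len(positives)
    return total


# Toy batch: label 1 = discriminatory, label 0 = non-discriminatory.
labels = [1, 1, 0, 0]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # classes separated
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # classes interleaved
loss_clustered = label_guided_contrastive_loss(clustered, labels)
loss_mixed = label_guided_contrastive_loss(mixed, labels)
```
]]></code>
<p>In this toy batch the loss is lower when same-label features already point in similar directions, which is the behavior the loss is meant to reward.</p>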
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Dynamic Memory-Based Discrimination Detection Module</title>
<p>The complex discrimination signal detection process based on the dynamic memory mechanism is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. This module is based on the fused multimodal features obtained from the multimodal feature alignment and fusion module. By constructing a memory bank that stores the feature representations of historical samples along with their label information, it enables similarity retrieval and similarity-weighted aggregation between the current sample and historical samples. Specifically, the memory bank <italic>M</italic> storing historical samples is continuously updated. Cosine similarity measures the similarity between the current sample <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> and historical samples <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> within <italic>M</italic>. The Top-K most relevant samples are retrieved and aggregated using similarity-weighted values to generate enhanced feature representations <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>. The concatenated current and enhanced features are then fed into the gated recurrent unit (GRU) module, which dynamically integrates historical and current features through update and reset gates, producing the final representation. These integrated features are subsequently classified by a Softmax classifier to predict category labels. 
Finally, the prediction results and feature representations are used to update <italic>M</italic>, ensuring continuous learning and adaptation.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Illustration of the discrimination signal detection process using a dynamic memory mechanism</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66532-fig-4.tif"/>
</fig>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Establishment of Dynamic Memory Bank</title>
<p>During the model training process, a Memory Bank <italic>M</italic> is constructed to store key information, including feature representations of historical samples, true labels, and predicted entropy values, as defined in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2223;</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> denotes the feature representation of historical sample <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>i</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msup><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:math></inline-formula> is the true label of sample <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>i</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the predicted 
entropy value corresponding to the model&#x2019;s uncertainty for that sample.</p>
<p>The predicted entropy value measures how dispersed the model&#x2019;s output probability distribution is across the categories and is calculated as shown in <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>C</mml:mi></mml:munderover><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <italic>C</italic> denotes the number of categories, and <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> represents the probability that the model assigns to category <italic>i</italic>.</p>
<p>Predicted entropy reflects the model&#x2019;s confidence in its output: higher entropy signifies greater uncertainty and therefore a more ambiguous prediction. Accordingly, when the memory bank reaches its capacity, it is updated based on the predicted entropy values, with preference given to retaining samples with lower entropy. The memory bank thus keeps samples that the model predicts with higher confidence, which offers more reliable references for subsequent feature enhancement and semantic analysis.</p>
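<p>The entropy computation in Eq. (10) and the entropy-based update can be sketched as follows. The class name <monospace>MemoryBank</monospace>, its methods, and the exact eviction rule (drop the single highest-entropy entry once capacity is exceeded) are assumptions made for illustration; the text only states that lower-entropy samples are preferentially retained.</p>
<code language="python"><![CDATA[
```python
import math


def prediction_entropy(probs):
    """Eq. (10): H = -sum_i P_i log(P_i), treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class MemoryBank:
    """Fixed-capacity store of (feature, label, entropy) entries, cf. Eq. (9)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def add(self, feature, label, probs):
        self.entries.append((feature, label, prediction_entropy(probs)))
        if len(self.entries) > self.capacity:
            # Assumed eviction rule: remove the most uncertain entry.
            worst = max(range(len(self.entries)),
                        key=lambda j: self.entries[j][2])
            self.entries.pop(worst)


bank = MemoryBank(capacity=2)
bank.add([1.0, 0.0], 1, [0.05, 0.95])  # confident prediction
bank.add([0.5, 0.5], 0, [0.50, 0.50])  # maximally uncertain prediction
bank.add([0.0, 1.0], 0, [0.90, 0.10])  # confident prediction
kept_labels = [label for _, label, _ in bank.entries]
```
]]></code>
<p>After the third insertion the bank exceeds its capacity, and the entry with the uniform prediction, whose entropy is log 2, is the one that is removed.</p>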
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Similarity Calculation and Retrieval</title>
<p>During the prediction phase, the current sample <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> is compared with the historical samples <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> in the memory bank to identify the most relevant samples for feature enhancement. Cosine similarity is employed to measure the similarity between samples, as defined in <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Based on the similarity scores, the top-<italic>K</italic> most relevant historical samples are selected to form the retrieval neighborhood, as defined in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo>&#x2223;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>Top-K</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></disp-formula></p>
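<p>As a rough illustration of Eqs. (11) and (12), the retrieval step can be written as a brute-force scan over the memory bank. The helper names and the toy memory contents below are hypothetical, and a practical system with a large bank would likely use a vectorized or approximate nearest-neighbor search instead.</p>
<code language="python"><![CDATA[
```python
import math


def cosine(u, v):
    """Eq. (11): cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, memory, k):
    """Eq. (12): the k stored (feature, label) pairs most similar to the query.

    Returns (similarity, feature, label) triples, most similar first.
    """
    scored = [(cosine(query, feat), feat, label) for feat, label in memory]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical memory bank of (feature, label) pairs.
memory = [
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
    ([0.8, 0.6], 1),
    ([-1.0, 0.0], 0),
]
neighbours = retrieve_top_k([1.0, 0.1], memory, k=2)
neighbour_labels = [label for _, _, label in neighbours]
```
]]></code>
<p>For the query above, the two retrieved neighbors are the stored features that point in nearly the same direction, and both carry label 1.</p>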
</sec>
<sec id="s3_4_3">
<label>3.4.3</label>
<title>Feature Aggregation</title>
<p>To ensure that more similar samples contribute more, weights are computed from the similarity scores, and the retrieved features are combined by a weighted average to obtain the enhanced feature <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula>, as defined in <xref ref-type="disp-formula" rid="eqn-13">Eq. (13)</xref>:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></disp-formula>where the weight is computed based on the similarity between samples, as defined in <xref ref-type="disp-formula" rid="eqn-14">Eq. (14)</xref>:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is a similarity-based weight that ensures that samples with higher similarity contribute more. 
This feature aggregation method can fully utilize the information from historical samples to supplement the semantic information that may be missing in the current samples, and is especially suitable for capturing cross-sample features with strong semantic relevance.</p>
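<p>The aggregation in Eqs. (13) and (14) amounts to a softmax over the retrieved similarities followed by a weighted sum. The sketch below uses a hypothetical function name and made-up similarity scores, and it applies no temperature because none appears in Eq. (14).</p>
<code language="python"><![CDATA[
```python
import math


def aggregate(similarities, features):
    """Eq. (14): softmax weights; Eq. (13): weighted sum of neighbor features."""
    exps = [math.exp(s) for s in similarities]
    normalizer = sum(exps)
    weights = [e / normalizer for e in exps]
    dim = len(features[0])
    enhanced = [sum(w * f[d] for w, f in zip(weights, features))
                for d in range(dim)]
    return weights, enhanced


# Two retrieved neighbors with hypothetical similarities 0.9 and 0.1.
weights, enhanced = aggregate([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```
]]></code>
<p>Because cosine similarities lie in the interval from -1 to 1, the resulting weights stay fairly close to uniform; in this example the more similar neighbor receives a weight of about 0.69 rather than dominating the sum.</p>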
</sec>
<sec id="s3_4_4">
<label>3.4.4</label>
<title>Dynamic Fusion Based on GRU</title>
<p>In discrimination detection tasks, discriminatory signals often exhibit complex semantic dependencies across samples. For instance, certain discriminatory remarks can only be interpreted accurately when placed in the context of historical information. Current and historical samples carry different levels of feature information, and the dynamic fusion mechanism of the GRU helps the model reduce misclassifications and missed detections, especially for ambiguous samples involving sarcasm, puns, or cultural bias. Moreover, because expressions of discriminatory language and behavior evolve over time and across social contexts, the GRU lets the model keep learning the relationship between historical and current samples, so that it can adapt to new forms of discriminatory expression and generalize better.</p>
<p>The concatenation of the two feature sets serves as the input to the GRU, as defined in <xref ref-type="disp-formula" rid="eqn-15">Eq. (15)</xref>:
<disp-formula id="eqn-15"><label>(15)</label><mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>The GRU update process is defined in <xref ref-type="disp-formula" rid="eqn-16">Eqs. (16)</xref> to <xref ref-type="disp-formula" rid="eqn-19">(19)</xref>:
<disp-formula id="eqn-16"><label>(16)</label><mml:math id="mml-eqn-16" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-17"><label>(17)</label><mml:math id="mml-eqn-17" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-18"><label>(18)</label><mml:math id="mml-eqn-18" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-19"><label>(19)</label><mml:math id="mml-eqn-19" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> represent the update gate and reset gate, respectively. The operator <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes element-wise multiplication. The final fusion feature representation <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> dynamically combines the current features and historical features.</p>
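<p>A single update following Eqs. (15) to (19) can be traced with a small pure-Python sketch. Everything numeric here is hypothetical: the two-dimensional features, the constant weight matrices shared across the three gates, and the zero initial state, since the text does not specify how the previous hidden state is initialized.</p>
<code language="python"><![CDATA[
```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, params):
    """One GRU update; x plays the role of h_concat in Eqs. (16)-(19)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    # Eq. (16): update gate.
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(Wz, x), matvec(Uz, h_prev), bz)]
    # Eq. (17): reset gate.
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(Wr, x), matvec(Ur, h_prev), br)]
    # Eq. (18): candidate state computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b + c)
            for a, b, c in zip(matvec(Wh, x), matvec(Uh, gated), bh)]
    # Eq. (19): interpolate between previous state and candidate state.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]


current = [0.2, -0.4]            # stands in for the current fused feature
enhanced = [0.1, 0.3]            # stands in for the memory-enhanced feature
h_concat = current + enhanced    # Eq. (15): concatenation
hidden = 2
W = [[0.1] * len(h_concat) for _ in range(hidden)]  # hypothetical input weights
U = [[0.1] * hidden for _ in range(hidden)]         # hypothetical recurrent weights
b = [0.0] * hidden
h_prev = [0.0] * hidden          # assumption: zero initial hidden state
h_t = gru_step(h_concat, h_prev, (W, U, b, W, U, b, W, U, b))
```
]]></code>
<p>With a zero previous state, Eq. (19) reduces to the candidate state scaled by one minus the update gate, which makes the role of the update gate easy to check by hand.</p>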
<p>Classification Prediction: The dynamically fused feature <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is input into a classifier to generate the final prediction result, as defined in <xref ref-type="disp-formula" rid="eqn-20">Eq. (20)</xref>:
<disp-formula id="eqn-20"><label>(20)</label><mml:math id="mml-eqn-20" display="block"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>W</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:math></inline-formula> represents the weights of the classification head. The softmax function maps the model outputs to a probability distribution, producing a vector containing the probabilities of each class. In this task, we use a binary classification setting, where the softmax outputs two elements corresponding to the probabilities of the &#x201C;non-discrimination&#x201D; and &#x201C;discrimination&#x201D; classes, respectively. The class with the higher probability is taken as the model&#x2019;s prediction.</p>
<p>Specifically, the output vector takes the form
<disp-formula id="eqn-21"><label>(21)</label><mml:math id="mml-eqn-21" display="block"><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>non-discrimination</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:mtext>discrimination</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>As shown in <xref ref-type="disp-formula" rid="eqn-21">Eq. (21)</xref>, the closer <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mtext>discrimination</mml:mtext></mml:mrow></mml:msub></mml:math></inline-formula> is to 1, the more likely the model considers the sample to contain discriminatory content; conversely, if it is close to 0, the sample is more likely non-discriminatory.</p>
<p>During the training phase, the model is optimized using both cross-entropy loss and contrastive loss. The cross-entropy loss minimizes the discrepancy between the predicted and true labels, thereby improving classification accuracy. The model&#x2019;s classification loss function is defined in <xref ref-type="disp-formula" rid="eqn-22">Eq. (22)</xref>:
<disp-formula id="eqn-22"><label>(22)</label><mml:math id="mml-eqn-22" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
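<p>The classification step in Eqs. (20) and (21) and the loss in Eq. (22) can be sketched together. The logits below are made-up stand-ins for the classifier outputs, and the clipping constant <monospace>eps</monospace> is an added numerical safeguard that is not part of Eq. (22).</p>
<code language="python"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true, p_disc, eps=1e-12):
    """Eq. (22): mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_disc):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical classifier logits for two samples.
logits = [[2.0, -1.0], [-0.5, 1.5]]
probs = [softmax(row) for row in logits]   # [p_non-discrimination, p_discrimination]
p_disc = [row[1] for row in probs]
predictions = [1 if row[1] > row[0] else 0 for row in probs]
loss_matching = binary_cross_entropy([0, 1], p_disc)
loss_swapped = binary_cross_entropy([1, 0], p_disc)
```
]]></code>
<p>The loss is small when the true labels agree with the predicted classes and considerably larger when the same probabilities are scored against swapped labels.</p>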
<p>The contrastive loss improves the separability of the feature space by reducing the distance between positive sample features and increasing the distance between negative sample features, thereby enhancing the clustering effect within the feature space. Let the positive sample be denoted as <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msubsup><mml:mi>h</mml:mi><mml:mi>f</mml:mi><mml:mo>+</mml:mo></mml:msubsup></mml:math></inline-formula> and the negative sample as <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msubsup><mml:mi>h</mml:mi><mml:mi>f</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msubsup></mml:math></inline-formula>. The contrastive loss function is defined in <xref ref-type="disp-formula" rid="eqn-23">Eq. (23)</xref>:
<disp-formula id="eqn-23"><label>(23)</label><mml:math id="mml-eqn-23" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mi>f</mml:mi><mml:mo>+</mml:mo></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mi>f</mml:mi><mml:mo>+</mml:mo></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Similarity</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mi>f</mml:mi><mml:mo>&#x2212;</mml:mo></mml:msubsup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>The loss function updates the parameters of the GRU module and the classifier through backpropagation, guiding the model to learn a more discriminative dynamic representation. The total loss of the method is the sum of the cross-entropy loss and the contrastive loss, as shown in <xref ref-type="disp-formula" rid="eqn-24">Eq. (24)</xref>:
<disp-formula id="eqn-24"><label>(24)</label><mml:math id="mml-eqn-24" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
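<p>The three losses in Eqs. (22)&#x2013;(24) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, cosine similarity is an assumed choice for Similarity(&#x00B7;, &#x00B7;) (the text does not fix the measure), and a single positive and a single negative sample are used, as in Eq. (23).</p>
<code language="python"><![CDATA[
```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (22): mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def cosine_similarity(a, b):
    """Assumed similarity measure; not specified in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchor, positive, negative):
    """Eq. (23): negative log of the softmax weight on the positive pair."""
    pos = math.exp(cosine_similarity(anchor, positive))
    neg = math.exp(cosine_similarity(anchor, negative))
    return -math.log(pos / (pos + neg))


def total_loss(y_true, y_pred, anchor, positive, negative):
    """Eq. (24): unweighted sum of the two losses."""
    ce = binary_cross_entropy(y_true, y_pred)
    cl = contrastive_loss(anchor, positive, negative)
    return ce + cl
```
]]></code>
<p>When the positive and negative samples are equally similar to the anchor, the contrastive term equals log 2; it decreases as the positive sample becomes more similar to the anchor than the negative one.</p>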
</sec>
<sec id="s3_4_5">
<label>3.4.5</label>
<title>Memory Bank Update</title>
<p>The features and predicted labels of the current samples are saved into the memory bank, as shown in <xref ref-type="disp-formula" rid="eqn-25">Eq. (25)</xref>:
<disp-formula id="eqn-25"><label>(25)</label><mml:math id="mml-eqn-25" display="block"><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mo>&#x222A;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>f</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mi>y</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p>
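<p>The update in Eq. (25) can be illustrated with a small in-memory structure. The sketch below is an assumption-laden illustration, not the authors&#x2019; implementation: the class and method names are hypothetical, H is taken to be the binary entropy of the predicted probability, retrieval uses cosine similarity, and entries are simply appended without deduplication or a capacity limit.</p>
<code language="python"><![CDATA[
```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction with probability p (assumed form of H)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def cosine_similarity(a, b):
    """Assumed similarity measure for retrieval."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class MemoryBank:
    """Hypothetical store of (feature, label, entropy) triples, as in Eq. (25)."""

    def __init__(self):
        self.entries = []

    def update(self, feature, label, predicted_prob):
        # M = M U {(h_f, y, H(y_hat))}; duplicates are not filtered here.
        self.entries.append((list(feature), label, binary_entropy(predicted_prob)))

    def retrieve(self, query, k=1):
        # Return the k stored entries most similar to the query feature.
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query, entry[0]),
            reverse=True,
        )
        return ranked[:k]
```
]]></code>
<p>Storing the prediction entropy alongside each feature makes it possible to prefer confident historical samples at retrieval time; whether and how the method uses the entropy in this way is not shown in this sketch.</p>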
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<p>This section presents a comparative analysis of the performance of the baseline models and the proposed method. Experiments are conducted on two multilingual multimodal datasets, as shown in <xref ref-type="table" rid="table-1">Table 1</xref>. A variety of models are selected as baseline comparisons, including unimodal vision models, text models, and multimodal pre-trained models.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Dataset splitting methods and sample distribution of Multi<sup>3</sup>Hate and BHM</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Class</th>
<th colspan="3">Total</th>
<th>Train</th>
<th>Valid</th>
<th>Test</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">BHM</td>
<td>Hate</td>
<td colspan="3">2624</td>
<td>2117</td>
<td>241</td>
<td>266</td>
</tr>
<tr>
<td>Not-Hate</td>
<td colspan="3">4485</td>
<td>3641</td>
<td>399</td>
<td>445</td>
</tr>
<tr>
<td rowspan="10">Multi<sup>3</sup>Hate</td>
<td rowspan="5">Hate</td>
<td rowspan="5">864</td>
<td>US</td>
<td>153</td>
<td>122</td>
<td>31</td>
<td rowspan="5">/</td>
</tr>
<tr>
<td>DE</td>
<td>177</td>
<td>142</td>
<td>35</td>
</tr>
<tr>
<td>MX</td>
<td>165</td>
<td>132</td>
<td>33</td>
</tr>
<tr>
<td>IN</td>
<td>180</td>
<td>144</td>
<td>36</td>
</tr>
<tr>
<td>CN</td>
<td>189</td>
<td>151</td>
<td>38</td>
</tr>
<tr>
<td rowspan="5">Not-Hate</td>
<td rowspan="5">636</td>
<td>US</td>
<td>147</td>
<td>118</td>
<td>29</td>
<td rowspan="5">/</td>
</tr>
<tr>
<td>DE</td>
<td>123</td>
<td>98</td>
<td>25</td>
</tr>
<tr>
<td>MX</td>
<td>135</td>
<td>108</td>
<td>27</td>
</tr>
<tr>
<td>IN</td>
<td>120</td>
<td>96</td>
<td>24</td>
</tr>
<tr>
<td>CN</td>
<td>111</td>
<td>89</td>
<td>22</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-1fn1" fn-type="other">
<p>Note: For Multi<sup>3</sup>Hate, the numbers shown represent the sample counts in the training and validation sets for each fold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p><list list-type="bullet">
<list-item>
<p><bold>Multi<sup>3</sup>Hate dataset [<xref ref-type="bibr" rid="ref-24">24</xref>]: </bold>A multimodal, multilingual, and multicultural dataset designed for hate speech detection. It contains 300 parallel meme samples in each of five languages (English, German, Spanish, Hindi, and Chinese), for a total of 1500 examples. Each sample is annotated by at least five native speakers from diverse cultural backgrounds and classified as either hate speech or non-hate speech. To partition the dataset, a 5-fold cross-validation approach is used to maximize the utility of limited samples. Specifically, the 300 samples in each language are randomly divided into five subsets, with four subsets serving as the training set and one as the validation set in each iteration. This process is repeated over five rounds of training and evaluation. The final model performance is obtained by averaging the results from all validation rounds, thereby reducing bias introduced by dataset partitioning.</p></list-item>
<list-item>
<p><bold>BHM Dataset [<xref ref-type="bibr" rid="ref-25">25</xref>]:</bold> The BHM dataset is a multimodal collection designed for Bengali hate meme detection, consisting of 7109 samples gathered from public platforms such as Facebook, Instagram, Pinterest, and blogs. The dataset is divided into training, validation, and test sets in an 80%, 10%, and 10% split, respectively, to facilitate model training and evaluation. It contains content in Bengali and code-mixed text combining Bengali and English.</p></list-item>
</list></p>
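<p>The 5-fold protocol described for Multi<sup>3</sup>Hate can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the fixed seed, and the plain (non-stratified) random split are hypothetical choices, and the sample count is assumed to be divisible by the number of folds.</p>
<code language="python"><![CDATA[
```python
import random


def five_fold_splits(n_samples=300, n_folds=5, seed=0):
    """Return a list of (train_indices, valid_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, seed is arbitrary
    fold_size = n_samples // n_folds      # assumes an exact division
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

    splits = []
    for i in range(n_folds):
        valid = folds[i]
        # The remaining four folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, valid))
    return splits


def average_score(fold_scores):
    """Final reported score: mean over the validation rounds."""
    return sum(fold_scores) / len(fold_scores)
```
]]></code>
<p>With 300 samples per language, each round uses 240 samples for training and 60 for validation, which agrees with the per-language Train and Valid counts in <xref ref-type="table" rid="table-1">Table 1</xref> when the two classes are summed.</p>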
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Baselines</title>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Unimodal Models</title>
<p><list list-type="bullet">
<list-item>
<p>For the text-only models, we employed the classic TextCNN [<xref ref-type="bibr" rid="ref-26">26</xref>], the sequence model Bi-LSTM [<xref ref-type="bibr" rid="ref-27">27</xref>], and the multilingual pre-trained model mBERT [<xref ref-type="bibr" rid="ref-28">28</xref>].</p></list-item>
<list-item>
<p>For the visual-only models, we selected ResNet [<xref ref-type="bibr" rid="ref-29">29</xref>], ViT, and ConvNeXT [<xref ref-type="bibr" rid="ref-30">30</xref>] for comparison. ResNet and ConvNeXT represent convolutional image feature extractors, whereas ViT represents the Transformer-based approach.</p></list-item>
</list></p>
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Multimodal Models</title>
<p><list list-type="bullet">
<list-item>
<p>CLIP: CLIP is a multimodal model trained using a contrastive learning approach to process large-scale image-text paired data effectively. It has been extensively utilized in multimodal classification tasks.</p></list-item>
<list-item>
<p>ALBEF: ALBEF (Align Before Fuse) [<xref ref-type="bibr" rid="ref-31">31</xref>] is another multimodal model, which uses momentum distillation and contrastive learning for pre-training on noisy image-text data.</p></list-item>
<list-item>
<p>G<sup>2</sup>SAM: G<sup>2</sup>SAM [<xref ref-type="bibr" rid="ref-32">32</xref>] is a multimodal model that leverages a gated fusion mechanism and contrastive learning to jointly align and integrate visual and textual representations, enabling effective gender bias detection in multimodal content.</p></list-item>
</list></p>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Results</title>
<p><xref ref-type="table" rid="table-2">Table 2</xref> displays the experimental results for multimodal and multilingual discrimination detection tasks. Performance was evaluated on the BHM dataset and the Multi<sup>3</sup>Hate dataset across five languages: English (US), German (DE), Spanish (MX), Hindi (IN), and Chinese (CN). The experiments compared the performance of various models in visual, textual, and multimodal tasks, using Precision (P), Recall (R), and F1 score (F1) as the primary evaluation metrics.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Main results</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th rowspan="3">Approach</th>
<th rowspan="3">Models</th>
<th colspan="3">BHM</th>
<th colspan="15">Multi<sup>3</sup>Hate</th>
</tr>
<tr>
<th rowspan="2">P</th>
<th rowspan="2">R</th>
<th rowspan="2">F1</th>
<th colspan="3">US</th>
<th colspan="3">DE</th>
<th colspan="3">MX</th>
<th colspan="3">IN</th>
<th colspan="3">CN</th>
</tr>
<tr>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
<th>P</th>
<th>R</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Visual only</td>
<td>ViT</td>
<td>0.671</td>
<td>0.677</td>
<td>0.674</td>
<td>0.601</td>
<td>0.613</td>
<td>0.607</td>
<td>0.582</td>
<td>0.568</td>
<td>0.575</td>
<td>0.596</td>
<td>0.576</td>
<td>0.586</td>
<td>0.591</td>
<td>0.569</td>
<td>0.579</td>
<td>0.534</td>
<td>0.548</td>
<td>0.541</td>
</tr>
<tr>
<td>ResNet</td>
<td>0.612</td>
<td>0.544</td>
<td>0.576</td>
<td>0.577</td>
<td>0.591</td>
<td>0.584</td>
<td>0.563</td>
<td>0.544</td>
<td>0.553</td>
<td>0.558</td>
<td>0.570</td>
<td>0.564</td>
<td>0.551</td>
<td>0.533</td>
<td>0.542</td>
<td>0.528</td>
<td>0.537</td>
<td>0.532</td>
</tr>
<tr>
<td>ConvNeXT</td>
<td>0.692</td>
<td>0.699</td>
<td>0.695</td>
<td>0.686</td>
<td>0.679</td>
<td>0.682</td>
<td>0.597</td>
<td>0.621</td>
<td>0.609</td>
<td>0.627</td>
<td>0.611</td>
<td>0.619</td>
<td>0.604</td>
<td>0.628</td>
<td>0.616</td>
<td>0.618</td>
<td>0.598</td>
<td>0.608</td>
</tr>
<tr>
<td rowspan="3">Text only</td>
<td>TextCNN</td>
<td>0.601</td>
<td>0.621</td>
<td>0.611</td>
<td>0.638</td>
<td>0.617</td>
<td>0.627</td>
<td>0.576</td>
<td>0.589</td>
<td>0.582</td>
<td>0.568</td>
<td>0.581</td>
<td>0.574</td>
<td>0.573</td>
<td>0.589</td>
<td>0.581</td>
<td>0.578</td>
<td>0.598</td>
<td>0.588</td>
</tr>
<tr>
<td>Bi-LSTM</td>
<td>0.622</td>
<td>0.611</td>
<td>0.643</td>
<td>0.631</td>
<td>0.617</td>
<td>0.604</td>
<td>0.581</td>
<td>0.578</td>
<td>0.587</td>
<td>0.566</td>
<td>0.589</td>
<td>0.577</td>
<td>0.561</td>
<td>0.589</td>
<td>0.581</td>
<td>0.578</td>
<td>0.598</td>
<td>0.588</td>
</tr>
<tr>
<td>mBERT</td>
<td>0.648</td>
<td>0.668</td>
<td>0.658</td>
<td>0.675</td>
<td>0.668</td>
<td>0.671</td>
<td>0.669</td>
<td>0.676</td>
<td>0.673</td>
<td>0.669</td>
<td><bold>0.676</bold></td>
<td>0.673</td>
<td><bold>0.670</bold></td>
<td>0.677</td>
<td><bold>0.674</bold></td>
<td>0.669</td>
<td><bold>0.676</bold></td>
<td>0.672</td>
</tr>
<tr>
<td rowspan="4">Multimodal</td>
<td>CLIP</td>
<td>0.596</td>
<td>0.607</td>
<td>0.601</td>
<td>0.652</td>
<td>0.638</td>
<td>0.645</td>
<td>0.631</td>
<td>0.617</td>
<td>0.624</td>
<td>0.622</td>
<td>0.632</td>
<td>0.627</td>
<td>0.636</td>
<td>0.647</td>
<td>0.642</td>
<td>0.612</td>
<td>0.636</td>
<td>0.624</td>
</tr>
<tr>
<td>ALBEF</td>
<td>0.671</td>
<td>0.682</td>
<td>0.676</td>
<td>0.694</td>
<td>0.688</td>
<td>0.691</td>
<td>0.677</td>
<td>0.654</td>
<td>0.665</td>
<td>0.641</td>
<td>0.673</td>
<td>0.657</td>
<td>0.648</td>
<td>0.669</td>
<td>0.658</td>
<td>0.663</td>
<td>0.641</td>
<td>0.652</td>
</tr>
<tr>
<td>G<sup>2</sup>SAM</td>
<td>0.679</td>
<td>0.681</td>
<td>0.667</td>
<td>0.688</td>
<td>0.681</td>
<td>0.676</td>
<td>0.673</td>
<td>0.687</td>
<td>0.655</td>
<td>0.649</td>
<td>0.663</td>
<td>0.673</td>
<td>0.652</td>
<td><bold>0.682</bold></td>
<td>0.651</td>
<td>0.665</td>
<td>0.647</td>
<td>0.662</td>
</tr>
<tr>
<td>VRCL (Ours)</td>
<td><bold>0.726</bold></td>
<td><bold>0.702</bold></td>
<td><bold>0.714</bold></td>
<td><bold>0.721</bold></td>
<td><bold>0.713</bold></td>
<td><bold>0.720</bold></td>
<td><bold>0.722</bold></td>
<td><bold>0.717</bold></td>
<td><bold>0.719</bold></td>
<td><bold>0.684</bold></td>
<td>0.671</td>
<td><bold>0.677</bold></td>
<td>0.661</td>
<td>0.645</td>
<td>0.659</td>
<td><bold>0.689</bold></td>
<td>0.667</td>
<td><bold>0.678</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
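<p>For reference, the three metrics reported in <xref ref-type="table" rid="table-2">Table 2</xref> can be computed from confusion counts. The sketch below assumes scores for a single positive class (hate); the text does not state whether the reported values are per-class or averaged, and the function name and the example counts are illustrative only.</p>
<code language="python"><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts, not taken from the experiments.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```
]]></code>
<p>With 8 true positives, 2 false positives, and 4 false negatives, this gives P = 0.80, R of about 0.67, and F1 of about 0.73. For a single class, F1 is the harmonic mean of P and R and therefore always lies between them; this need not hold for scores averaged over classes.</p>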
<sec id="s5_1">
<label>5.1</label>
<title>Analysis of Visual Model Performance</title>
<p>Among the visual-only models, ConvNeXT outperforms both ViT and ResNet across all languages. On the BHM dataset, ConvNeXT achieves an F1 score of 0.695, surpassing ViT&#x2019;s 0.674 and ResNet&#x2019;s 0.576. VRCL (ours) further exceeds ConvNeXT, improving the F1 score to 0.714. Additionally, across the five languages of the Multi<sup>3</sup>Hate dataset, VRCL improves the F1 score over ConvNeXT by approximately 0.064 on average, underscoring its multilingual adaptability relative to visual-only modeling. A closer look at these baselines reveals several limitations. For instance, while ViT achieves a moderate F1 score on BHM (0.674), its performance drops markedly on Chinese (F1 &#x003D; 0.541), indicating difficulty in handling linguistic diversity. ResNet performs even worse, with F1 scores between 0.532 and 0.584 on Multi<sup>3</sup>Hate, suggesting limited capacity to extract discriminative features from visual data. Although ConvNeXT generalizes relatively better (average F1 of about 0.627 on Multi<sup>3</sup>Hate), it still encounters difficulties with certain languages, such as Chinese (F1 &#x003D; 0.608). These observations further highlight the effectiveness and robustness of the proposed VRCL model in managing multilingual visual inputs.</p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Analysis of Text Model Performance</title>
<p>Among the text-only models, the multilingual pre-trained model mBERT consistently outperforms traditional models such as TextCNN and Bi-LSTM. For example, on the BHM dataset, mBERT achieves an F1 score of 0.658, compared to 0.611 and 0.643 for TextCNN and Bi-LSTM, respectively. Moreover, mBERT performs strongly across all languages in the Multi<sup>3</sup>Hate dataset, particularly in Hindi (IN), where it achieves an F1 score of 0.674. A closer examination of the text-only baselines further underscores the value of multimodal integration. TextCNN achieves a relatively low average F1 score of about 0.59 on Multi<sup>3</sup>Hate, suggesting that relying solely on textual inputs may overlook visual cues essential for accurate discrimination detection. Bi-LSTM outperforms TextCNN on BHM but attains a comparable average F1 score of about 0.59 on Multi<sup>3</sup>Hate, and it still struggles to model complex linguistic patterns, particularly in multilingual contexts. These findings emphasize the limitations of traditional text models in capturing nuanced and cross-lingual semantics. Notably, while mBERT alleviates some of these limitations with its strong multilingual capabilities, our proposed VRCL model still surpasses mBERT&#x2019;s F1 score in all languages except Hindi, further demonstrating VRCL&#x2019;s broader adaptability and effectiveness in detecting multilingual discriminatory content.</p>
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Analysis of Multimodal Model Performance</title>
<p>In the task of multimodal discrimination detection, VRCL demonstrates clear advantages across multiple datasets and language settings, consistently outperforming mainstream models such as ALBEF, CLIP, and G<sup>2</sup>SAM in F1 score. Specifically, on the BHM dataset, VRCL achieves an F1 score of 0.714, notably higher than ALBEF&#x2019;s 0.676, indicating stronger capabilities in multimodal feature fusion and semantic representation. On the Multi<sup>3</sup>Hate dataset, VRCL attains its highest F1 score of 0.720 on the English (US) subset, while also showing stable and robust performance on other language subsets: 0.719 on German (DE) and 0.678 on Chinese (CN), both surpassing the best results of existing methods.</p>
<p>Although G<sup>2</sup>SAM, as one of the latest fusion models, achieves a competitive average F1 score of 0.662, close to VRCL, it still falls short. This gap is mainly attributed to G<sup>2</sup>SAM&#x2019;s current and historical feature fusion mechanism, which has not been fully optimized and thus fails to effectively leverage historical semantic context. Consequently, its performance in cross-modal semantic alignment and discrimination signal recognition is limited. In fact, mainstream multimodal models such as ALBEF, CLIP, and G<sup>2</sup>SAM mostly adopt a &#x201C;single-step&#x201D; feature fusion strategy, lacking the ability to remember and model semantic evolution over time. This limitation makes it difficult for these models to accurately capture subtle cross-modal semantic changes in the presence of semantic drift, implicit discrimination, or complex contextual variations.</p>
<p>In contrast, the superior performance of VRCL primarily benefits from its innovative incorporation of a historical feature memory mechanism and a GRU-based sequential modeling architecture. By dynamically retrieving and integrating the most semantically similar historical samples with the current features, VRCL effectively captures the semantic evolution process, greatly enhancing feature expressiveness and semantic continuity. As a result, VRCL is able to more accurately identify cross-modal semantic biases and discriminatory signals, demonstrating stronger robustness and generalization capabilities.</p>
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Ablation Study</title>
<p>To assess the contribution of each module to the model&#x2019;s performance, this study conducted ablation experiments by systematically removing individual modules and tracking changes in the F1 score. <xref ref-type="table" rid="table-3">Table 3</xref> presents the experimental results.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Ablation study results</title>
</caption>
<table>
<colgroup>
<col/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col/>
</colgroup>
<thead>
<tr>
<th>Num.</th>
<th align="center">XLM-R feature extraction</th>
<th align="center">Alignment and fusion</th>
<th align="center">Contrastive learning</th>
<th align="center">Dynamic memory-bank</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td></td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>0.685</td>
</tr>
<tr>
<td>2</td>
<td>&#x2713;</td>
<td></td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>0.659</td>
</tr>
<tr>
<td>3</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td>&#x2713;</td>
<td>0.692</td>
</tr>
<tr>
<td>4</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td>0.633</td>
</tr>
<tr>
<td>5</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>0.714</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><list list-type="bullet">
<list-item>
<p>Experiment 1: Replacing XLM-R with the original CLIP text encoder reduced the F1 score to 0.685. This indicates that XLM-R is essential for cross-lingual feature capture and semantic understanding, and that its absence noticeably degrades performance in multilingual tasks.</p></list-item>
<list-item>
<p>Experiment 2: Removing the multimodal alignment and fusion module resulted in a drop in the F1 score to 0.659. The results demonstrate that this module strengthens alignment between textual and visual features via a cross-attention mechanism, effectively modeling both local and global dependencies and supporting cross-modal semantic interactions.</p></list-item>
<list-item>
<p>Experiment 3: Removing the contrastive learning module reduced the F1 score to 0.692, the smallest decline among the ablations. Even so, the drop indicates that the module optimizes the feature distribution and enhances model robustness by increasing intra-class similarity and inter-class differentiation, thereby improving feature representation quality.</p></list-item>
<list-item>
<p>Experiment 4: Removing the dynamic memory module caused the F1 score to drop to 0.633, marking the largest decline. This underscores the dynamic memory bank&#x2019;s importance for long-term feature storage and contextual modeling, enabling it to capture historical features and dynamic information, thus improving adaptability to complex tasks.</p></list-item>
<list-item>
<p>Experiment 5: The complete model, incorporating all modules, achieved the best performance with an F1 score of 0.714. These results validate the synergistic contributions of each module to feature extraction, cross-modal alignment, contrastive learning, and dynamic memory modeling, enhancing both model robustness and adaptability.</p></list-item>
</list></p>
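<p>The relative contribution of each module can be read off <xref ref-type="table" rid="table-3">Table 3</xref> as the F1 drop with respect to the complete model. The short sketch below reproduces that arithmetic; the variable and function names are illustrative.</p>
<code language="python"><![CDATA[
```python
FULL_F1 = 0.714  # Experiment 5, all modules enabled

# F1 after removing (or replacing) each module, Experiments 1 to 4.
ABLATED_F1 = {
    "XLM-R feature extraction": 0.685,
    "Alignment and fusion": 0.659,
    "Contrastive learning": 0.692,
    "Dynamic memory-bank": 0.633,
}


def f1_drops(full, ablated):
    """F1 decrease caused by removing each module, rounded to 3 decimals."""
    return {name: round(full - score, 3) for name, score in ablated.items()}


drops = f1_drops(FULL_F1, ABLATED_F1)
# Modules ordered from largest to smallest F1 drop.
ranked = sorted(drops, key=drops.get, reverse=True)
```
]]></code>
<p>The resulting drops are 0.081 (dynamic memory bank), 0.055 (alignment and fusion), 0.029 (XLM-R feature extraction), and 0.022 (contrastive learning).</p>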
</sec>
<sec id="s5_5">
<label>5.5</label>
<title>Computational Complexity and Model Analysis</title>
<p>The proposed VRCL model demonstrates strong performance in integrating multilingual textual and visual features. However, its computational complexity warrants careful consideration. Both the text encoder (XLM-R) and image encoder (ViT) are based on the Transformer architecture, resulting in a time complexity of approximately <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>n</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for textual inputs and <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:mi>&#x1D4AA;</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mi>d</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for visual inputs, where <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>n</mml:mi></mml:math></inline-formula> denotes the input text length, <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>p</mml:mi></mml:math></inline-formula> is the number of image patches, and <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>d</mml:mi></mml:math></inline-formula> is the hidden dimension. In addition, the introduction of the semantic memory module improves contextual reasoning but also incurs additional storage and retrieval overhead. While the overall resource consumption is higher than that of traditional unimodal models, the VRCL model remains feasible for training and inference on modern GPU infrastructures. The advantages of VRCL lie in its robust multilingual adaptability, enhanced cross-modal alignment, and improved generalization through memory-based semantic enhancement. 
Nevertheless, the model still faces challenges such as relatively high computational costs and limited performance on certain low-resource languages, indicating room for further optimization in efficiency and cross-lingual generalization.</p>
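<p>To give a sense of scale for the quadratic terms above, the following sketch evaluates the leading-order self-attention cost for one layer. All numeric settings are assumptions chosen for illustration (a 224 &#x00D7; 224 image with 16 &#x00D7; 16 patches, 128 text tokens, hidden dimension 768); they are not taken from the experimental configuration, and constant factors are ignored.</p>
<code language="python"><![CDATA[
```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def attention_cost(seq_len, hidden_dim):
    """Leading-order self-attention cost per layer: seq_len^2 * hidden_dim."""
    return seq_len ** 2 * hidden_dim


# Assumed, illustrative settings (not from the paper).
p = num_patches(image_size=224, patch_size=16)
text_cost = attention_cost(seq_len=128, hidden_dim=768)
image_cost = attention_cost(seq_len=p, hidden_dim=768)
```
]]></code>
<p>Under these assumptions the image branch processes 196 patches and its attention cost is roughly 2.3 times that of the text branch, which illustrates why the number of patches dominates the cost when the text inputs are short.</p>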
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Visualization</title>
<p>To further demonstrate the effectiveness of our proposed VRCL module in multimodal multilingual discrimination detection, we visualize the feature distribution around a discrimination instance. Specifically, we retrieve the top 200 nearest neighbor instances as reference samples for a discrimination case, and employ the t-SNE algorithm to reduce the feature dimensions to 2D space, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. <xref ref-type="fig" rid="fig-5">Fig. 5a</xref> illustrates the distribution without applying VRCL, whereas <xref ref-type="fig" rid="fig-5">Fig. 5b</xref> shows the feature distribution when VRCL is used. As seen in <xref ref-type="fig" rid="fig-5">Fig. 5b</xref>, the retrieved neighbors are more semantically consistent with the ground-truth label (discrimination), while <xref ref-type="fig" rid="fig-5">Fig. 5a</xref> contains more irrelevant or noisy (non-discrimination) instances. This observation suggests that VRCL enhances the semantic alignment of retrieved k-nearest neighbors across both modalities and languages, thereby boosting model performance in complex real-world scenarios.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Distribution of the retrieved top 200 nearest-neighbor instances for a discrimination case</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_66532-fig-5.tif"/>
</fig>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusions</title>
<p>In this paper, we propose a discrimination detection method for multilingual multimodal information, which integrates a ViT image encoder and an XLM-R text encoder. The multimodal feature alignment and fusion module effectively captures fine-grained interactive information between images and text. Based on this, VRCL combines contrastive learning with a historical feature memory bank to create a high-quality discrimination feature embedding space. The GRU module is employed to merge historical and current features, enhancing both the representation capability and detection accuracy of discriminatory signals. Experimental results demonstrate that the proposed method achieves strong performance across multiple multilingual multimodal datasets, highlighting its effectiveness and robustness across diverse cultural and contextual environments. In future work, we plan to focus on optimizing the model&#x2019;s computational efficiency and expanding its applicability across various domains to better address the growing complexity of online discrimination.</p>
</sec>
</body>
<back>
<ack>
<p>This work was funded by the Open Foundation of Key Laboratory of Cyberspace Security, Ministry of Education [KLCS20240210].</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This research was funded by the Open Foundation of Key Laboratory of Cyberspace Security, Ministry of Education [KLCS20240210].</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>All authors contributed to the conceptualization and methodology of the study. Jiahao Cheng, Jun Wang, and Ying Yang conducted the primary data collection and analysis. Meijiao Li contributed to drafting the initial manuscript. Kejun Zhang reviewed and revised the manuscript critically for important intellectual content. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are openly available in the Multi<sup>3</sup>Hate dataset and BHM dataset at: <ext-link ext-link-type="uri" xlink:href="https://github.com/minhducbui/multi3hate">https://github.com/minhducbui/multi3hate</ext-link> (accessed on 22 June 2025) and <ext-link ext-link-type="uri" xlink:href="https://github.com/eftekhar-hossain/bengali-hateful-memes?tab=readme-ov-file">https://github.com/eftekhar-hossain/bengali-hateful-memes?tab=readme-ov-file</ext-link> (accessed on 22 June 2025).</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Waseem</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hovy</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter</article-title>. In: <conf-name>Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2016 Jun 12&#x2013;17</conf-name>; <publisher-loc>San Diego, CA, USA</publisher-loc>. p. <fpage>88</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/n16-2013</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sundararajan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Palanisamy</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Multi-rule based ensemble feature selection model for sarcasm type detection in Twitter</article-title>. <source>Comput Intell Neurosci</source>. <year>2020</year>;<volume>2020</volume>(<issue>7</issue>):<fpage>1</fpage>&#x2013;<lpage>17</lpage>. doi:<pub-id pub-id-type="doi">10.1155/2020/2860479</pub-id>; <pub-id pub-id-type="pmid">32405293</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Dagar</surname> <given-names>V</given-names></string-name>, <string-name><surname>Verma</surname> <given-names>A</given-names></string-name>, <string-name><surname>Govardhan</surname> <given-names>K</given-names></string-name></person-group>. <chapter-title>Sentiment analysis and sarcasm detection (using emoticons)</chapter-title>. In: <person-group person-group-type="author"><string-name><surname>Swarnalatha</surname> <given-names>P</given-names></string-name>, <string-name><surname>Prabu</surname> <given-names>S</given-names></string-name></person-group>, editors. <source>Research anthology on implementing sentiment analysis across multiple disciplines</source>. <publisher-loc>Hershey, PA, USA</publisher-loc>: <publisher-name>IGI Global</publisher-name>; <year>2022</year>. p. <fpage>1600</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.4018/978-1-6684-6303-1.ch085</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bhavana</surname> <given-names>N</given-names></string-name>, <string-name><surname>Guthur</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Reddy</surname> <given-names>KLS</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>ST</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Cognizance through convolution: a deep learning approach for emotion recognition via convolutional neural networks</article-title>. <source>Procedia Comput Sci</source>. <year>2025</year>;<volume>259</volume>:<fpage>1336</fpage>&#x2013;<lpage>45</lpage>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Siddiqui</surname> <given-names>M</given-names></string-name>, <string-name><surname>Pandey</surname> <given-names>R</given-names></string-name>, <string-name><surname>Srivastava</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mishra</surname> <given-names>R</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Sarcasm detection from social media posts using machine-learning techniques: a comparative analysis</article-title>. In: <conf-name>Proceedings of the 3rd International Conference on Advanced Computing and Software Engineering; 2021 Feb 19&#x2013;20</conf-name>; <publisher-loc>Sultanpur, India</publisher-loc>. p. <fpage>28</fpage>&#x2013;<lpage>33</lpage>. doi:<pub-id pub-id-type="doi">10.5220/0010561900003161</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kiilu</surname> <given-names>KK</given-names></string-name>, <string-name><surname>Okeyo</surname> <given-names>G</given-names></string-name>, <string-name><surname>Rimiru</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ogada</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Using na&#x00EF;ve Bayes algorithm in detection of hate tweets</article-title>. <source>Int J Sci Res Publ</source>. <year>2018</year>;<volume>8</volume>(<issue>3</issue>):<fpage>99</fpage>&#x2013;<lpage>107</lpage>. doi:<pub-id pub-id-type="doi">10.29322/ijsrp.8.3.2018.p7517</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hana</surname> <given-names>KM</given-names></string-name>, <collab>Adiwijaya</collab>, <string-name><surname>Al Faraby</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bramantoro</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Multi-label classification of Indonesian hate speech on Twitter using support vector machines</article-title>. In: <conf-name>Proceedings of the 2020 International Conference on Data Science and Its Applications (ICoDSA); 2020 Aug 5&#x2013;6</conf-name>; <publisher-loc>Bandung, Indonesia</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1109/icodsa50139.2020.9212992</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Rodriguez-Sanchez</surname> <given-names>F</given-names></string-name>, <string-name><surname>Carrillo-de-Albornoz</surname> <given-names>J</given-names></string-name>, <string-name><surname>Plaza</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Automatic classification of sexism in social networks: an empirical study on Twitter data</article-title>. <source>IEEE Access</source>. <year>2020</year>;<volume>8</volume>:<fpage>219563</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2020.3042604</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Conneau</surname> <given-names>A</given-names></string-name>, <string-name><surname>Baevski</surname> <given-names>A</given-names></string-name>, <string-name><surname>Collobert</surname> <given-names>R</given-names></string-name>, <string-name><surname>Mohamed</surname> <given-names>A</given-names></string-name>, <string-name><surname>Auli</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Unsupervised cross-lingual representation learning for speech recognition</article-title>. In: <conf-name>Proceedings of the Interspeech 2021; 2021 Aug 30&#x2013;Sep 3</conf-name>; <publisher-loc>Brno, Czech Republic</publisher-loc>. p. <fpage>2426</fpage>&#x2013;<lpage>30</lpage>. doi:<pub-id pub-id-type="doi">10.21437/interspeech.2021-329</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Beyer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kolesnikov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Weissenborn</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Unterthiner</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>An image is worth 16x16 words: transformers for image recognition at scale</article-title>. <comment>arXiv:2010.11929. 2020</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ballakur</surname> <given-names>AA</given-names></string-name>, <string-name><surname>Arya</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Empirical evaluation of gated recurrent neural network architectures in aviation delay prediction</article-title>. In: <conf-name>Proceedings of the 5th International Conference on Computing, Communication and Security (ICCCS); 2020 Oct 14&#x2013;16</conf-name>; <publisher-loc>Patna, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>7</lpage>. doi:<pub-id pub-id-type="doi">10.1109/icccs49678.2020.9276855</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kiela</surname> <given-names>D</given-names></string-name>, <string-name><surname>Firooz</surname> <given-names>H</given-names></string-name>, <string-name><surname>Mohan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Goswami</surname> <given-names>V</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ringshia</surname> <given-names>P</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>The hateful memes challenge: detecting hate speech in multimodal memes</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2020</year>;<volume>33</volume>:<fpage>2611</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2005.04790</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Hateful memes detection based on multi-task learning</article-title>. <source>Mathematics</source>. <year>2022</year>;<volume>10</volume>(<issue>23</issue>):<fpage>4525</fpage>. doi:<pub-id pub-id-type="doi">10.3390/math10234525</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Multimodal detection of hateful memes by applying a vision-language pre-training model</article-title>. <year>2022</year>. doi:<pub-id pub-id-type="doi">10.21203/rs.3.rs-1414253/v2</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Nie</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hong</surname> <given-names>R</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Multimodal dialog system: generating responses via adaptive decoders</article-title>. In: <conf-name>Proceedings of the 27th ACM International Conference on Multimedia; 2019 Oct 21&#x2013;25</conf-name>; <publisher-loc>Nice, France</publisher-loc>. p. <fpage>1098</fpage>&#x2013;<lpage>106</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3343031.3350923</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kumar</surname> <given-names>GK</given-names></string-name>, <string-name><surname>Nandakumar</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Hate-CLIPper: multimodal hateful meme classification based on cross-modal interaction of CLIP features</article-title>. In: <conf-name>Proceedings of the 2nd Workshop on NLP for Positive Impact (NLP4PI); 2022 Dec 7</conf-name>; <publisher-loc>Abu Dhabi, United Arab Emirates</publisher-loc>. p. <fpage>171</fpage>&#x2013;<lpage>83</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2022.nlp4pi-1.20</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name></person-group>. <article-title>InterCLIP-MEP: interactive CLIP and memory-enhanced predictor for multi-modal sarcasm detection</article-title>. <comment>arXiv:2406.16464. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2406.16464</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Batra</surname> <given-names>D</given-names></string-name>, <string-name><surname>Parikh</surname> <given-names>D</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>S</given-names></string-name></person-group>. <article-title>ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>. In: <conf-name>Proceedings of the Neural Information Processing Systems 2019</conf-name>; <year>2019 Dec 8&#x2013;14</year>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1908.02265</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>LH</given-names></string-name>, <string-name><surname>Yatskar</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>D</given-names></string-name>, <string-name><surname>Hsieh</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>KW</given-names></string-name></person-group>. <article-title>What does BERT with vision look at?</article-title> In: <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020 Jul 5&#x2013;10; Online</source>. p. <fpage>5265</fpage>&#x2013;<lpage>75</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2020.acl-main.469</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>YC</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>L</given-names></string-name>, <string-name><surname>El Kholy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ahmed</surname> <given-names>F</given-names></string-name>, <string-name><surname>Gan</surname> <given-names>Z</given-names></string-name>, <etal>et al.</etal></person-group> <chapter-title>UNITER: universal image-text representation learning</chapter-title>. In: <source>Proceedings of the Computer Vision&#x2014;ECCV 2020</source>; <year>2020 Aug 23&#x2013;28</year>; <publisher-loc>Glasgow, UK</publisher-loc>. <publisher-name>Springer</publisher-name>; <year>2020</year>. p. <fpage>104</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58577-8_7</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Xue</surname> <given-names>L</given-names></string-name>, <string-name><surname>Constant</surname> <given-names>N</given-names></string-name>, <string-name><surname>Roberts</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kale</surname> <given-names>M</given-names></string-name>, <string-name><surname>Al-Rfou</surname> <given-names>R</given-names></string-name>, <string-name><surname>Siddhant</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>mT5: a massively multilingual pre-trained text-to-text transformer</article-title>. In: <conf-name>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021 Jun 6&#x2013;11; Online</conf-name>. p. <fpage>483</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2021.naacl-main.41</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Montariol</surname> <given-names>S</given-names></string-name>, <string-name><surname>Riabi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Seddah</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Multilingual auxiliary tasks training: bridging the gap between languages for zero-shot transfer of hate speech detection models</article-title>. In: <conf-name>Findings of AACL-IJCNLP 2022; 2022 Nov 21&#x2013;23</conf-name>; <publisher-loc>Taipei, Taiwan</publisher-loc>. p. <fpage>347</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2022.findings-aacl.33</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>R&#x00F6;ttger</surname> <given-names>P</given-names></string-name>, <string-name><surname>Seelawi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Nozza</surname> <given-names>D</given-names></string-name>, <string-name><surname>Talat</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Vidgen</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Multilingual HateCheck: functional tests for multilingual hate speech detection models</article-title>. In: <conf-name>Proceedings of the 6th Workshop on Online Abuse and Harms (WOAH); 2022 Jul 10&#x2013;15</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>154</fpage>&#x2013;<lpage>69</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2022.woah-1.15</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Bui</surname> <given-names>MD</given-names></string-name>, <string-name><surname>von der Wense</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lauscher</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Multi3Hate: multimodal, multilingual, and multicultural hate speech detection with vision-language models</article-title>. <comment>arXiv:2411.03888. 2024</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2411.03888</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hossain</surname> <given-names>E</given-names></string-name>, <string-name><surname>Sharif</surname> <given-names>O</given-names></string-name>, <string-name><surname>Hoque</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Preum</surname> <given-names>SM</given-names></string-name></person-group>. <article-title>Deciphering hate: identifying hateful memes and their targets</article-title>. In: <conf-name>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024); 2024 Aug 11&#x2013;16</conf-name>; <publisher-loc>Bangkok, Thailand</publisher-loc>. p. <fpage>8347</fpage>&#x2013;<lpage>59</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/2024.acl-long.454</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Convolutional neural networks for sentence classification</article-title>. In: <conf-name>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014 Oct 25&#x2013;29</conf-name>; <publisher-loc>Doha, Qatar</publisher-loc>. p. <fpage>1746</fpage>&#x2013;<lpage>51</lpage>. doi:<pub-id pub-id-type="doi">10.3115/v1/d14-1181</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Graves</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schmidhuber</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Framewise phoneme classification with bidirectional LSTM and other neural network architectures</article-title>. <source>Neural Netw</source>. <year>2005</year>;<volume>18</volume>(<issue>5</issue>):<fpage>602</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neunet.2005.06.042</pub-id>; <pub-id pub-id-type="pmid">16112549</pub-id></mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name></person-group>. <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>. In: <conf-name>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019); 2019 Jun 2&#x2013;7</conf-name>; <publisher-loc>Minneapolis, MN, USA</publisher-loc>. p. <fpage>4171</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.18653/v1/N19-1423</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>He</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Deep residual learning for image recognition</article-title>. In: <conf-name>Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27&#x2013;30</conf-name>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>770</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>CY</given-names></string-name>, <string-name><surname>Feichtenhofer</surname> <given-names>C</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>S</given-names></string-name></person-group>. <article-title>A ConvNet for the 2020s</article-title>. In: <conf-name>Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18&#x2013;24</conf-name>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>11966</fpage>&#x2013;<lpage>76</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01167</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Selvaraju</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Gotmare</surname> <given-names>A</given-names></string-name>, <string-name><surname>Joty</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Hoi</surname> <given-names>SCH</given-names></string-name></person-group>. <article-title>Align before fuse: vision and language representation learning with momentum distillation</article-title>. <source>Adv Neural Inf Process Syst</source>. <year>2021</year>;<volume>34</volume>:<fpage>9694</fpage>&#x2013;<lpage>705</lpage>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.2107.07651</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>R</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>G<sup>2</sup>SAM: graph-based global semantic awareness method for multimodal sarcasm detection</article-title>. In: <conf-name>Proceedings of the 38th AAAI Conference on Artificial Intelligence; 2024 Feb 20&#x2013;27</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>9151</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i8.28766</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>