<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="review-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">76411</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2026.076411</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Review</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions</article-title>
<alt-title alt-title-type="left-running-head">A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions</alt-title>
<alt-title alt-title-type="right-running-head">A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Moon</surname><given-names>A-Seong</given-names></name></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Kim</surname><given-names>Haesung</given-names></name></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Park</surname><given-names>Ye-Chan</given-names></name></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Lee</surname><given-names>Jaesung</given-names></name><email>curseor@cau.ac.kr</email></contrib>
<aff id="aff-1"><institution>Department of Artificial Intelligence, Chung-Ang University</institution>, <addr-line>Seoul</addr-line>, <country>Republic of Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jaesung Lee. Email: <email>curseor@cau.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>12</day><month>3</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>2</issue>
<elocation-id>1</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>01</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors. Published by Tech Science Press.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>The Authors</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_76411.pdf"></self-uri>
<abstract>
<p>Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Multimodal emotion recognition</kwd>
<kwd>multimodal learning</kwd>
<kwd>cross-modal learning</kwd>
<kwd>fusion strategies</kwd>
<kwd>representation learning</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>Korea government</funding-source>
<award-id>RS-2021-II211341</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Automatic Neural Network Generation and Deployment Optimized for Runtime Environment</funding-source>
<award-id>2021-0-00766</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Emotions serve as essential elements in shaping human cognition and behavior. These affective states influence perception, guide decision-making, and govern both verbal and non-verbal communication. As intelligent systems continue to advance and permeate daily life, the capacity to recognize and interpret human emotions has become a vital component of human-centered artificial intelligence [<xref ref-type="bibr" rid="ref-1">1</xref>]. Emotion recognition enables machines to engage with users in a more adaptive and context-aware manner, fostering natural interaction across domains such as healthcare [<xref ref-type="bibr" rid="ref-2">2</xref>], education [<xref ref-type="bibr" rid="ref-3">3</xref>], and autonomous systems [<xref ref-type="bibr" rid="ref-4">4</xref>]. Over the past decade, automatic emotion recognition has evolved from unimodal frameworks that relied solely on audio or visual cues to multimodal systems that integrate diverse affective signals to capture the complexity of human emotion [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>]. This transition has positioned fusion strategies at the core of emotion recognition research, shaping how information from multiple modalities is processed, aligned, and interpreted within computational models.</p>
<p>The integration of multiple modalities within emotion recognition systems has been motivated by the complementary nature of affective cues. Visual signals convey facial expressions and gestures, while acoustic signals capture prosodic and tonal variations that reflect subtle emotional nuances. When combined, these modalities provide a richer and more reliable understanding of human affect than either source alone. However, the process of merging heterogeneous data streams raises a fundamental design question: how and when should information from different modalities be fused? Over the years, research has converged on three principal paradigms: early fusion, late fusion, and hybrid fusion [<xref ref-type="bibr" rid="ref-7">7</xref>]. Early fusion aggregates features before inference, enabling joint learning of cross-modal relationships at the representation level. Late fusion, in contrast, combines independently learned modality-specific predictions, emphasizing interpretability and modularity. Hybrid fusion seeks to balance these two approaches by introducing intermediate-level interactions through mechanisms such as attention, gating, or transformer-based alignment. Each paradigm embodies a distinct trade-off among expressiveness, flexibility, and computational efficiency, forming the foundation for most MER frameworks.</p>
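<p>As a minimal illustration of the three paradigms, the following Python sketch contrasts early, late, and hybrid fusion on toy two-dimensional features. The function names, weight matrices, and feature values are hypothetical and chosen only for illustration; practical MER systems learn these parameters and use far richer encoders. The scalar gate in the hybrid variant is one simple stand-in for the attention- or gating-based interactions mentioned above.</p>
<code language="python" xml:space="preserve">
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights):
    """One score per class: dot product of the features with each weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def early_fusion(visual, audio, joint_weights):
    """Concatenate modality features, then classify the joint vector."""
    return softmax(linear(visual + audio, joint_weights))


def late_fusion(visual, audio, visual_weights, audio_weights):
    """Classify each modality separately, then average the predictions."""
    p_visual = softmax(linear(visual, visual_weights))
    p_audio = softmax(linear(audio, audio_weights))
    return [(pv + pa) / 2 for pv, pa in zip(p_visual, p_audio)]


def hybrid_fusion(visual, audio, gate_weights, class_weights):
    """Mix same-sized modality features with a scalar gate, then classify."""
    gate_logit = sum(w * x for w, x in zip(gate_weights, visual + audio))
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # scalar in (0, 1)
    mixed = [gate * v + (1.0 - gate) * a for v, a in zip(visual, audio)]
    return softmax(linear(mixed, class_weights))


# Toy 2-dimensional features and 2 emotion classes (all values are made up).
visual = [0.9, 0.1]
audio = [0.2, 0.8]
joint_w = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
unimodal_w = [[1.0, 0.0], [0.0, 1.0]]
gate_w = [0.5, -0.5, 0.5, -0.5]

p_early = early_fusion(visual, audio, joint_w)
p_late = late_fusion(visual, audio, unimodal_w, unimodal_w)
p_hybrid = hybrid_fusion(visual, audio, gate_w, unimodal_w)
```
</code>
<p>Each function returns a probability distribution over the two toy classes; the variants differ only in the stage at which the modalities interact, which is the design question the three paradigms answer differently.</p>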
<p>In emotion recognition research, an equally important consideration lies in how emotions are represented. Broadly, affective states are modeled using either categorical or dimensional formulations. Categorical emotion models describe emotions as discrete classes, such as happiness, sadness, anger, or fear, and have been widely adopted in early emotion recognition systems due to their intuitive interpretability. In contrast, dimensional emotion models characterize affect along continuous axes, most commonly valence and arousal, where valence reflects the degree of emotional positivity or negativity and arousal indicates the level of activation or intensity. This representation enables finer-grained modeling of emotional dynamics and ambiguity, and is particularly well suited to multimodal and conversational settings where emotions evolve gradually over time. As a result, both categorical and dimensional formulations coexist in modern MER research, with the choice of representation influencing dataset design, learning objectives, and fusion strategies.</p>
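<p>The relationship between the two formulations can be illustrated with a short sketch that discretizes a valence&#x2013;arousal point into a coarse category. The quadrant boundaries at zero, the assumed value range, and the label names are assumptions made for illustration, not a mapping prescribed by any particular dataset.</p>
<code language="python" xml:space="preserve">
```python
def quadrant_label(valence, arousal):
    """Map a (valence, arousal) point to a coarse quadrant label.

    Both inputs are assumed to lie in [-1, 1]; zero is used as the
    (hypothetical) boundary between negative and positive values.
    """
    if valence >= 0 and arousal >= 0:
        return "positive/high-arousal"   # e.g., excitement
    if valence >= 0:
        return "positive/low-arousal"    # e.g., contentment
    if arousal >= 0:
        return "negative/high-arousal"   # e.g., anger or fear
    return "negative/low-arousal"        # e.g., sadness


# A continuous trajectory can cross category boundaries gradually.
trajectory = [(0.6, 0.7), (0.2, 0.1), (-0.1, -0.2), (-0.5, -0.6)]
labels = [quadrant_label(v, a) for v, a in trajectory]
```
</code>
<p>The sketch also shows what is lost in the conversion: the second and third trajectory points are close together in the dimensional space yet receive opposite categorical labels.</p>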
<p>To contextualize these fusion paradigms within the broader MER workflow, <xref ref-type="fig" rid="fig-1">Fig. 1</xref> presents a standard processing pipeline commonly employed in multimodal systems. The pipeline processes raw inputs from visual, audio, text, and sensor modalities and derives informative representations from each stream. These representations are subsequently integrated through an appropriate fusion strategy, whether early, late, or hybrid, before the consolidated embedding is mapped to the final emotion prediction space.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Overview of the multimodal emotion recognition (MER) pipeline.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-1.tif"/>
</fig>
<p>Although fusion has become the cornerstone of MER, designing an effective integration mechanism remains a persistent challenge. Early fusion approaches enable joint representation learning but often struggle with the heterogeneity of feature distributions and the dominance of certain modalities during training. Late fusion methods, while robust to such discrepancies, tend to overlook fine-grained temporal dependencies that are essential for interpreting dynamic emotional expressions. These limitations have led researchers to explore hybrid fusion strategies that incorporate intermediate interactions between modalities. By leveraging mechanisms such as cross-modal attention, adaptive weighting, and temporal alignment, hybrid fusion frameworks attempt to capture both local and global dependencies across modalities. The continuous evolution of these methods reflects an ongoing effort to balance interpretability, robustness, and adaptability&#x2014;key factors that determine the overall effectiveness of MER systems.</p>
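<p>The cross-modal attention mentioned above can be sketched as scaled dot-product attention in which the time steps of one modality act as queries over the time steps of another. The version below is a deliberately simplified assumption: it omits the learned query, key, and value projections used in practice, and the toy text and audio sequences are hypothetical.</p>
<code language="python" xml:space="preserve">
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(queries, keys, values):
    """Each query step (one modality) attends over all steps of another.

    Returns the attended vectors and the attention weights. No learned
    projections are applied, so this is only the core attention step.
    """
    scale = math.sqrt(len(keys[0]))
    attended, weights = [], []
    for q in queries:
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        attended.append([
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return attended, weights


# Two text steps attend over three audio steps (unaligned lengths).
text_steps = [[1.0, 0.0], [0.0, 1.0]]
audio_steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_modal_attention(text_steps, audio_steps, audio_steps)
```
</code>
<p>Because the weights are computed per query step, the two sequences need not share a length or a sampling rate, which is one reason attention-based interaction is attractive for unaligned multimodal streams.</p>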
<p>Given the diversity of fusion strategies and their pivotal role in shaping MER, a systematic examination of these approaches is essential to understand the current research landscape. This survey aims to provide a comprehensive analysis of fusion methodologies, categorizing existing studies by integration stage, architectural design, and interaction dynamics among modalities. In doing so, it highlights how different fusion paradigms influence the learning of emotional representations and the interpretability of affective responses. Furthermore, this paper discusses emerging trends such as attention-based fusion, transformer-driven architectures, and adaptive gating mechanisms that redefine the boundaries between early, late, and hybrid designs. By consolidating the existing literature and identifying conceptual links between studies, this survey aims to offer a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more cohesive, generalizable integration frameworks.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Recent studies in MER have expanded rapidly, encompassing developments in architecture design, feature representation, and data-driven evaluation. This section reviews these advancements from four complementary perspectives. <xref ref-type="sec" rid="s2_1">Section 2.1</xref> introduces recent trends in MER, outlining how deep learning frameworks and multimodal architectures have evolved in recent years. <xref ref-type="sec" rid="s2_2">Section 2.2</xref> discusses embedding and feature representation methods, focusing on how pre-trained encoders and shared latent spaces enhance the expressiveness of affective cues across modalities. <xref ref-type="sec" rid="s2_3">Section 2.3</xref> examines various fusion strategies, comparing early, late, and hybrid approaches and analyzing how each contributes to the integration of multimodal information. Finally, <xref ref-type="sec" rid="s2_4">Section 2.4</xref> summarizes the datasets and evaluation metrics commonly used in the field, highlighting their roles in benchmarking performance and facilitating consistent comparison across studies.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Recent Advances in MER</title>
<p>To provide a systematic overview of recent progress in MER, representative studies published between 2021 and 2025 are organized and analyzed with respect to their core modalities, fusion strategies, pretrained feature extractors, and benchmark datasets. A comprehensive summary of these works is provided in <xref ref-type="table" rid="table-3">Table A1</xref> in <xref ref-type="app" rid="app-1">Appendix A</xref> for reference. In this subsection, the major works are discussed in both thematic and temporal order to highlight their technical motivations, fusion mechanisms, and contributions to multimodal understanding, while emphasizing the methodological shifts from early handcrafted fusion designs to transformer- and large-language-model-based architectures.</p>
<p>The studies included in this survey were selected through a structured literature review process. We primarily targeted peer-reviewed journals and top-tier conference publications in the fields of affective computing, multimodal learning, and human-centered AI. A literature search was conducted across common academic databases using keywords such as multimodal emotion recognition, emotion recognition in conversation, audio-visual-text fusion, and multimodal affective computing. Priority was given to works that introduced novel fusion strategies, alignment mechanisms, or architectural contributions, as well as studies evaluated on widely used public benchmarks. This selection strategy resulted in a curated set of representative papers that collectively reflect recent methodological trends in MER.</p>
<p>The earliest studies in the surveyed period, published in 2021, established hybrid fusion as the foundation of modern MER. Progressive and cross-modal reinforcement frameworks strengthened unaligned sequence modeling through attention- and message-based interactions, improving the robustness of emotion inference under modality asynchrony [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>]. Hybrid integration of convolutional and recurrent encoders enabled richer temporal reasoning across speech, text, and facial modalities, achieving balanced feature complementarity [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. In 2022, research shifted toward disentangled and uncertainty-aware architectures. Feature-disentangled multimodal learning separated shared and modality-specific components to mitigate redundancy and distribution gaps, while hierarchical uncertainty modeling and self-supervised pretraining improved robustness under data imbalance [<xref ref-type="bibr" rid="ref-12">12</xref>&#x2013;<xref ref-type="bibr" rid="ref-14">14</xref>]. Meanwhile, unsupervised and sensor-integrated systems extended MER beyond audiovisual signals by incorporating physiological cues and radar sensing, achieving strong generalization across real-world settings [<xref ref-type="bibr" rid="ref-15">15</xref>&#x2013;<xref ref-type="bibr" rid="ref-17">17</xref>]. Collectively, these developments marked the transition from handcrafted concatenation to semantic-aware fusion and modality disentanglement.</p>
<p>The 2023 research phase marked the consolidation of transformer-based architectures and conversational emotion recognition. Hybrid transformers such as DF-ERC, MER-HAN, and SDT modeled intra- and inter-modal dependencies through attention, self-distillation, and context-aware gating, achieving state-of-the-art accuracy on the MELD and IEMOCAP datasets [<xref ref-type="bibr" rid="ref-18">18</xref>&#x2013;<xref ref-type="bibr" rid="ref-20">20</xref>]. Multi-label and emotion-level embedding frameworks introduced emotion co-occurrence modeling and simultaneous classification of multiple affective states [<xref ref-type="bibr" rid="ref-21">21</xref>,<xref ref-type="bibr" rid="ref-22">22</xref>]. Parallel trends focused on label efficiency and robustness under missing modalities. Semi-supervised and uncertainty-calibrated methods such as Expression-MAE, COLD Fusion, and IF-MMIN leveraged pseudo-labeling and modality imagination to sustain performance with limited supervision [<xref ref-type="bibr" rid="ref-23">23</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>]. Complementary research emphasized audiovisual synchronization and deep metric learning through cross-modal attention and Bi-GRU modeling, bridging acoustic&#x2013;visual coherence in low-resource environments [<xref ref-type="bibr" rid="ref-26">26</xref>&#x2013;<xref ref-type="bibr" rid="ref-28">28</xref>]. Altogether, 2023 consolidated transformer-based fusion as a dominant paradigm, emphasizing interpretability, resilience, and fine-grained interaction across modalities.</p>
<p>The 2024 research landscape represented a pivotal shift toward instruction-tuned and graph-augmented architectures integrating large language models (LLMs) and cross-modal reasoning. LLM-based textual encoders have recently advanced MER toward reasoning-level affect understanding by enabling richer semantic abstraction and contextual modeling. Through instruction-tuned and dialogue-aware pretraining, these models can capture discourse coherence, implicit emotional cues, and causal relationships that extend beyond surface-level sentiment expressions. Such capabilities are particularly beneficial in conversational MER scenarios, where emotional states are influenced by long-range context, speaker intent, and pragmatic nuances. As a result, LLM-based frameworks provide a powerful mechanism for modeling complex emotional dynamics that are difficult to capture with conventional sequence encoders. Instruction-tuned systems such as Emotion-LLaMA and DialogueMLLM unified audio, visual, and textual modalities through structured prompting and generative reasoning [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>]. Graph-based designs such as MGLRA, AGF-IB, and DEDNet enhanced contextual alignment and speaker dependency modeling in dialogues [<xref ref-type="bibr" rid="ref-31">31</xref>&#x2013;<xref ref-type="bibr" rid="ref-33">33</xref>]. Hybrid interpretable architectures, including CNN&#x2013;BERT fusion and token-disentangling transformers, increased transparency and localization while maintaining high accuracy [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>]. 
Interpretable systems such as ParallelNet and KoHMT extended multimodal fusion into cross-lingual and domain-specific contexts [<xref ref-type="bibr" rid="ref-36">36</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>], while sparsity- and uncertainty-based networks dynamically filtered redundant cues to improve robustness [<xref ref-type="bibr" rid="ref-24">24</xref>,<xref ref-type="bibr" rid="ref-38">38</xref>]. Domain-oriented research expanded MER into healthcare and physiological analysis by incorporating fusion based on electroencephalography (EEG) and remote photoplethysmography (rPPG), further supported by cross-subject generalization networks [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>,<xref ref-type="bibr" rid="ref-40">40</xref>]. Calibration and transfer-learning methods such as CDaT and CMERC refined confidence estimation and semantic consistency across conversational datasets [<xref ref-type="bibr" rid="ref-41">41</xref>,<xref ref-type="bibr" rid="ref-42">42</xref>].</p>
<p>The representational benefits of LLM-based MER come at substantial computational cost. Large parameter scales, deep transformer stacks, and multi-stage inference pipelines increase memory consumption, energy usage, and inference latency, which limits applicability in real-time, on-device, or large-scale deployment scenarios. In contrast, lighter hybrid fusion architectures that combine compact textual encoders with attention- or gating-based multimodal integration mechanisms often achieve a more favorable balance between expressive capacity and computational efficiency. This contrast underscores a fundamental design trade-off in contemporary MER, where the depth of reasoning enabled by LLMs must be carefully weighed against efficiency, scalability, and deployment constraints. Collectively, these considerations highlight that 2024 research consolidated MER not only through LLM integration and graph reasoning, but also by clarifying the practical boundaries between expressive power and computational feasibility.</p>
<p>The 2025 research wave continued the evolution of MER toward foundation-model&#x2013;oriented architectures emphasizing generative recovery, unsupervised learning, and explainability. Transformer-driven frameworks such as MemoCMT and RMER-DT leveraged cross-modal transformers and diffusion-based restoration to reconstruct missing modalities and capture long-range conversational dependencies [<xref ref-type="bibr" rid="ref-43">43</xref>,<xref ref-type="bibr" rid="ref-44">44</xref>]. Graph-spectrum and motion-aware methods, such as GS-MCC and MIST, refined relational and temporal fusion across multimodal signals [<xref ref-type="bibr" rid="ref-45">45</xref>,<xref ref-type="bibr" rid="ref-46">46</xref>]. Explainable and lightweight systems adopted Gradient-SHAP&#x2013;based feature attribution and EEG&#x2013;facial fusion to enhance transparency and efficiency without sacrificing performance [<xref ref-type="bibr" rid="ref-47">47</xref>&#x2013;<xref ref-type="bibr" rid="ref-49">49</xref>]. Weakly supervised frameworks such as MGAFR and MERITS-L combined graph aggregation, contrastive learning, and LLM-guided pretraining for robust adaptation across incomplete datasets [<xref ref-type="bibr" rid="ref-50">50</xref>,<xref ref-type="bibr" rid="ref-51">51</xref>]. Meanwhile, spiking-transformer hybrids such as SPSNCVT introduced bio-inspired temporal encoding to defend against adversarial perturbations [<xref ref-type="bibr" rid="ref-52">52</xref>], while multi-granularity and mixture-of-experts models advanced fine-grained alignment and dynamic expert gating [<xref ref-type="bibr" rid="ref-53">53</xref>&#x2013;<xref ref-type="bibr" rid="ref-55">55</xref>]. Collectively, 2025 research established MER as a convergence of foundation-model fusion, generative recovery, and explainable reasoning, signaling its transition toward scalable and human-aligned affective intelligence.</p>
<p>To provide a consolidated view of recent progress, <xref ref-type="table" rid="table-1">Table 1</xref> summarizes the performance results reported by representative MER studies published in the last five years. The results are grouped by dataset, and each dataset is paired with a single evaluation metric selected according to prevalent reporting practices in the literature; each row corresponds to an individual study. Only dataset&#x2013;metric pairs reported by at least ten papers are included to ensure that the comparison reflects sufficiently established empirical trends. All values are reproduced from the original publications under their respective experimental settings and are intended to illustrate performance tendencies rather than establish a unified benchmark.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Reported performance snapshots on public MER datasets. Values are reproduced from the original studies under their respective experimental settings.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Metric</th>
<th>Year</th>
<th>Fusion Approach</th>
<th>Score</th>
<th>Ref.</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="10"><bold>CMU-MOSEI</bold></td>
<td rowspan="10">F1-score</td>
<td>2025</td>
<td>Hybrid Fusion</td>
<td>86.17</td>
<td>[<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Early Fusion</td>
<td>87.90</td>
<td>[<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Hybrid Fusion</td>
<td>86.50</td>
<td>[<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Late Fusion</td>
<td>83.00</td>
<td>[<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>46.10</td>
<td>[<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>85.80</td>
<td>[<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>75.90</td>
<td>[<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>69.50</td>
<td>[<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Hybrid Fusion</td>
<td>86.23</td>
<td>[<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Hybrid Fusion</td>
<td>82.60</td>
<td>[<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
</tr>
<tr>
<td rowspan="10"><bold>MELD</bold></td>
<td rowspan="10">F1-score</td>
<td>2025</td>
<td>Early Fusion</td>
<td>69.00</td>
<td>[<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
</tr>
<tr>


<td>2025</td>
<td>Hybrid Fusion</td>
<td>67.02</td>
<td>[<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
</tr>
<tr>


<td>2025</td>
<td>Late Fusion</td>
<td>66.02</td>
<td>[<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Hybrid Fusion</td>
<td>67.02</td>
<td>[<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Early Fusion</td>
<td>66.85</td>
<td>[<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Hybrid Fusion</td>
<td>65.76</td>
<td>[<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>66.60</td>
<td>[<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>60.22</td>
<td>[<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>58.56</td>
<td>[<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Late Fusion</td>
<td>64.00</td>
<td>[<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
</tr>
<tr>
<td rowspan="11"><bold>IEMOCAP</bold></td>
<td rowspan="11">Accuracy</td>
<td>2025</td>
<td>Hybrid Fusion</td>
<td>81.33</td>
<td>[<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
</tr>
<tr>


<td>2025</td>
<td>Hybrid Fusion</td>
<td>80.24</td>
<td>[<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Hybrid Fusion</td>
<td>71.72</td>
<td>[<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Early Fusion</td>
<td>85.90</td>
<td>[<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>82.70</td>
<td>[<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Late Fusion</td>
<td>82.57</td>
<td>[<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>79.71</td>
<td>[<xref ref-type="bibr" rid="ref-60">60</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>77.00</td>
<td>[<xref ref-type="bibr" rid="ref-61">61</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>69.60</td>
<td>[<xref ref-type="bibr" rid="ref-62">62</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Hybrid Fusion</td>
<td>79.77</td>
<td>[<xref ref-type="bibr" rid="ref-63">63</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Hybrid Fusion</td>
<td>61.80</td>
<td>[<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
</tr>
<tr>
<td rowspan="11"><bold>RAVDESS</bold></td>
<td rowspan="11">Accuracy</td>
<td>2025</td>
<td>Hybrid Fusion</td>
<td>88.10</td>
<td>[<xref ref-type="bibr" rid="ref-64">64</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Hybrid Fusion</td>
<td>95.00</td>
<td>[<xref ref-type="bibr" rid="ref-65">65</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Early Fusion</td>
<td>93.18</td>
<td>[<xref ref-type="bibr" rid="ref-66">66</xref>]</td>
</tr>
<tr>


<td>2024</td>
<td>Late Fusion</td>
<td>81.90</td>
<td>[<xref ref-type="bibr" rid="ref-67">67</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Early Fusion</td>
<td>93.23</td>
<td>[<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
</tr>
<tr>


<td>2023</td>
<td>Hybrid Fusion</td>
<td>89.25</td>
<td>[<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Late Fusion</td>
<td>94.99</td>
<td>[<xref ref-type="bibr" rid="ref-68">68</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Hybrid Fusion</td>
<td>93.17</td>
<td>[<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
</tr>
<tr>


<td>2022</td>
<td>Late Fusion</td>
<td>86.00</td>
<td>[<xref ref-type="bibr" rid="ref-69">69</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Late Fusion</td>
<td>86.70</td>
<td>[<xref ref-type="bibr" rid="ref-70">70</xref>]</td>
</tr>
<tr>


<td>2021</td>
<td>Hybrid Fusion</td>
<td>80.08</td>
<td>[<xref ref-type="bibr" rid="ref-71">71</xref>]</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Embedding and Feature Representation</title>
<p>Emotion recognition relies heavily on how features are extracted and represented across heterogeneous modalities. Effective embedding enables the model to capture both low-level perceptual signals and high-level semantic patterns that reflect emotional states. This section groups feature representation methods into five categories: visual, audio, textual, sensor-based, and cross-modal. It highlights how each modality contributes to a comprehensive affective understanding and how feature integration before fusion enhances multimodal robustness and interpretability.</p>
<p><italic>Visual Feature Embedding</italic></p>
<p>Visual feature embedding serves as the primary channel for interpreting facial and bodily expressions that reflect subtle emotional states in MER. Early visual frameworks relied heavily on convolutional neural networks such as VGG and ResNet to extract spatial features from facial images, providing robust representations of muscle movements and expression intensities [<xref ref-type="bibr" rid="ref-72">72</xref>,<xref ref-type="bibr" rid="ref-73">73</xref>]. These methods primarily focused on static cues, achieving stable results in controlled environments but showing limited adaptability in dynamic, spontaneous contexts.</p>
<p>To capture temporal information, later architectures extended spatial encoders with recurrent or 3D convolutional layers. For instance, hybrid CNN&#x2013;BiLSTM and 3D-CNN models enabled spatio-temporal learning from sequential frames, successfully modeling emotion transitions in datasets such as RAVDESS and CK&#x002B; [<xref ref-type="bibr" rid="ref-46">46</xref>,<xref ref-type="bibr" rid="ref-62">62</xref>]. This evolution reflected the growing awareness that emotion perception is inherently dynamic, requiring models to track subtle temporal variations in facial action units. Some approaches also leveraged optical flow and motion vectors to supplement static frame representations, enriching temporal consistency in recognition outcomes [<xref ref-type="bibr" rid="ref-65">65</xref>].</p>
<p>Attention-based visual modules subsequently emerged to emphasize discriminative regions and suppress irrelevant variations caused by lighting or pose. Spatial and channel attention frameworks selectively amplified emotion-relevant facial areas, while transformer-based mechanisms learned context-aware global dependencies [<xref ref-type="bibr" rid="ref-27">27</xref>,<xref ref-type="bibr" rid="ref-74">74</xref>]. These designs significantly improved feature interpretability and generalization, particularly in in-the-wild datasets characterized by occlusion and heterogeneous subjects.</p>
<p>The introduction of Vision Transformers (ViT) marked a further paradigm shift in visual feature extraction. ViT and its hybrid variants captured long-range dependencies across facial regions, while capsule and graph-transformer hybrids improved relational reasoning by modeling inter-feature connectivity [<xref ref-type="bibr" rid="ref-75">75</xref>,<xref ref-type="bibr" rid="ref-76">76</xref>]. Such methods demonstrated higher robustness to complex emotional scenes, especially those involving microexpressions, head movements, and multi-person interactions.</p>
<p>Recently, visual embeddings have become tightly integrated into unified frameworks. Emotion-LLaMA [<xref ref-type="bibr" rid="ref-29">29</xref>] leveraged instruction-tuned learning to align visual, acoustic, and textual inputs for emotion reasoning, while MIST [<xref ref-type="bibr" rid="ref-46">46</xref>] combined ResNet-50 and 3D-CNN modules to analyze facial appearance and motion cues jointly. Other studies have sought to enhance data efficiency through self-supervised and cross-domain pretraining, incorporating contrastive or generative objectives to address limited labeled data and domain variability [<xref ref-type="bibr" rid="ref-48">48</xref>,<xref ref-type="bibr" rid="ref-73">73</xref>]. Collectively, these advancements reveal that visual embedding research has evolved from static CNN representations toward transformer-based, semantically aligned, and pretraining-enhanced paradigms that support scalable multimodal affect understanding.</p>
<p><italic>Audio Feature Embedding</italic></p>
<p>Audio feature embedding captures prosodic, spectral, and paralinguistic cues, such as tone, rhythm, and vocal intensity, that reflect affective states. Early studies primarily relied on handcrafted acoustic features, including Mel-frequency cepstral coefficients (MFCCs), pitch, energy, and zero-crossing rate, which were processed through classical classifiers or shallow neural networks [<xref ref-type="bibr" rid="ref-68">68</xref>,<xref ref-type="bibr" rid="ref-72">72</xref>]. Although these features provided interpretable descriptors of emotion-related variations, they lacked the ability to represent high-level semantics or contextual dependencies within speech.</p>
<p>The advent of deep learning expanded the expressiveness of acoustic representations through convolutional and recurrent encoders. CNN-based architectures were utilized to learn discriminative spectro-temporal patterns directly from Mel-spectrograms, capturing fine-grained frequency transitions and amplitude dynamics [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-65">65</xref>]. These models improved generalization across diverse speakers and recording environments. To capture long-range dependencies and rhythm-sensitive structures, BiLSTM and GRU layers were often appended to CNN backbones, effectively modeling temporal continuity in emotional speech sequences [<xref ref-type="bibr" rid="ref-39">39</xref>,<xref ref-type="bibr" rid="ref-62">62</xref>].</p>
<p>Self-supervised and transformer-based encoders have since redefined the audio embedding paradigm. Pretrained speech models such as wav2vec 2.0, HuBERT, and PANNs emerged as dominant backbones due to their capacity to represent phonetic and prosodic variations without extensive labeled data [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-37">37</xref>,<xref ref-type="bibr" rid="ref-49">49</xref>]. These models enabled more transferable emotion features by capturing latent patterns in pitch contour and energy modulation across multiple corpora. Studies such as MemoCMT [<xref ref-type="bibr" rid="ref-43">43</xref>] and MGCMA [<xref ref-type="bibr" rid="ref-53">53</xref>] further demonstrated that transformer-based acoustic embeddings preserve cross-temporal coherence, allowing stable alignment with linguistic content and contextual affect transitions.</p>
<p>Attention mechanisms have also played a crucial role in refining the representation of acoustic emotions. Channel and time-domain attention modules dynamically reweighted salient frequency bands to enhance emotion sensitivity while filtering noise and neutral speech segments [<xref ref-type="bibr" rid="ref-32">32</xref>,<xref ref-type="bibr" rid="ref-34">34</xref>]. More recent designs integrated multi-scale or token-level attention to focus on rhythm discontinuities, emotional bursts, and silence intervals that correlate with affective intensity [<xref ref-type="bibr" rid="ref-35">35</xref>,<xref ref-type="bibr" rid="ref-55">55</xref>]. These methods bridged low-level spectral variation with higher-level emotional intent, significantly improving discrimination among subtle emotions such as anxiety, surprise, and contempt.</p>
<p>Recent research has also explored bioacoustic and physiological speech correlates to expand the affective dimension of auditory representation. Studies combining acoustic features with auxiliary sensor data, such as respiratory or vocal muscle signals, have shown enhanced robustness under noisy or spontaneous conditions [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>]. Furthermore, end-to-end audio pipelines employing variational or diffusion-based augmentation strategies have emerged to increase resilience against data imbalance and adversarial perturbations [<xref ref-type="bibr" rid="ref-52">52</xref>,<xref ref-type="bibr" rid="ref-77">77</xref>].</p>
<p>Taken together, advancements in audio embedding have transitioned from handcrafted spectral analysis to deep transformer-driven architectures that model emotion as a structured, temporally evolving process. The evolution of pretrained speech encoders and attention-based refinement has made acoustic representation a crucial foundation for capturing implicit affective dynamics in MER systems.</p>
<p><italic>Text Feature Embedding</italic></p>
<p>Textual feature embedding provides the semantic and syntactic foundation for understanding the affective meaning conveyed by linguistic expressions, such as word choice, sentence structure, and discourse context. In MER, text serves as a high-level cue that complements perceptual modalities by explicitly expressing emotions through sentiment, appraisal, or figurative language. Early approaches employed word-level features such as bag-of-words and TF&#x2013;IDF vectors or used static distributed representations like Word2Vec and GloVe to encode emotional semantics within utterances [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-62">62</xref>]. These representations enabled initial progress in textual sentiment classification but lacked context awareness and failed to capture polarity shifts across sentences.</p>
<p>The introduction of deep neural encoders such as CNNs, RNNs, and Bi-LSTMs enhanced the capacity to model sequential dependencies and contextual nuances in emotional text [<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-34">34</xref>]. Hierarchical recurrent networks further incorporated dialogue-level context, enabling emotion understanding in conversations where prior utterances influence the affective meaning of later ones. Despite these advancements, such models remained constrained by their reliance on fixed context windows and struggled to represent long-range semantic dependencies or subtle pragmatic cues such as sarcasm and irony.</p>
<p>Transformer-based encoders have fundamentally reshaped textual emotion representation by introducing self-attention mechanisms that capture global dependencies and polysemous word usage. As LLMs proliferate, pretrained language models such as BERT, RoBERTa, DeBERTa, and ELECTRA have become dominant backbones for textual embedding in MER, offering fine-grained contextual representations of emotional polarity and intensity [<xref ref-type="bibr" rid="ref-37">37</xref>,<xref ref-type="bibr" rid="ref-58">58</xref>,<xref ref-type="bibr" rid="ref-60">60</xref>]. These models learn not only lexical associations but also discourse-level sentiment trajectories, allowing them to align with temporal or conversational structures when combined with other modalities. Moreover, fine-tuning on emotion-specific corpora such as IEMOCAP, MELD, and CMU-MOSEI has proven effective for adapting general-purpose transformers to affective language tasks.</p>
<p>Beyond single-stream textual encoding, recent research emphasizes semantic disentanglement and cross-modal alignment at the embedding level. Models such as FDRL [<xref ref-type="bibr" rid="ref-78">78</xref>] and CAMEL [<xref ref-type="bibr" rid="ref-75">75</xref>] decouple shared and modality-specific semantic spaces, enabling robust representation of emotional meaning even in metaphorical or abstract expressions. Other approaches integrate linguistic prosody by aligning textual embeddings with the corresponding acoustic cues, forming emotion-aware joint representations that account for both explicit and implicit affect [<xref ref-type="bibr" rid="ref-38">38</xref>,<xref ref-type="bibr" rid="ref-43">43</xref>].</p>
<p>LLM-based architectures have recently extended textual embedding toward reasoning-level affect understanding. Systems such as Emotion-LLaMA [<xref ref-type="bibr" rid="ref-29">29</xref>] and DialogueMLLM [<xref ref-type="bibr" rid="ref-30">30</xref>] incorporate instruction-tuned and dialogue-aware pretraining to interpret emotional intent, causal relations, and contextual shifts. These frameworks highlight an emerging trend in textual embedding: a transition from surface-level sentiment extraction to reasoning-driven affect interpretation that supports generalized multimodal emotion understanding.</p>
<p><italic>Sensor Feature Embedding</italic></p>
<p>Sensor-based feature embedding focuses on physiological and behavioral signals that reflect internal affective states beyond visible or auditory expression. Typical modalities include electroencephalography (EEG), electrocardiography (ECG), galvanic skin response (GSR), electromyography (EMG), motion capture, and remote photoplethysmography (rPPG). These signals capture neural and autonomic activity patterns related to arousal, stress, and emotion, enabling the interpretation of latent affective cues that are not directly observable.</p>
<p>Early approaches relied on convolutional and recurrent architectures to extract time&#x2013;frequency and temporal dynamics from multi-channel signals. For instance, depthwise separable CNNs and Bi-LSTM models were used to model spatial and temporal dependencies in EEG and ECG data, improving recognition accuracy in healthcare-related affective analytics [<xref ref-type="bibr" rid="ref-2">2</xref>]. Recent architectures such as graph neural networks and transformers have been employed to model spatial correlations among electrodes or motion nodes, as demonstrated in multimodal physiological frameworks like MEmoR [<xref ref-type="bibr" rid="ref-16">16</xref>] and TDTN-HLFR [<xref ref-type="bibr" rid="ref-79">79</xref>], which integrate physiological and visual cues for robust emotion inference.</p>
<p>Despite their effectiveness, sensor-based embeddings face challenges due to signal noise, inter-subject variability, and the limited availability of large-scale labeled datasets. To address these issues, recent studies have adopted normalization, domain adaptation, and adversarial training strategies to enhance cross-subject generalization. For example, the CAG-MoE framework [<xref ref-type="bibr" rid="ref-54">54</xref>] integrates multiple sensor streams using cross-attention and gated mixtures of experts to adaptively weight modalities in the presence of noisy or missing data. These advances highlight a shift toward multimodal physiological fusion, combining sensors with visual or acoustic cues to achieve interpretable and context-aware emotion recognition across diverse environments.</p>
<p><italic>Cross-Modal Feature Representation</italic></p>
<p>Cross-modal feature representation aims to align heterogeneous modality embeddings into a unified latent space where shared affective semantics can be effectively captured. Unlike unimodal encoders that learn features independently, this approach focuses on discovering common structures and correlations between modalities such as audio, visual, text, and physiological signals. By projecting them into a shared embedding space, models can transfer emotion-relevant information across modalities, enabling robust recognition even under degraded or missing conditions.</p>
<p>Early studies used deep canonical correlation analysis and autoencoder-based projection to align multimodal features, as seen in MERDCCA [<xref ref-type="bibr" rid="ref-11">11</xref>] and MCWSA-CMHA [<xref ref-type="bibr" rid="ref-80">80</xref>], which enforced correlation consistency between GRU- or CNN-encoded features. More recent frameworks introduced contrastive and self-supervised objectives, exemplified by modality-pairwise contrastive learning [<xref ref-type="bibr" rid="ref-57">57</xref>] and contrastive modality-invariant representation models like IF-MMIN [<xref ref-type="bibr" rid="ref-25">25</xref>] and CIF-MMIN [<xref ref-type="bibr" rid="ref-81">81</xref>], which improved robustness to missing or corrupted modalities. Other approaches, such as CAMEL [<xref ref-type="bibr" rid="ref-75">75</xref>], extended this idea by incorporating metaphor-aware contrastive alignment and disentangled contextual learning to capture higher-order semantic relationships between modalities.</p>
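<p>The contrastive objectives mentioned above can be illustrated with a minimal sketch of an InfoNCE-style loss over paired modality embeddings. The function names, the toy two-dimensional vectors, the one-directional (audio-to-text) formulation, and the temperature of 0.1 are assumptions chosen for illustration; they are not taken from any of the surveyed models.</p>
<preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(audio_embs, text_embs, temperature=0.1):
    """Mean InfoNCE loss; the i-th audio and i-th text embedding form the positive pair."""
    losses = []
    for i, a in enumerate(audio_embs):
        logits = [cosine(a, t) / temperature for t in text_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings: matched pairs point in similar directions.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(audio, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(audio, [[0.1, 0.9], [0.9, 0.1]])
print(aligned, shuffled)  # the aligned pairing yields the smaller loss
```
</preformat>
<p>Minimizing such a loss pulls matched cross-modal pairs together in the shared space and pushes mismatched pairs apart, which is the general intuition behind contrastive modality alignment.</p>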
<p>In parallel, the rise of large pre-trained encoders and multimodal foundation models has accelerated the use of CLIP-style alignment and prompt-tuned joint embeddings. Architectures such as MEmoBERT [<xref ref-type="bibr" rid="ref-14">14</xref>], KoHMT [<xref ref-type="bibr" rid="ref-37">37</xref>], and DialogueMLLM [<xref ref-type="bibr" rid="ref-30">30</xref>] demonstrate how instruction-tuned or cross-modal transformer layers unify feature spaces across modalities, enabling emotion recognition and reasoning through shared attention mechanisms. These models often leverage knowledge distillation or adversarial feature alignment to ensure modality consistency while preserving discriminative affective cues.</p>
<p>Recent research trends increasingly emphasize the transition from simple feature concatenation toward semantically aligned representation learning, where emotion-relevant patterns are disentangled, normalized, and jointly optimized across modalities. Although computationally intensive, such alignment-driven methods serve as a conceptual bridge between feature-level fusion and foundation-level multimodal reasoning, marking an essential step toward scalable, generalizable emotion-understanding systems.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Fusion Approaches in MER</title>
<p>Fusion serves as the core mechanism through which MER systems integrate heterogeneous affective cues into a unified representation space. As summarized in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, existing studies can be broadly grouped into three paradigms: early fusion, late fusion, and hybrid fusion. Early fusion aggregates features before inference; late fusion combines modality-specific decisions at the score level; and hybrid fusion introduces intermediate interactions via attention, graph, or transformer-based modules.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Taxonomy of fusion strategies in MER, organized into early, late, and hybrid fusion with representative implementation patterns.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-2.tif"/>
</fig>
<p>As discussed in <xref ref-type="sec" rid="s1">Section 1</xref>, the effectiveness of MER systems largely depends on how and when modalities are combined. Each of the three paradigms reflects a distinct design philosophy and a different trade-off between representational richness and interpretability. Early fusion strategies concatenate or jointly encode raw or low-level features prior to inference, promoting deep cross-modal interactions but facing challenges from distributional mismatches. Late fusion aggregates modality-specific predictions at the decision level, allowing modular training and interpretability but often neglecting temporal and contextual dependencies. Hybrid fusion introduces intermediate-level integration through attention, gating, or transformer-based alignment mechanisms, balancing modality specialization with joint representation learning.</p>
<p>Before 2021, MER studies predominantly relied on early and late fusion schemes that combined handcrafted acoustic and visual features or aggregated classifier outputs at the decision level. As deep neural networks and transformer-based mechanisms matured, the field experienced a paradigm shift toward hybrid fusion, enabling adaptive and context-aware integration among modalities. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> presents the distribution of fusion approaches adopted between 2021 and 2025. While early and late fusion remain relevant for interpretability and lightweight deployment, hybrid fusion dominates recent research, reflecting its capacity to model complex inter-modality relationships and temporal dependencies.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Trends in the adoption of early, late, and hybrid fusion approaches in MER from 2021 to 2025. Hybrid fusion has consistently maintained the highest adoption rate, signifying its role as the mainstream integration strategy in modern MER architectures.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-3.tif"/>
</fig>
<p><italic>Early Fusion Approach</italic></p>
<p>Early fusion is the most direct strategy for integrating multimodal inputs, where heterogeneous affective cues such as audio, visual, and textual features are combined before inference. In this approach, the model receives concatenated or jointly encoded representations of multiple modalities, enabling it to learn cross-modal relationships within a unified embedding space. This joint optimization allows fine-grained emotional correlations, such as prosodic variations aligning with subtle facial expressions, to be captured at an early stage of learning. Despite these strengths, the design increases the risk of imbalance between modalities due to differences in feature distributions or temporal resolutions.</p>
<p>In MER, early fusion played an essential role in the initial adoption of deep learning&#x2013;based frameworks. Many early studies focused on combining acoustic and visual features through convolutional or recurrent architectures to capture both spatial and temporal emotional dynamics. For instance, CNN&#x2013;BiLSTM pipelines effectively modeled frame-level facial features and spectral variations in speech, achieving synchronized recognition of expressive cues across modalities [<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-65">65</xref>]. Later studies extended this concept to physiological signals such as EEG and rPPG, demonstrating that fusing visual and biosignal features enables more stable, noise-resistant affect prediction, particularly in healthcare and real-time monitoring environments [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>,<xref ref-type="bibr" rid="ref-73">73</xref>]. Early fusion has also been applied to text&#x2013;speech pairs, where embeddings from pretrained models such as HuBERT, BERT, or fastText are jointly processed to enhance efficiency in conversational emotion recognition tasks [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-56">56</xref>].</p>
<p>Early fusion can be summarized as a process in which modality-specific features are first extracted, aligned, and concatenated into a joint feature space before being passed to shared layers for classification. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> illustrates this mechanism, showing how feature-level integration encourages direct cross-modal interaction but requires precise temporal alignment among inputs. Such direct concatenation supports dense inter-modality learning but often amplifies noise or redundancy when one modality dominates the shared representation. To address these issues, recent models incorporate lightweight normalization or attention-based balancing layers, aiming to retain the simplicity of early fusion while improving its stability in large-scale multimodal settings [<xref ref-type="bibr" rid="ref-35">35</xref>].</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Early fusion integrates modality-specific features, such as audio, visual, and text, into a shared representation space prior to inference. This joint feature-level combination enables cross-modal learning but increases sensitivity to temporal misalignment and imbalance across modalities.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-4.tif"/>
</fig>
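<p>The feature-level concatenation described for early fusion can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the per-modality z-score normalization (one simple form of the lightweight normalization mentioned above), and the toy feature values are hypothetical and do not reproduce any surveyed system.</p>
<preformat>
```python
import math


def zscore(features):
    """Standardize one modality's feature vector to zero mean and unit variance."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    std = math.sqrt(var) or 1.0  # guard against constant vectors
    return [(x - mean) / std for x in features]


def early_fusion(modality_features):
    """Normalize each modality, then concatenate into one joint feature vector."""
    fused = []
    for name in sorted(modality_features):  # fixed order keeps the layout stable
        fused.extend(zscore(modality_features[name]))
    return fused


# Hypothetical per-utterance features; values and sizes are illustrative only.
sample = {
    "audio": [0.2, 0.9, 0.4],
    "visual": [12.0, 15.0, 9.0, 30.0],
    "text": [0.01, -0.03],
}
joint = early_fusion(sample)
print(len(joint))  # 9 = 3 + 4 + 2
```
</preformat>
<p>Without the normalization step, the visual features in this toy example would dominate the joint vector purely because of their larger scale, which is the modality imbalance that early fusion is prone to.</p>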
<p>Overall, early fusion remains a foundational yet evolving paradigm in MER. Its strength lies in its ability to promote deep joint representation learning and real-time processing efficiency, making it particularly suitable for synchronized or resource-limited scenarios. Nevertheless, its sensitivity to misaligned or missing data and its limited flexibility in dynamically weighting modalities have driven the emergence of hybrid fusion architectures, which integrate the representational depth of early fusion with adaptive mechanisms for modality control.</p>
<p><italic>Late Fusion Approach</italic></p>
<p>Late fusion refers to the integration of modality-specific predictions at the decision stage rather than during feature representation learning. In this paradigm, each modality, such as audio, visual, or text, is processed independently through its own encoder or classifier, and the resulting emotion probabilities are aggregated through ensemble mechanisms such as weighted averaging, majority voting, or trainable gating networks. This modular structure allows models to flexibly combine heterogeneous input sources and maintain interpretability, making late fusion particularly useful in scenarios where each modality performs reliably in isolation.</p>
<p>Representative implementations have demonstrated its practicality across diverse emotion recognition settings. For instance, ensemble-based audiovisual systems such as MIST [<xref ref-type="bibr" rid="ref-46">46</xref>] integrate ResNet-50, 3D-CNN, and Semi-CNN features through weighted decision aggregation, while the real-time framework of Dixit et al. [<xref ref-type="bibr" rid="ref-56">56</xref>] employs 1D and 2D CNNs with text embeddings to achieve efficient inference through random forest fusion. Similarly, the multimodal learning scheme of Radoi et al. [<xref ref-type="bibr" rid="ref-67">67</xref>] improves robustness under resource constraints by combining uncertainty-aware CNN outputs from audio and visual streams.</p>
<p>As illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, late fusion operates by first generating independent modality predictions that are subsequently merged at the decision level. Each stream outputs a distinct emotional probability distribution, and these are unified through ensemble or gating mechanisms to produce the final classification result. This structure enables flexible model substitution and modular retraining without altering the entire pipeline, a key advantage in large-scale or continually evolving affective computing systems.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Late fusion combines modality-specific predictions, such as those from audio, visual, and text classifiers, at the decision level through weighted or ensemble aggregation. This design enhances modularity and interpretability but limits the ability to model fine-grained temporal and contextual dependencies across modalities.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-5.tif"/>
</fig>
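<p>The decision-level aggregation described for late fusion can be sketched with two of the ensemble rules named above, weighted averaging and majority voting. The label set, the probability values, and the modality weights are hypothetical and serve only to make the example concrete.</p>
<preformat>
```python
from collections import Counter


def weighted_average(probabilities, weights):
    """Fuse per-modality class probabilities using normalized modality weights."""
    total = sum(weights[m] for m in probabilities)
    n_classes = len(next(iter(probabilities.values())))
    fused = [0.0] * n_classes
    for modality, probs in probabilities.items():
        for i, p in enumerate(probs):
            fused[i] += weights[modality] / total * p
    return fused


def majority_vote(probabilities):
    """Return the class index predicted by most modalities (ties: first seen)."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities.values()]
    return Counter(votes).most_common(1)[0][0]


LABELS = ["angry", "happy", "neutral"]
# Hypothetical outputs of three independently trained classifiers.
preds = {
    "audio": [0.6, 0.3, 0.1],
    "visual": [0.2, 0.7, 0.1],
    "text": [0.1, 0.6, 0.3],
}
weights = {"audio": 1.0, "visual": 2.0, "text": 1.0}

fused = weighted_average(preds, weights)
print(LABELS[max(range(len(fused)), key=fused.__getitem__)])  # happy
print(LABELS[majority_vote(preds)])  # happy
```
</preformat>
<p>Because each classifier only contributes a probability vector, any single stream can be retrained or replaced without touching the others, which reflects the modularity attributed to late fusion.</p>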
<p>Although late fusion offers modularity and flexibility, the separation between representation learning and decision integration limits the model&#x2019;s ability to capture temporal synchronization and semantic dependencies across modalities. Consequently, late fusion frameworks may struggle with tasks that require subtle context inference or emotional transition tracking. Recent approaches, such as CAG-MoE [<xref ref-type="bibr" rid="ref-54">54</xref>], attempt to address these limitations by incorporating cross-attention and expert gating mechanisms, bridging decision-level integration with intermediate representation learning.</p>
<p><italic>Hybrid Fusion Approach</italic></p>
<p>Hybrid fusion represents an intermediate paradigm that bridges early and late fusion by enabling cross-modal interactions at multiple stages of representation. Rather than fusing only raw features or final predictions, hybrid frameworks incorporate adaptive mechanisms that dynamically determine how modalities interact at the feature, intermediate, or decision level. As shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, this paradigm integrates attention, gating, and transformer-based alignment modules to facilitate both modality-specific specialization and shared representation learning. Such flexibility allows hybrid fusion to balance the expressiveness of early fusion with the modular interpretability of late fusion, becoming the dominant design trend in MER.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Hybrid fusion introduces intermediate-level interactions between modalities through attention, gating, or transformer-based alignment. This approach balances the complementary strengths of early and late fusion, capturing both fine-grained cross-modal dependencies and high-level semantic coherence.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_76411-fig-6.tif"/>
</fig>
<p>Attention-based hybrid fusion utilizes selective weighting to highlight the most informative modality features while suppressing noisy or irrelevant signals. Cross-modal attention modules compute inter-modal relevance, allowing one modality (e.g., audio) to conditionally refine another (e.g., visual) based on emotional salience. Early implementations employed hierarchical co-attention [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>], where intra-modal attention first captured local dependencies before inter-modal attention fused complementary cues. More advanced variants, such as DF-ERC [<xref ref-type="bibr" rid="ref-18">18</xref>] and MER-HAN [<xref ref-type="bibr" rid="ref-19">19</xref>], integrated multi-head self-attention and context-aware mechanisms to model temporal emotion progression across dialogue turns. Recent works like MemoCMT [<xref ref-type="bibr" rid="ref-43">43</xref>] further evolved this paradigm by applying cross-modal transformers with HuBERT&#x2013;BERT embeddings, enabling long-range dependency modeling and fine-grained contextual synchronization. Through these refinements, attention-based fusion has established itself as a robust and interpretable design for modeling interdependent emotional cues.</p>
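<p>The cross-modal attention idea can be sketched as scaled dot-product attention in which tokens from one modality act as queries and tokens from another act as keys and values. In this minimal example, visual tokens are refined by audio tokens; the learned projection matrices used in practice are omitted, and all names and values are illustrative assumptions.</p>
<preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query token mixes the value tokens."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([
            sum(w * v[d] for w, v in zip(attn, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Toy 2-dimensional tokens; learned projections are omitted for brevity.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]             # queries (2 frames)
audio_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # keys and values (3 frames)

refined, attn = cross_modal_attention(visual_tokens, audio_tokens, audio_tokens)
print([round(w, 3) for w in attn[0]])  # weights of the first visual frame
```
</preformat>
<p>Each row of attention weights sums to one, so every refined visual token is a convex combination of audio tokens, weighted by how strongly the two modalities agree at that position.</p>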
<p>Gating-based mechanisms focus on dynamically controlling the contribution of each modality to the final representation. Instead of fixed fusion weights, these methods learn adaptive gates, often implemented via sigmoid or softmax functions, that regulate the importance of each modality according to its reliability, context, or signal quality. Representative studies such as COLD Fusion [<xref ref-type="bibr" rid="ref-24">24</xref>] and GMA [<xref ref-type="bibr" rid="ref-55">55</xref>] apply calibrated or gated attention modules that modulate fusion weights based on uncertainty estimation and modality-specific confidence. Similarly, DF-ERC and IF-MMIN [<xref ref-type="bibr" rid="ref-25">25</xref>] introduced dynamic gating to handle missing or degraded modalities, reconstructing latent representations through auxiliary imagination modules. These methods allow hybrid fusion models to adjust modality influence in real time, improving robustness under noise or partial observation. Gating mechanisms have thus become central to emotion recognition systems that operate under unpredictable environmental or conversational conditions.</p>
<p>Transformer-based and alignment-oriented hybrid fusion frameworks extend the attention paradigm by explicitly modeling the correspondence between modalities through positional encoding and multi-head projection spaces. These models treat each modality as a token sequence within a unified latent space, where self- and cross-attention jointly learn modality relationships across time. Architectures such as TDFNet [<xref ref-type="bibr" rid="ref-59">59</xref>], MGCMA [<xref ref-type="bibr" rid="ref-53">53</xref>], and CDaT [<xref ref-type="bibr" rid="ref-41">41</xref>] exemplify this approach, combining deep-scale transformers or contrastive alignment to capture hierarchical dependencies. More recent instruction-tuned multimodal LLMs, such as DialogueMLLM [<xref ref-type="bibr" rid="ref-30">30</xref>] and Emotion-LLaMA [<xref ref-type="bibr" rid="ref-29">29</xref>], further generalize the concept by embedding emotional cues from audio and visual modalities into language tokens, achieving unified reasoning through text-conditioned alignment. These frameworks represent the evolution of hybrid fusion toward scalable and semantically grounded architectures, merging foundation-model-level reasoning with traditional cross-modal interaction.</p>
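<p>A sigmoid gate of the kind described above can be sketched as a scalar that blends two modality vectors. This is an illustrative simplification: the function names and the gate parameters are hypothetical, and a trained model would learn the parameters (and often use one gate value per feature dimension) rather than fixing them by hand.</p>
<preformat>
```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_audio, h_visual, gate_weights, gate_bias):
    """Blend two same-sized modality vectors with a scalar sigmoid gate.

    gate  = sigmoid(w . [h_audio; h_visual] + b)
    fused = gate * h_audio + (1 - gate) * h_visual
    """
    joint = h_audio + h_visual  # list concatenation
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(h_audio, h_visual)]
    return fused, gate


# Hypothetical parameters; in a trained model these would be learned.
w = [0.5, 0.5, -0.5, -0.5]
b = 0.0
fused, gate = gated_fusion([1.0, 1.0], [0.0, 0.0], w, b)
print(round(gate, 3))  # 0.731
```
</preformat>
<p>A gate value near 1 lets the audio stream dominate and a value near 0 favors the visual stream, so a degraded or missing modality can be down-weighted for a given input instead of receiving a fixed share.</p>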
<p>While hybrid fusion offers the most flexible and powerful integration paradigm, it often demands substantial computational resources and large-scale pretraining to achieve stable convergence. The use of multi-head attention, gating networks, and transformer stacks introduces additional parameters, leading to latency and interpretability trade-offs. To address these issues, recent research has explored lightweight hybrid designs such as ParallelNet [<xref ref-type="bibr" rid="ref-36">36</xref>] and SIA-Net [<xref ref-type="bibr" rid="ref-38">38</xref>], which maintain intermediate fusion capabilities through sparse attention and Shapley-value interpretability analysis. Collectively, hybrid fusion reflects the culmination of fusion strategy development, integrating the advantages of early and late paradigms while paving the way toward generalized, scalable, and context-aware multimodal emotion understanding.</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Datasets and Evaluation Metrics</title>
<p>Datasets and evaluation metrics form the empirical backbone of MER research, shaping both model development and the way progress is assessed across studies. The choice of dataset determines the diversity of affective cues and interaction contexts available for learning, while evaluation metrics define how reliably these models capture emotional dynamics under varying conditions. This section first summarizes widely used multimodal emotion datasets and then reviews the primary metrics adopted in both categorical and continuous affect prediction.</p>
<p><italic>Datasets</italic></p>
<p>Within MER, the available datasets determine the richness of affective cues, the diversity of contexts, and the methodological direction of model development. The field has evolved from early, controlled audiovisual collections to large-scale, naturalistic datasets that integrate text, physiological signals, and conversational structures. This subsection outlines the major dataset categories used in MER research, highlighting their characteristics and the roles they play in shaping fusion strategies, representation learning, and evaluation protocols.</p>
<p>Early multimodal research relied primarily on audiovisual datasets, which provide aligned facial expressions and speech for studying foundational fusion architectures. Controlled and acted datasets such as RAVDESS [<xref ref-type="bibr" rid="ref-83">83</xref>], SAVEE [<xref ref-type="bibr" rid="ref-84">84</xref>], and eNTERFACE [<xref ref-type="bibr" rid="ref-85">85</xref>] offer clean, well-structured emotional expressions that are useful for benchmarking feature extraction and early/late fusion techniques. More complex audiovisual resources, including IEMOCAP [<xref ref-type="bibr" rid="ref-82">82</xref>], MSP-IMPROV [<xref ref-type="bibr" rid="ref-87">87</xref>], CREMA-D [<xref ref-type="bibr" rid="ref-86">86</xref>], and Aff-Wild2 [<xref ref-type="bibr" rid="ref-89">89</xref>], introduce spontaneous interaction, continuous annotations, and environmental variability, enabling research on temporal modeling, robustness, and continuous affect prediction. Datasets such as BAUM-1 [<xref ref-type="bibr" rid="ref-88">88</xref>] further contribute spontaneous, interview-style emotional behavior, supporting studies on naturalistic visual&#x2013;acoustic synchrony.</p>
<p>As multimodal learning expanded toward conversational and context-aware affect analysis, text-inclusive datasets became increasingly central. CMU-MOSI [<xref ref-type="bibr" rid="ref-90">90</xref>] and CMU-MOSEI [<xref ref-type="bibr" rid="ref-91">91</xref>] provide rich monologue-style annotations with corresponding audio and video streams, supporting research on sentiment grounding, multimodal alignment, and hybrid fusion. Dialogue-oriented datasets such as MELD [<xref ref-type="bibr" rid="ref-92">92</xref>], EmoryNLP [<xref ref-type="bibr" rid="ref-94">94</xref>], and UR-FUNNY [<xref ref-type="bibr" rid="ref-95">95</xref>] introduce multi-party interaction, turn-taking, and humor-related affect, expanding MER toward conversational settings. These datasets have been instrumental in advancing transformer-based architectures, cross-modal attention, and co-representation learning across text, audio, and vision.</p>
<p>In parallel, physiological and sensor-based datasets broadened the scope of MER by focusing on internal affective responses that are often inaccessible through external behavior alone. DEAP [<xref ref-type="bibr" rid="ref-99">99</xref>], MAHNOB-HCI [<xref ref-type="bibr" rid="ref-100">100</xref>], AMIGOS [<xref ref-type="bibr" rid="ref-102">102</xref>], BioVid EmoDB [<xref ref-type="bibr" rid="ref-101">101</xref>], and DREAMER [<xref ref-type="bibr" rid="ref-103">103</xref>] contain combinations of EEG, ECG, GSR, EMG, rPPG, and auxiliary video recordings. Their controlled designs and continuous valence&#x2013;arousal annotations make them essential for studying affective computing from a biometric perspective, enabling research on domain adaptation, sensor fusion, and personalized emotion modeling. These datasets complement audiovisual corpora by capturing implicit affective states that remain stable even under occlusion or ambiguous facial behavior.</p>
<p>Building on these foundations, recent years have introduced large-scale and open-domain datasets designed to support robust MER in real-world environments. Resources such as MER2023 [<xref ref-type="bibr" rid="ref-96">96</xref>], MER2024 [<xref ref-type="bibr" rid="ref-97">97</xref>], and the MER2025 Challenge dataset [<xref ref-type="bibr" rid="ref-98">98</xref>] capture multilingual, cross-domain, and sensor-augmented emotional behavior across diverse demographics. Their large size and ecological variability enable the training of data-intensive models, such as transformer-based fusion networks, cross-modal contrastive learners, and self-supervised multimodal encoders, while providing benchmarks suitable for evaluating generalization beyond controlled laboratory settings.</p>
<p>A consolidated overview of representative MER datasets is presented in <xref ref-type="table" rid="table-2">Table 2</xref>. Together, these datasets illustrate the progression from controlled audiovisual corpora to rich, multimodal, large-scale resources that reflect the complexity of real-world affect. This evolution has shaped the methodological direction of MER, enabling increasingly sophisticated fusion architectures and representation learning frameworks.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Summary of representative MER datasets from 2021&#x2013;2025.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Modalities (V: visual, A: audio, T: text, S: physiological/sensor)</th>
<th>No. of Samples</th>
<th>Annotation</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="4"><italic>Audiovisual Datasets</italic></td>
</tr>
<tr>
<td>IEMOCAP [<xref ref-type="bibr" rid="ref-82">82</xref>]</td>
<td>V, A, T</td>
<td>12 h/10K utterances</td>
<td>Valence, Arousal, Categories</td>
</tr>
<tr>
<td>RAVDESS [<xref ref-type="bibr" rid="ref-83">83</xref>]</td>
<td>V, A</td>
<td>7356 clips</td>
<td>8 emotions</td>
</tr>
<tr>
<td>SAVEE [<xref ref-type="bibr" rid="ref-84">84</xref>]</td>
<td>V, A</td>
<td>480 clips</td>
<td>7 emotions</td>
</tr>
<tr>
<td>eNTERFACE [<xref ref-type="bibr" rid="ref-85">85</xref>]</td>
<td>V, A</td>
<td>1260 videos</td>
<td>6 emotions</td>
</tr>
<tr>
<td>CREMA-D [<xref ref-type="bibr" rid="ref-86">86</xref>]</td>
<td>V, A</td>
<td>7442 clips</td>
<td>6 emotions</td>
</tr>
<tr>
<td>MSP-IMPROV [<xref ref-type="bibr" rid="ref-87">87</xref>]</td>
<td>V, A, T</td>
<td>8438 segments</td>
<td>Valence, Arousal</td>
</tr>
<tr>
<td>BAUM-1 [<xref ref-type="bibr" rid="ref-88">88</xref>]</td>
<td>V, A</td>
<td>1200 videos</td>
<td>7 emotions</td>
</tr>
<tr>
<td>Aff-Wild2 [<xref ref-type="bibr" rid="ref-89">89</xref>]</td>
<td>V, A</td>
<td>2.8M frames</td>
<td>Valence, Arousal, Expr. Intensity</td>
</tr>
<tr>
<td colspan="4"><italic>Text-Inclusive Multimodal Datasets</italic></td>
</tr>
<tr>
<td>CMU-MOSI [<xref ref-type="bibr" rid="ref-90">90</xref>]</td>
<td>V, A, T</td>
<td>2199 opinion segments</td>
<td>Sentiment/Valence</td>
</tr>
<tr>
<td>CMU-MOSEI [<xref ref-type="bibr" rid="ref-91">91</xref>]</td>
<td>V, A, T</td>
<td>23,453 segments</td>
<td>Sentiment/Emotions</td>
</tr>
<tr>
<td>MELD [<xref ref-type="bibr" rid="ref-92">92</xref>]</td>
<td>V, A, T</td>
<td>13,000 utterances</td>
<td>7 emotions</td>
</tr>
<tr>
<td>DailyDialog [<xref ref-type="bibr" rid="ref-93">93</xref>]</td>
<td>T</td>
<td>13,118 dialogues</td>
<td>7 emotions</td>
</tr>
<tr>
<td>EmoryNLP [<xref ref-type="bibr" rid="ref-94">94</xref>]</td>
<td>V, A, T</td>
<td>12,000 utterances</td>
<td>7 emotions</td>
</tr>
<tr>
<td>UR-FUNNY [<xref ref-type="bibr" rid="ref-95">95</xref>]</td>
<td>V, A, T</td>
<td>1864 videos</td>
<td>Humor &#x002B; Emotion</td>
</tr>
<tr>
<td>MER2023 [<xref ref-type="bibr" rid="ref-96">96</xref>]</td>
<td>V, A, T</td>
<td>250 h/50K clips</td>
<td>8 emotions</td>
</tr>
<tr>
<td>MER2024 [<xref ref-type="bibr" rid="ref-97">97</xref>]</td>
<td>V, A, T</td>
<td>300 h (multilingual)</td>
<td>Valence, Arousal, Engagement</td>
</tr>
<tr>
<td>MER2025 [<xref ref-type="bibr" rid="ref-98">98</xref>]</td>
<td>V, A, T, S</td>
<td>500&#x002B; h/10K subjects</td>
<td>Valence, Arousal, Dominance</td>
</tr>
<tr>
<td colspan="4"><italic>Physiological and Sensor-Based Datasets</italic></td>
</tr>
<tr>
<td>DEAP [<xref ref-type="bibr" rid="ref-99">99</xref>]</td>
<td>V, S</td>
<td>32 subjects <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 40 trials</td>
<td>Valence, Arousal</td>
</tr>
<tr>
<td>MAHNOB-HCI [<xref ref-type="bibr" rid="ref-100">100</xref>]</td>
<td>V, S</td>
<td>30 subjects <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 20 trials</td>
<td>Valence, Arousal, Dominance</td>
</tr>
<tr>
<td>BioVid EmoDB [<xref ref-type="bibr" rid="ref-101">101</xref>]</td>
<td>V, S</td>
<td>87 subjects <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 900 trials</td>
<td>Intensity levels</td>
</tr>
<tr>
<td>AMIGOS [<xref ref-type="bibr" rid="ref-102">102</xref>]</td>
<td>V, S</td>
<td>40 subjects <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 16 videos</td>
<td>Valence, Arousal, Liking</td>
</tr>
<tr>
<td>DREAMER [<xref ref-type="bibr" rid="ref-103">103</xref>]</td>
<td>S</td>
<td>23 subjects <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 18 stimuli</td>
<td>Valence, Arousal, Dominance</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><italic>Evaluation Metrics</italic></p>
<p>The evaluation of MER models depends on whether the task is defined as discrete emotion classification or continuous affect regression. Each formulation requires metrics that capture different properties of emotional expressiveness, class discriminability, and temporal reliability. This subsection formalizes the metrics most widely used in MER research and highlights representative studies that employ these measures.</p>
<p>Discrete emotion recognition is commonly evaluated on datasets such as RAVDESS [<xref ref-type="bibr" rid="ref-83">83</xref>], IEMOCAP [<xref ref-type="bibr" rid="ref-82">82</xref>], CMU-MOSEI [<xref ref-type="bibr" rid="ref-91">91</xref>], and MELD [<xref ref-type="bibr" rid="ref-92">92</xref>]. Accuracy is the simplest and most frequently reported metric, defined as
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mtext>Accuracy</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mi>N</mml:mi></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the number of correctly predicted samples of class <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>i</mml:mi></mml:math></inline-formula>, <italic>C</italic> is the total number of classes, and <italic>N</italic> is the total number of samples. Although used in early multimodal systems [<xref ref-type="bibr" rid="ref-65">65</xref>,<xref ref-type="bibr" rid="ref-71">71</xref>], accuracy is sensitive to class imbalance and therefore insufficient as a primary metric. Recent MER research instead emphasizes the macro-averaged F1-score, which assigns equal weight to each emotion category. For each emotion class <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>i</mml:mi></mml:math></inline-formula>, precision and recall are defined as
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>The class-wise F1-score is then computed as the harmonic mean of precision and recall as
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mrow><mml:mtext>F1</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>Precision</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mtext>Recall</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Macro-F1 treats all emotion classes equally by averaging per-class F1-scores as
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mrow><mml:mtext>Macro</mml:mtext><mml:mo>&#x2013;</mml:mo><mml:mi>F1</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>C</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mtext>F1</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>In contrast, Weighted-F1 adjusts each class contribution based on its sample proportion as
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mrow><mml:mtext>Weighted</mml:mtext><mml:mo>&#x2013;</mml:mo><mml:mi>F1</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>F1</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mi>N</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>N</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is the number of samples in class <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>i</mml:mi></mml:math></inline-formula>. Although informative, Weighted-F1 may obscure poor performance on infrequent emotions because frequent classes dominate the sum, making macro-F1 more suitable for benchmarking.</p>
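<p>The definitions in Eqs. (1)&#x2013;(5) can be checked with a short, dependency-free sketch. The function name <monospace>classification_metrics</monospace> and the toy labels are hypothetical. As an assumption of this sketch, the class set is taken as the union of the labels that occur in the references and in the predictions, and a class with an empty denominator is assigned a score of zero.</p>

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-F1, and weighted-F1 for two label lists."""
    # Assumption: the class set C is the union of observed labels.
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)

    f1, support = {}, {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1[c] = 2 * precision * recall / denom if denom else 0.0
        support[c] = tp + fn  # N_i: number of reference samples in class c

    macro_f1 = sum(f1.values()) / len(classes)
    weighted_f1 = sum(support[c] / n * f1[c] for c in classes)
    return {"accuracy": correct / n, "macro_f1": macro_f1,
            "weighted_f1": weighted_f1}


# Imbalanced toy case: the model always predicts the majority class.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 10
scores = classification_metrics(y_true, y_pred)
```

<p>In this toy case accuracy is 0.80 and Weighted-F1 is about 0.71, while macro-F1 drops to about 0.44 because the minority class is never recognized. This gap is the reason macro-F1 is preferred on imbalanced emotion datasets.</p>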
<p>Continuous affect estimation, used in Aff-Wild2 [<xref ref-type="bibr" rid="ref-89">89</xref>], DEAP [<xref ref-type="bibr" rid="ref-99">99</xref>], MAHNOB-HCI [<xref ref-type="bibr" rid="ref-100">100</xref>], and AMIGOS [<xref ref-type="bibr" rid="ref-102">102</xref>], requires metrics that account for temporal consistency and scale agreement between predicted and annotated emotional trajectories. The Concordance Correlation Coefficient (CCC) is the most widely adopted measure and is defined as
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:mtext>CCC</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>x</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>y</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>&#x03C1;</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the Pearson correlation coefficient, <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:math></inline-formula> are mean values, and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>x</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>y</mml:mi></mml:msub></mml:math></inline-formula> are standard deviations of predictions and ground truth. 
CCC penalizes both scale mismatch and mean shift, making it the official metric in Aff-Wild2 and ABAW challenges [<xref ref-type="bibr" rid="ref-89">89</xref>,<xref ref-type="bibr" rid="ref-104">104</xref>].</p>
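<p>A minimal sketch of Eq. (6) is given below. It uses the identity that the numerator equals twice the covariance of predictions and annotations, and it assumes population (1/N) estimates of variance and covariance; some toolkits use the 1/(N&#x2212;1) form instead. The function name <monospace>ccc</monospace> and the toy trajectories are hypothetical.</p>

```python
def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mu_x = sum(pred) / n
    mu_y = sum(gold) / n
    var_x = sum((x - mu_x) ** 2 for x in pred) / n
    var_y = sum((y - mu_y) ** 2 for y in gold) / n
    # 2 * rho_xy * sigma_x * sigma_y is the same quantity as 2 * cov_xy.
    cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(pred, gold)) / n
    denom = var_x + var_y + (mu_x - mu_y) ** 2
    # CCC is undefined when both sequences are constant and identical.
    return 2 * cov_xy / denom if denom else float("nan")


gold = [0.1, 0.3, 0.5, 0.7]
shifted = [g + 0.5 for g in gold]   # same shape, constant offset
ccc_identical = ccc(gold, gold)
ccc_shifted = ccc(shifted, gold)
```

<p>The shifted trajectory is perfectly correlated with the annotation in the Pearson sense, yet its CCC falls to roughly 0.29 because the mean-shift term enters the denominator.</p>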
<p>Mean squared error (MSE) measures the average squared difference between predicted and reference emotional trajectories. Because residuals are squared before aggregation, large deviations at individual timesteps are penalized more heavily than small ones. It is defined as
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mtext>MSE</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the predicted emotional value at timestep <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>i</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> denotes the corresponding ground-truth annotation, and <italic>N</italic> indicates the total number of temporal samples. MSE highlights pointwise discrepancy and is frequently adopted in continuous valence&#x2013;arousal tasks [<xref ref-type="bibr" rid="ref-40">40</xref>,<xref ref-type="bibr" rid="ref-73">73</xref>].</p>
<p>Root mean squared error (RMSE) expresses prediction discrepancy on the same scale as the target signal, improving interpretability while retaining the sensitivity of squared errors. It is defined as
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:mtext>RMSE</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:msqrt><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt><mml:mo>.</mml:mo></mml:math></disp-formula></p>
<p>Variables follow the same definitions as in MSE, with RMSE providing a scale-aligned measure of overall prediction deviation [<xref ref-type="bibr" rid="ref-74">74</xref>].</p>
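<p>Eqs. (7) and (8) translate directly into code. The sketch below uses hypothetical function names and toy valence values, and it assumes that predictions and annotations are already aligned sample by sample.</p>

```python
import math


def mse(pred, gold):
    """Mean squared error between aligned predictions and annotations."""
    return sum((x - y) ** 2 for x, y in zip(pred, gold)) / len(pred)


def rmse(pred, gold):
    """Root mean squared error, expressed on the scale of the target."""
    return math.sqrt(mse(pred, gold))


pred = [0.2, 0.4, 0.9]
gold = [0.0, 0.4, 0.5]
mse_value = mse(pred, gold)
rmse_value = rmse(pred, gold)
```

<p>Here the squared residuals are 0.04, 0, and 0.16, so the single large error contributes four times as much as the small one. Taking the square root returns the result to valence units.</p>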
<p>Together, these metrics provide a consistent basis for evaluating MER systems across both categorical and continuous affect settings. Macro-F1 has become the predominant metric for discrete emotion classification because it reflects balanced class-wise performance, while CCC remains the standard in continuous emotion regression due to its combined assessment of correlation and agreement. These complementary metrics support reliable comparison across fusion strategies, embedding architectures, and cross-modal interaction designs in MER.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Future Research Directions</title>
<p>Despite significant progress in MER over recent years, current systems still face several foundational challenges that limit their robustness, scalability, and applicability to real-world environments. The core difficulty arises from the inherent heterogeneity of multimodal affective signals, which leads to persistent issues with temporal alignment, modality reliability, and cross-domain generalization. Because visual, acoustic, textual, and physiological cues evolve on different timescales and possess distinct noise characteristics, MER models often struggle to maintain consistent cross-modal interactions, particularly when signals are asynchronous or partially missing. Furthermore, varying levels of sensor noise, occlusion, background interference, and modality imbalance undermine the stability of fusion mechanisms, revealing a gap between controlled experimental settings and deployment scenarios. In addition, emotional expressions differ substantially across speakers, cultures, and situational contexts, creating a strong domain shift that current models are not yet equipped to handle. These challenges collectively highlight the need for new research directions that move beyond incremental architectural modifications toward more adaptive, generalizable, and reliability-aware multimodal learning frameworks. The following subsections outline emerging opportunities and potential pathways for advancing the next generation of MER systems.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Cross-Modal Alignment Challenges</title>
<p>One of the most persistent difficulties in MER lies in aligning heterogeneous signals that differ in temporal scale, structural properties, and semantic density. Visual cues evolve at the frame level, acoustic features exhibit continuous prosodic variation, and textual representations encapsulate discrete linguistic semantics. These discrepancies create a modality alignment gap, in which features extracted from different streams fail to correspond to the same emotional events in terms of time or semantic granularity. When such misalignment accumulates, fusion modules overfit to spurious correlations or suppress informative modality-specific cues, ultimately degrading affective inference.</p>
<p>Several recent studies highlight the consequences of this alignment gap across diverse multimodal settings. Audiovisual models often struggle to synchronize facial movements with rapidly varying speech prosody [<xref ref-type="bibr" rid="ref-68">68</xref>,<xref ref-type="bibr" rid="ref-70">70</xref>], while dialogue-level MER frameworks report instability when textual context shifts more slowly than speech or facial expressions [<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. Physiological&#x2013;visual fusion pipelines further exacerbate this issue, as EEG or rPPG signals operate at substantially higher sampling rates than video frames [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-39">39</xref>]. Collectively, these findings indicate that alignment errors propagate through the network, increasing modality dominance, reducing complementarity, and weakening the reliability of downstream fusion.</p>
<p>To mitigate these effects, MER research has explored alignment-aware representation learning that explicitly constrains temporal or semantic correspondence before fusion. Common strategies include cross-modal attention mechanisms that dynamically match salient segments across modalities [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-60">60</xref>], alignment losses that encourage synchronized embedding trajectories through shared latent spaces [<xref ref-type="bibr" rid="ref-25">25</xref>,<xref ref-type="bibr" rid="ref-61">61</xref>], and hierarchical gating functions that suppress unsynchronized modality features during context modeling [<xref ref-type="bibr" rid="ref-20">20</xref>,<xref ref-type="bibr" rid="ref-55">55</xref>]. Frame-to-token matching and temporal interpolation layers have also been adopted to bridge differences in sampling density between visual and acoustic streams [<xref ref-type="bibr" rid="ref-26">26</xref>]. These approaches collectively indicate a gradual shift toward alignment-preserving fusion pipelines that prioritize correspondence before integration.</p>
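<p>The simplest form of bridging different sampling densities is to resample one stream onto the timestamps of another. The sketch below does this with plain linear interpolation for a scalar acoustic feature. It is not the learned interpolation layer of any cited work; the function name <monospace>resample_linear</monospace>, the 10 ms acoustic hop, and the 25 fps video rate are assumptions chosen for illustration.</p>

```python
from bisect import bisect_right


def resample_linear(times, values, target_times):
    """Linearly interpolate a scalar feature track onto target timestamps.

    times must be strictly increasing. Targets outside the covered range
    are clamped to the nearest endpoint value.
    """
    resampled = []
    for t in target_times:
        if times[0] >= t:
            resampled.append(values[0])
        elif t >= times[-1]:
            resampled.append(values[-1])
        else:
            upper = bisect_right(times, t)   # first sample strictly after t
            lower = upper - 1
            frac = (t - times[lower]) / (times[upper] - times[lower])
            resampled.append(
                values[lower] + frac * (values[upper] - values[lower]))
    return resampled


# Acoustic feature every 10 ms, video frames every 40 ms (25 fps).
audio_times = [i * 0.01 for i in range(13)]
audio_energy = [float(i) for i in range(13)]
video_times = [0.0, 0.04, 0.08, 0.12]
aligned = resample_linear(audio_times, audio_energy, video_times)
```

<p>After resampling, each video frame has exactly one acoustic value, so frame-level fusion can proceed on sequences of equal length. Learned alignment modules replace the fixed interpolation rule with weights estimated from data.</p>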
<p>This alignment challenge extends beyond emotion recognition and arises broadly in multimodal learning scenarios involving heterogeneous and asynchronous data. In domains such as remote sensing and Earth observation, hybrid deep learning models are commonly used to fuse satellite imagery with auxiliary time-series signals, including meteorological or environmental measurements, where spatial observations and temporal dynamics are inherently misaligned [<xref ref-type="bibr" rid="ref-105">105</xref>,<xref ref-type="bibr" rid="ref-106">106</xref>]. Similar to MER, these frameworks integrate convolutional encoders for visual streams with recurrent or transformer-based modules for temporal signals [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-6">6</xref>], enabling intermediate-level alignment that preserves modality-specific structure while learning shared representations [<xref ref-type="bibr" rid="ref-1">1</xref>]. This parallel highlights that alignment-aware hybrid fusion constitutes a general architectural response to heterogeneous data integration rather than a domain-specific solution confined to affective computing.</p>
<p>Despite these advances, achieving consistent, fine-grained alignment remains an open challenge. High-frequency transitions in speech, rapid head movements, background noise, and variable speaking styles introduce non-stationarity, complicating temporal matching. Moreover, current alignment methods often rely on fixed attention windows or pairwise matching, limiting their ability to model long-range emotional dependencies. Future progress requires scalable alignment mechanisms that integrate temporal dynamics, semantic abstraction, and uncertainty modeling into a unified framework, enabling multimodal systems to capture emotional correspondence with higher precision and stability.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Modality Reliability and Robust Fusion</title>
<p>In real-world MER, modalities are often missing or degraded due to environmental noise, occlusion, motion blur, sensor artifacts, or transcription errors. Audio streams may suffer from low signal-to-noise ratios or overlapping speakers, visual frames are often affected by rapid head movements or lighting changes, and textual transcripts derived from automatic speech recognition can contain alignment errors. Physiological signals, such as EEG and rPPG, also fluctuate significantly across sessions and subjects. These inconsistencies lead to modality imbalance, where one modality becomes unreliable or unavailable during inference, resulting in a substantial drop in recognition performance. As MER systems increasingly move toward unconstrained, in-the-wild deployment, developing architectures that maintain stable performance under partial or noisy multimodal input has become an essential research direction.</p>
<p>To address this challenge, several frameworks incorporate mechanisms for learning robust representations that remain informative even when certain modalities are incomplete or corrupted. Modality-invariant feature learning, exemplified by IF-MMIN, improves robustness by aligning distributional characteristics across modalities and generating plausible latent representations when specific inputs are missing [<xref ref-type="bibr" rid="ref-25">25</xref>]. Graph-based models such as MGAFR refine cross-modal correlations using multiplex graph aggregation and contrastive refinement, offering stable performance in unsupervised and incomplete-modality scenarios [<xref ref-type="bibr" rid="ref-51">51</xref>]. Diffusion-driven restoration approaches, including RMER-DT, reconstruct missing modalities through denoising-based generation before applying hierarchical transformer fusion [<xref ref-type="bibr" rid="ref-44">44</xref>]. In physiological&#x2013;behavioral settings, CMSLNet integrates adaptive consistency metrics and joint attention to achieve cross-subject robustness with heterogeneous EEG and eye-tracking signals [<xref ref-type="bibr" rid="ref-40">40</xref>]. Reliability-aware fusion mechanisms have also emerged; GMA introduces gated cross-modal enhancement to filter redundant signals [<xref ref-type="bibr" rid="ref-55">55</xref>], and CDaT dynamically adjusts the contribution of each modality by transferring information from more reliable streams to less reliable ones [<xref ref-type="bibr" rid="ref-41">41</xref>]. These approaches collectively demonstrate the increasing importance of modeling modality confidence, cross-modal redundancy, and latent restoration.</p>
<p>Despite these advances, achieving high robustness under severe modality degradation remains challenging. Generating latent substitutes for missing modalities introduces the risk of over-smoothing or hallucinated emotional cues, especially when the remaining modalities provide limited contextual evidence. Reliability estimation often fluctuates across speakers, domains, and recording conditions, leading to unstable weighting during fusion. Physiological signals are susceptible to noise and personalization effects, making it challenging to generalize across subjects even when adaptive normalization is applied. Furthermore, most existing methods assume that at least one high-quality modality is available during inference; handling situations where all modalities are partially corrupted remains difficult. These limitations highlight the need for more principled formulations of modality uncertainty and cross-modal redundancy.</p>
<p>Future research may address these issues by developing generative reconstruction frameworks that integrate diffusion models, masked autoencoding, or modality-specific priors to restore incomplete inputs more faithfully. Another promising direction involves self-supervised reliability estimation, where modality confidence is learned without explicit labels by predicting consistency across temporal neighborhoods or cross-modal agreement. Cross-modal redundancy learning, in which the model identifies semantically shared information across modalities, can further reduce dependence on any single stream. Large multimodal language models may also support self-correction through internally generated descriptions or synthetic complementary signals. More comprehensive uncertainty modeling, spanning epistemic, aleatoric, and cross-modal uncertainty, can guide adaptive fusion policies that generalize across diverse real-world settings. These directions collectively represent an important step toward building MER systems capable of maintaining stable performance in incomplete, noisy, and operationally unconstrained environments.</p>
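<p>To make the distinction between aleatoric and epistemic uncertainty concrete, the sketch below applies one common ensemble-based decomposition: the entropy of the averaged prediction is treated as total uncertainty, the average per-member entropy as the aleatoric part, and their difference as the epistemic part. This is an illustrative formulation rather than a method proposed in the surveyed works; the function names and the use of an ensemble of class-probability vectors are assumptions, and cross-modal uncertainty is not modeled here.</p>
<preformat>
```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one sample into two parts.

    member_probs: list of class-probability lists, one per ensemble member
        (hypothetical input; could also come from repeated stochastic passes).
    Returns (total, aleatoric, epistemic), where
        total     = entropy of the mean prediction,
        aleatoric = mean of the per-member entropies,
        epistemic = total - aleatoric (disagreement between members).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [
        sum(p[c] for p in member_probs) / n_members
        for c in range(n_classes)
    ]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in member_probs) / n_members
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


if __name__ == "__main__":
    # Members that disagree confidently versus members that agree on a
    # maximally ambiguous prediction.
    print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
    print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```
</preformat>
<p>A fusion policy could, for example, down-weight a modality whose epistemic term is high (the model has not seen similar inputs) differently from one whose aleatoric term is high (the signal itself is ambiguous), although how best to do so remains an open design question.</p>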
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Generalization across Speakers and Environments</title>
<p>MER models frequently experience substantial performance degradation when deployed across speakers, cultures, and recording environments that differ from their training conditions. Variations in age, gender, speaking style, facial morphology, linguistic background, and sensor quality create distribution shifts that challenge conventional supervised learning pipelines. Even models trained on large-scale in-the-wild corpora often overfit to dataset-specific affective cues, limiting their generalization to new domains such as clinical interviews, automotive cabins, call-center conversations, or multi-speaker social interactions. These issues become more pronounced when modalities exhibit unequal robustness across domains&#x2014;for instance, when visual cues degrade under low-light conditions or when acoustic features vary due to accent, microphone quality, and environmental noise.</p>
<p>Recent studies have sought to address these limitations through domain generalization and cross-subject adaptation strategies. Adversarial domain alignment has been employed to learn domain-invariant affective representations, for example, in cross-corpus speech emotion recognition and cross-subject EEG&#x2013;video fusion models for emotion recognition [<xref ref-type="bibr" rid="ref-107">107</xref>,<xref ref-type="bibr" rid="ref-108">108</xref>]. Meta-learning has also been explored for rapid adaptation to unseen domains in text and speech emotion classification, improving generalization across datasets and label distributions [<xref ref-type="bibr" rid="ref-109">109</xref>]. Cross-cultural continuous emotion recognition frameworks further highlight the need for culture-aware normalization and calibration when transferring models across populations [<xref ref-type="bibr" rid="ref-110">110</xref>]. In parallel, GAN-based data augmentation has been used to synthesize additional emotional samples and increase robustness to recording conditions [<xref ref-type="bibr" rid="ref-111">111</xref>]. Alongside these modeling techniques, zero-shot and cross-domain evaluation protocols have emerged in large-scale multimodal LLM-based systems, where models such as Emotion-LLaMA are tested on unseen corpora without task-specific fine-tuning [<xref ref-type="bibr" rid="ref-29">29</xref>]. Although these approaches collectively reduce sensitivity to domain-specific biases, substantial gaps persist between in-domain and zero-shot performance, especially under significant demographic or environmental shifts.</p>
<p>Promising research directions include developing pretraining pipelines compatible with emerging foundation models and drawing on large-scale, multilingual, and multicultural affective corpora to capture broad emotional variability. Another direction is the integration of parameter-efficient adaptation modules such as low-rank adaptation [<xref ref-type="bibr" rid="ref-112">112</xref>] and prompt-based modulation, which enable general-purpose multimodal encoders to specialize in emotion-related reasoning without extensive retraining. Expanding the coverage of speaker and environment diversity through weak supervision or pseudo-labeled affective data also offers a practical path toward more robust generalization. In addition, developing context-aware representations that capture environmental factors such as lighting conditions and background noise, together with social interaction patterns, can reduce a model&#x2019;s reliance on identity-specific cues and encourage the extraction of stable, transferable emotional signals. These advances point toward a future in which MER systems achieve reliable performance across diverse demographic, linguistic, and situational contexts and are therefore suitable for deployment in real-world environments.</p>
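<p>The low-rank adaptation idea mentioned above can be sketched in a few lines: a frozen weight matrix W is augmented with a trainable product of two small matrices, giving an effective weight W + (alpha / r) B A, where r is the rank. The pure-Python code below is a minimal sketch under stated assumptions; the helper names, the list-of-lists matrix format, and the example dimensions are hypothetical, and a practical system would rely on a deep learning framework instead.</p>
<preformat>
```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    inner = len(right)
    cols = len(right[0])
    return [
        [sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
        for row in left
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) without modifying W.

    w: frozen pretrained weight, shape d_out x d_in.
    a: trainable down-projection, shape r x d_in.
    b: trainable up-projection, shape d_out x r.
    alpha: scaling hyperparameter; r is inferred from the rows of a.
    """
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable values in A and B for one adapted layer."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]        # r = 1, d_in = 2
    b = [[1.0], [0.0]]      # d_out = 2, r = 1
    print(lora_effective_weight(w, a, b, alpha=2.0))
    # Adapter size for a hypothetical 768 x 768 layer at rank 8,
    # compared with the size of the full weight matrix.
    print(lora_trainable_params(768, 768, 8), 768 * 768)
```
</preformat>
<p>When B is initialized to zeros, the effective weight equals the pretrained weight, so adaptation starts from the behavior of the original encoder and only the small matrices A and B need to be stored per target domain.</p>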
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Conclusion</title>
<p>MER has evolved into a central research field within human-centered artificial intelligence, driven by the need to understand complex affective cues that span visual, acoustic, textual, and physiological modalities. This survey consolidated recent advances across feature representation, fusion methodologies, and dataset development, highlighting how multimodal architectures have transitioned from early concatenation-based frameworks to hybrid fusion strategies that leverage cross-modal attention, contrastive alignment, and transformer-based interaction modeling. The review further demonstrated that the rapid growth of pre-trained encoders and large-scale multimodal datasets has substantially expanded the design space of affective modeling, enabling more expressive and context-aware emotional representations.</p>
<p>Despite these advances, significant challenges remain, including asynchronous modality alignment, incomplete or noisy signals, domain and demographic generalization, and the integration of large multimodal foundation models. The future research directions discussed in this survey emphasize scalable pretraining pipelines, parameter-efficient adaptation, context-aware representation learning, and robust multimodal synchronization frameworks. Collectively, these developments suggest a clear trajectory toward unified, adaptive, and deployment-ready emotion recognition models that operate reliably across diverse environments.</p>
</sec>
</body>
<back>
<ack>
<p>The authors acknowledge that this research was conducted in connection with IITP-funded projects supported by the Korea government (MSIT) (Grant Nos. RS-2021-II211341 and 2021-0-00766).</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>This work was supported by the Institute of Information &#x0026; Communications Technology Planning &#x0026; Evaluation grant funded by the Korea government (MSIT) (No. RS-2021-II211341, AI Graduate School Support Program, Chung-Ang University), and in part by the Institute of Information &#x0026; Communications Technology Planning &#x0026; Evaluation grant funded by the Korea government (MSIT) (Development of Integrated Development Framework that Supports Automatic Neural Network Generation and Deployment Optimized for Runtime Environment, Grant No. 2021-0-00766).</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm their contributions to the paper as follows: A-Seong Moon contributed to the conceptualization, validation, formal analysis, project administration, original draft preparation, and review and editing, and secured key resources for the study. Haesung Kim contributed to the methodology design, data curation, and investigation, and assisted in the preparation of the original manuscript draft. Ye-Chan Park contributed to the validation and visualization and participated in the review and editing of the manuscript. Jaesung Lee supervised the overall research process and contributed to the conceptualization and project administration. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<app-group id="appg-1">
<app id="app-1">
<title>Appendix A Summary of Recent MER Studies</title>
<p>This appendix provides a comprehensive summary table of representative multimodal emotion recognition studies published between 2021 and 2025.</p>
<table-wrap id="table-3">
<label>Table A1</label>
<caption>
<title>Summary of representative MER studies published from 2021 to 2025.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Ref.</th>
<th>Year</th>
<th>Modality</th>
<th>Approach</th>
<th>Feature Extraction</th>
<th>Datasets</th>
<th>Contribution</th>
</tr>
</thead>
<tbody>
<tr>
<td>[<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>2025</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: HuBERT<break/>T: BERT</td>
<td>IEMOCAP, ESD, MELD</td>
<td>Introduces a cross-modal transformer fusion that strengthens audio&#x2013;text alignment and improves conversational MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: CMU-SDK<break/>A: CMU-SDK<break/>T: BERT</td>
<td>CMU-MOSEI</td>
<td>Applies XAI-based feature selection to identify influential multimodal cues, improving interpretability in MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Combines diffusion-based restoration and hierarchical Transformers to improve robustness in conversational MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: 3D-CNN<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Examines MER through graph spectrum theory to enhance high-frequency affective signal preservation.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Late Fusion</td>
<td>V: ResNet-50<break/>A: Semi-CNN<break/>T: DeBERTa</td>
<td>BAUM-1, SAVEE</td>
<td>Integrates text, speech, face, and motion signals using modular late fusion for comprehensive MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-48">48</xref>]</td>
<td>2025</td>
<td>V, S</td>
<td>Early Fusion</td>
<td>V: MTCNN<break/>S: pre-processing</td>
<td>DEAP</td>
<td>Combines EEG and facial expressions to capture both physiological and observable cues for emotion discrimination.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>2025</td>
<td>A, T</td>
<td>Late Fusion</td>
<td>A: CARE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD, CMU-MOSI</td>
<td>Uses LLM-supervised pretraining to improve semantic alignment and contextual emotion understanding.</td>
</tr> 
<tr>
<td>[<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: MA-Net<break/>A: Wav2vec<break/>T: DeBERTa</td>
<td>CMU-MOSI, CMU-MOSEI</td>
<td>Introduces multiplex graph aggregation to address incomplete modalities and refine multimodal feature relations.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-79">79</xref>]</td>
<td>2025</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: 1D-CNN<break/>S: 1D-CNN</td>
<td>RECOLA, ULM-TSST</td>
<td>Proposes dual-view disentanglement and hierarchical reconstruction to improve robustness under modality noise.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-113">113</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Late Fusion</td>
<td>V: OpenFace<break/>A: COVAREP<break/>T: T5</td>
<td>Aff-Wild2</td>
<td>Develops a dynamic-scene-aware MER framework robust to real-world variations and complex environments.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-76">76</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: ViT<break/>A: GNN<break/>T: CapsNet</td>
<td>MELD, CMU-MOSEI</td>
<td>Employs capsule&#x2013;graph Transformer architecture to enhance relational modeling of multimodal affective cues.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>2025</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Wav2vec2<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Introduces multi-granularity cross-modal alignment for better synchronization of emotional cues.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-52">52</xref>]</td>
<td>2025</td>
<td>V, A, T, S</td>
<td>Hybrid Fusion</td>
<td>V: CLIP<break/>A: Wav2vec2<break/>T: CLIP<break/>S: ViT</td>
<td>MELD</td>
<td>Uses spiking Transformers to improve robustness against adversarial attacks in multimodal emotion recognition.</td>
</tr> 
<tr>
<td>[<xref ref-type="bibr" rid="ref-55">55</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Applies cross-modal gating to enhance interaction strength and refine multimodal feature learning.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-54">54</xref>]</td>
<td>2025</td>
<td>V, A, T, S</td>
<td>Late Fusion</td>
<td>V, A, T, S: Transformer</td>
<td>ASCERTAIN, KEMDy20</td>
<td>Introduces a gated mixture-of-experts architecture with cross-attention for multimodal and sensor-based MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>2025</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V, A, T: Video-LLaMA</td>
<td>MELD</td>
<td>Introduces an instruction-tuned multimodal LLM that enhances conversational emotion reasoning and fusion robustness.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>2025</td>
<td>V, A</td>
<td>Late Fusion</td>
<td>V: MTCNN, EfficientNet<break/>A: Wav2vec2</td>
<td>CREMA-D, eNTERFACE</td>
<td>Develops an efficient audiovisual pipeline that leverages deep speech and facial encoders for robust MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-64">64</xref>]</td>
<td>2025</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: VGG<break/>A: X-vector</td>
<td>RAVDESS, SAVEE, CREMA-D</td>
<td>Combines audiovisual temporal modeling with attention-based fusion to capture dynamic emotional transitions.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: ViT<break/>A: HuBERT<break/>T: LLaMA2</td>
<td>MERR, MER2023, MER2024, DFEW, EMER</td>
<td>Proposes an instruction-tuned multimodal LLaMA that unifies perception and reasoning for large-scale MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-77">77</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: CNN<break/>A: 1D Conv<break/>T: Bi-LSTM</td>
<td>MELD, IEMOCAP</td>
<td>Addresses class imbalance using deep imbalanced learning and cross-modal integration for conversational MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-65">65</xref>]</td>
<td>2024</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: CNN<break/>A: CNN</td>
<td>BAUM-1, RAVDESS</td>
<td>Evaluates multiple fusion schemes and shows that combined CNN-based audiovisual cues improve affective recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>2024</td>
<td>A, T</td>
<td>Early Fusion</td>
<td>A: CNN<break/>T: BERT</td>
<td>MELD, CMU-MOSEI</td>
<td>Incorporates attention-enhanced BERT&#x2013;CNN fusion to improve linguistic&#x2013;acoustic alignment in MER tasks.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: 3D-CNN<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Introduces masked graph learning with recurrent alignment for stable cross-modal fusion in dialogue MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-66">66</xref>]</td>
<td>2024</td>
<td>V, A</td>
<td>Early Fusion</td>
<td>V: Py-Feat<break/>A: openSMILE</td>
<td>RAVDESS, BAUM-1</td>
<td>Leverages LLM-based fusion of handcrafted visual and acoustic cues to enhance multimodal representation learning.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Late Fusion</td>
<td>V: CNN<break/>A: 1D CNN<break/>T: fastText</td>
<td>CMU-MOSEI</td>
<td>Designs a lightweight late-fusion pipeline optimized for real-time MER with competitive performance.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V, A, T: MLP</td>
<td>CMU-MOSI, CMU-MOSEI, CH-SIMS</td>
<td>Presents a token-disentangling transformer that isolates modality-specific and shared cues for improved MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: 3D-CNN<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Applies adversarial alignment and information bottleneck fusion to improve robustness in conversational MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-114">114</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early, Late Fusion</td>
<td>V: Py-Feat<break/>A: Wav2vec2<break/>T: Whisper</td>
<td>C-EXPR-DB, MELD</td>
<td>Compares textualized and feature-based MER, showing advantages of converting modalities into rich text embeddings.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td>2024</td>
<td>V, S</td>
<td>Early Fusion</td>
<td>V: DSCNN<break/>S: Bi-LSTM</td>
<td>BioVid Emo DB</td>
<td>Introduces a healthcare-oriented MER framework integrating physiological and video cues with model-level fusion.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-81">81</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Late Fusion</td>
<td>V: DenseNet<break/>A: openSMILE, COVAREP<break/>T: BERT</td>
<td>IEMOCAP, MSP-IMPROV, CMU-MOSI</td>
<td>Uses contrastive learning to acquire modality-invariant features for missing-modality MER scenarios.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-75">75</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V, A: Transformer<break/>T: BLIP</td>
<td>MET-Meme, MemeCap</td>
<td>Introduces metaphor-aware alignment using context disentangling to improve robustness in multimodal emotion recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>2024</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V, S: MLP</td>
<td>SEED-IV, SEED-V</td>
<td>Presents a multisource learning framework that enhances cross-subject stability through comprehensive modal integration.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-78">78</xref>]</td>
<td>2024</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Wav2vec2<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Develops fine-grained disentangled representation learning to better separate emotional factors across modalities.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>2024</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: VGG<break/>A: VGG</td>
<td>IIT-R SIER</td>
<td>Proposes an interpretable fusion pipeline combining visual and speech cues for transparent multimodal emotion analysis.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>2024</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: HuBERT<break/>T: KoELECTRA</td>
<td>KER (AI-HUB)</td>
<td>Integrates Korean speech and text encoders with a multimodal transformer for culturally aligned MER modeling.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Introduces persona-infused graph learning to capture emotion shifts and conversational structure across modalities.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-67">67</xref>]</td>
<td>2024</td>
<td>V, A</td>
<td>Late Fusion</td>
<td>V, A: CNN</td>
<td>CREMA-D, RAVDESS</td>
<td>Uses uncertainty-based weighting to optimize a lightweight audiovisual model for efficient emotion recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>2024</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: ResNet<break/>S: MTTS-CAN</td>
<td>MAHNOB-HCI</td>
<td>Combines facial expression cues with rPPG signals to build an end-to-end physiological&#x2013;visual emotion recognizer.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: DenseNet, SVM<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD</td>
<td>Proposes calibration for conversational MER through early fusion and modality-specific reliability adjustments.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-74">74</xref>]</td>
<td>2024</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: ResNet<break/>S: DGC</td>
<td>DEAP, SEED-IV</td>
<td>Applies dense GCN with joint cross-attention to enhance emotional state inference from video and physiological signals.</td>
</tr> 
<tr>
<td>[<xref ref-type="bibr" rid="ref-115">115</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: 3D-CNN<break/>A: openSMILE<break/>T: TextCNN</td>
<td>IEMOCAP, MELD</td>
<td>Incorporates speaker-aware cognitive modeling with cross-modal attention to better capture dialogue-based emotions.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: LSTM<break/>A: LSTM<break/>T: BERT</td>
<td>CMU-MOSEI, IEMOCAP</td>
<td>Introduces dynamic transfer learning across modalities to enhance generalization in multimodal emotion tasks.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: FabNet<break/>A: Wav-RoBERTa<break/>T: RoBERTa</td>
<td>MELD, CMU-MOSI, CMU-MOSEI</td>
<td>Presents sparse interactive attention enabling effective integration of fine-grained cues from all modalities.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP, MELD, DailyDialog</td>
<td>Models relational subgraph interactions to capture emotion-dependent patterns across diverse modalities.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-116">116</xref>]</td>
<td>2024</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: DenseNet, 3D-CNN<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>IEMOCAP</td>
<td>Introduces reinforcement learning&#x2013;based optimization to refine early-fusion representations for MER tasks.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-73">73</xref>]</td>
<td>2023</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: DeepVANet<break/>S: CNN</td>
<td>DEAP, MAHNOB-HCI</td>
<td>Combines EEG and facial expression cues with hybrid fusion to enhance emotional state estimation in multimodal settings.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-117">117</xref>]</td>
<td>2023</td>
<td>V, A, S</td>
<td>Late Fusion</td>
<td>V: GhostNet<break/>A: LFCNN<break/>S: tLSTM</td>
<td>CK&#x002B;, EMO-DB, MAHNOB-HCI</td>
<td>Integrates facial, speech, and EEG modalities via late fusion to improve multimodal affect recognition robustness.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>MELD, IEMOCAP</td>
<td>Revisits disentanglement of modality and context to refine conversational MER with enhanced fusion strategies.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V, A: CNN, Transformer<break/>T: ALBERT</td>
<td>IEMOCAP, CMU-MOSEI</td>
<td>Proposes transformer-based early fusion with emotion-level representation learning for improved multi-label MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Bi-LSTM<break/>T: BERT</td>
<td>MELD, IEMOCAP</td>
<td>Uses hybrid attention networks to effectively integrate audio and text cues for conversational emotion analysis.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: RoBERTa</td>
<td>MELD, IEMOCAP</td>
<td>Introduces a transformer model with self-distillation to strengthen multimodal learning in dialogue contexts.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td>2023</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: 3D ResNet<break/>A: ResNet</td>
<td>CREMA-D, RAVDESS</td>
<td>Employs cross-modal audio&#x2013;video fusion with attention and metric learning to enhance emotion classification.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-22">22</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: VGG<break/>A: CNN<break/>T: DialogXL</td>
<td>IEMOCAP</td>
<td>Provides a unified evaluation framework for comparing multimodal fusion techniques across emotion tasks.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: openSMILE<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Learns modality-invariant representations to achieve robust emotion recognition under missing modality conditions.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: expMAE<break/>A: HuBERT<break/>T: MacBERT</td>
<td>MER-SEMI</td>
<td>Introduces semi-supervised learning with expression MAE and multi-modal pseudo-labeling to enhance MER accuracy.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>2023</td>
<td>V, A</td>
<td>Early Fusion</td>
<td>V: ResNet<break/>A: Transformer</td>
<td>RAVDESS, eNTERFACE</td>
<td>Combines facial and speech features through early fusion to improve emotion prediction in audiovisual settings.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Late Fusion</td>
<td>A: HuBERT, ECAPA<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Proposes TDFNet for deep-scale fusion of audio and text signals using transformer modules for emotion inference.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-118">118</xref>]</td>
<td>2023</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: ResNet<break/>A: HuBERT<break/>T: MacBERT</td>
<td>MER2023-SEMI</td>
<td>Uses class-balanced pseudo-labeling to enhance multimodal learning under limited supervision conditions.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td>2023</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: EmoFAN<break/>A: VGGish</td>
<td>AVEC, CMU-MOSEI, IEMOCAP</td>
<td>Introduces calibrated ordinal latent distribution fusion to model uncertainty in multimodal emotion recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-60">60</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Wav2vec2<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Improves fusion of Wav2vec2 and BERT by incorporating auxiliary tasks that strengthen cross-modal alignment.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-61">61</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Wav2vec2<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Introduces a Bayesian co-attention mechanism that leverages knowledge-aware priors to strengthen audio&#x2013;text fusion for MER.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-119">119</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: Wav2vec2<break/>T: FlauBERT</td>
<td>CEMO</td>
<td>Analyzes attention mechanisms for emotion detection in emergency-call speech by modeling cross-modal audio&#x2013;text dependencies.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
<td>2023</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: CNN, BiGRU<break/>T: BiGRU</td>
<td>IEMOCAP</td>
<td>Uses deep temporal modeling and cross-modal transformers to capture synchronized emotional cues from audio and text streams.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-68">68</xref>]</td>
<td>2022</td>
<td>V, A</td>
<td>Late Fusion</td>
<td>V: DSN, DTN<break/>A: CNN</td>
<td>SAVEE, RAVDESS, RML</td>
<td>Proposes a spatio-temporal CNN framework integrating facial and vocal features through late fusion for emotion recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td>2022</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: OpenFace, Facet<break/>A: COVAREP<break/>T: BERT</td>
<td>CMU-MOSI, CMU-MOSEI, UR-FUNNY</td>
<td>Disentangles modality-specific and contextual features to improve robustness and fusion effectiveness in conversational MER.</td>
</tr> 
<tr>
<td>[<xref ref-type="bibr" rid="ref-69">69</xref>]</td>
<td>2022</td>
<td>V, A</td>
<td>Late Fusion</td>
<td>V, A: CNN</td>
<td>RAVDESS, SAVEE</td>
<td>Applies model-level late fusion using CNN-based visual and acoustic encoders to improve audiovisual emotion classification.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-62">62</xref>]</td>
<td>2022</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: 3D CNN<break/>A: CNN-GRU<break/>T: Bi-LSTM</td>
<td>IEMOCAP</td>
<td>Integrates speech, video, and MoCAP cues using hybrid fusion to model fine-grained temporal dynamics in emotional behavior.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
<td>2022</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: 3D CNN<break/>A: openSMILE<break/>T: CNN</td>
<td>IEMOCAP, AVEC, MELD</td>
<td>Models hierarchical uncertainty across modalities to improve robustness of multimodal emotion understanding in conversations.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-120">120</xref>]</td>
<td>2022</td>
<td>V, A</td>
<td>Early Fusion</td>
<td>V, A: -</td>
<td>SAVEE, eNTERFACE</td>
<td>Uses K-means-guided kernel CCA to enhance early-fusion alignment in human&#x2013;robot interaction emotion recognition tasks.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td>2022</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: DenseNet<break/>A: Wav2vec2<break/>T: BERT</td>
<td>IEMOCAP, MSP-IMPROV</td>
<td>Introduces a prompt-based pretraining strategy that enhances multimodal alignment for improved downstream MER performance.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-16">16</xref>]</td>
<td>2022</td>
<td>V, S</td>
<td>Late Fusion</td>
<td>V: ResNet<break/>S: CNN</td>
<td>BioVid Emo DB, CIFE</td>
<td>Combines facial features and affective biomarkers to support emotion-state prediction in smart-industry health analytics.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-80">80</xref>]</td>
<td>2022</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: ResNet<break/>A: CNN<break/>T: Word2Vec</td>
<td>IEMOCAP, MSP-IMPROV</td>
<td>Employs a weight-sharing autoencoder with multi-head attention to enhance multimodal feature refinement and fusion.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>2022</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: CNN<break/>S: 1D CNN</td>
<td>In-house database</td>
<td>Combines facial cues with wireless sensing signals to provide a multimodal perspective for emotion detection applications.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>2022</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: R(2&#x002B;1)D, ST-GCN<break/>A: TCN<break/>T: Transformer</td>
<td>RAVDESS, CMU-MOSEI</td>
<td>Uses pairwise contrastive loss to enforce modality consistency and strengthen cross-modal representation learning.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td>2022</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>V: FAN<break/>A: PANNs, LEAF<break/>T: Transformer</td>
<td>IEMOCAP, CMU-MOSEI</td>
<td>Explores cross-modal translation to leverage additional datasets, improving generalization in audio&#x2013;text MER settings.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>2021</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: Facet<break/>A: COVAREP<break/>T: GloVe</td>
<td>CMU-MOSI, CMU-MOSEI, IEMOCAP</td>
<td>Introduces progressive reinforcement to handle unaligned streams and enhance multimodal consistency in emotional cues.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-72">72</xref>]</td>
<td>2021</td>
<td>V, S</td>
<td>Hybrid Fusion</td>
<td>V: CNN<break/>S: SVM</td>
<td>FER2013, SEED-IV</td>
<td>Combines facial features with EEG-based affective signals to improve robustness in multimodal emotion classification.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-70">70</xref>]</td>
<td>2021</td>
<td>V, A</td>
<td>Late Fusion</td>
<td>V: OpenFace<break/>A: Wav2vec2</td>
<td>RAVDESS</td>
<td>Applies aural transformers and facial action units with late fusion to improve audiovisual emotion detection accuracy.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td>2021</td>
<td>V, A, T</td>
<td>Late Fusion</td>
<td>V: FaceNet, GRU<break/>A: WaveRNN<break/>T: GPT</td>
<td>MELD</td>
<td>Uses transformer-based cross-modal fusion to achieve robust conversational MER under complex emotional variability.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-63">63</xref>]</td>
<td>2021</td>
<td>A, T</td>
<td>Hybrid Fusion</td>
<td>A: log-MFB<break/>T: BERT</td>
<td>IEMOCAP</td>
<td>Models temporal&#x2013;semantic consistency to enhance alignment between acoustic changes and text-based emotional cues.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-71">71</xref>]</td>
<td>2021</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: STN<break/>A: CNN</td>
<td>RAVDESS</td>
<td>Uses transfer learning pipelines for audiovisual analysis, improving generalization in limited-size emotional datasets.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>2021</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V: VGG<break/>A: VGGish, SoundNet<break/>T: BERT</td>
<td>CMU-MOSI, CMU-MOSEI, IEMOCAP</td>
<td>Unifies heterogeneous features using BERT-driven fusion, enhancing representation quality across three modalities.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-121">121</xref>]</td>
<td>2021</td>
<td>V, T</td>
<td>Hybrid Fusion</td>
<td>V: ResNet<break/>T: CNN, Bi-GRU</td>
<td>WikiArt Emotions</td>
<td>Introduces sequential co-attention to analyze emotional signals in artworks by combining visual and semantic cues.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-122">122</xref>]</td>
<td>2021</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: VGG<break/>A: CNN</td>
<td>eNTERFACE</td>
<td>Uses capsule graph convolution to refine audiovisual features and strengthen multimodal emotion representation.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>2021</td>
<td>V, A, T</td>
<td>Hybrid Fusion</td>
<td>V, A, T: Bi-GRU</td>
<td>IEMOCAP, CMU-MOSI</td>
<td>Applies deep CCA-based fusion to extract correlated multimodal features and improve stability in emotional predictions.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-123">123</xref>]</td>
<td>2021</td>
<td>V, A</td>
<td>Hybrid Fusion</td>
<td>V: SIFT, CNN<break/>A: PyAudio</td>
<td>Friends dataset</td>
<td>Integrates handcrafted visual features and speech descriptors using deep belief networks for emotion recognition.</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-124">124</xref>]</td>
<td>2021</td>
<td>V, A, T</td>
<td>Early Fusion</td>
<td>V: FEA<break/>A: AcousEmo<break/>T: CrystalFeel</td>
<td>OMG Emotion</td>
<td>Analyzes contextual and environmental factors in video-based MER, emphasizing early fusion of multimodal cues.</td>
</tr>
</tbody>
</table>
</table-wrap>
</app>
</app-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Mi</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>T</given-names></string-name></person-group>. <article-title>A comprehensive review of multimodal emotion recognition: techniques, challenges, and future directions</article-title>. <source>Biomimetics</source>. <year>2025</year>;<volume>10</volume>(<issue>7</issue>):<fpage>418</fpage>. doi:<pub-id pub-id-type="doi">10.3390/biomimetics10070418</pub-id>; <pub-id pub-id-type="pmid">40710231</pub-id></mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Islam</surname> <given-names>MM</given-names></string-name>, <string-name><surname>Nooruddin</surname> <given-names>S</given-names></string-name>, <string-name><surname>Karray</surname> <given-names>F</given-names></string-name>, <string-name><surname>Muhammad</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Enhanced multimodal emotion recognition in healthcare analytics: a deep learning based model-level fusion approach</article-title>. <source>Biomed Signal Process Control</source>. <year>2024</year>;<volume>94</volume>(<issue>1</issue>):<fpage>106241</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.bspc.2024.106241</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Realization of self-adaptive higher teaching management based upon expression and speech multimodal emotion recognition</article-title>. <source>Front Psychol</source>. <year>2022</year>;<volume>13</volume>:<fpage>857924</fpage>. doi:<pub-id pub-id-type="doi">10.3389/fpsyg.2022.857924</pub-id>; <pub-id pub-id-type="pmid">35418897</pub-id></mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Emotion detection and face recognition of drivers in autonomous vehicles in IoT platform</article-title>. <source>Image Vis Comput</source>. <year>2022</year>;<volume>128</volume>:<fpage>104569</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.imavis.2022.104569</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ramaswamy</surname> <given-names>MPA</given-names></string-name>, <string-name><surname>Palaniswamy</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition: a comprehensive review, trends, and challenges</article-title>. <source>Wiley Interdiscip Rev Data Min Knowl Discov</source>. <year>2024</year>;<volume>14</volume>(<issue>6</issue>):<fpage>e1563</fpage>. doi:<pub-id pub-id-type="doi">10.1002/widm.1563</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kalateh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Estrada-Jimenez</surname> <given-names>LA</given-names></string-name>, <string-name><surname>Nikghadam-Hojjati</surname> <given-names>S</given-names></string-name>, <string-name><surname>Barata</surname> <given-names>J</given-names></string-name></person-group>. <article-title>A systematic review on multimodal emotion recognition: building blocks, current state, applications, and challenges</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>103976</fpage>&#x2013;<lpage>4019</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3430850</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chang</surname> <given-names>KC</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>SQ</given-names></string-name></person-group>. <article-title>A survey of multimodal emotion recognition: fusion techniques, datasets, challenges and future directions</article-title>. <source>Int J Biomet</source>. <year>2025</year>;<volume>17</volume>(<issue>5</issue>):<fpage>485</fpage>&#x2013;<lpage>510</lpage>. doi:<pub-id pub-id-type="doi">10.1504/ijbm.2025.148281</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lv</surname> <given-names>F</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>L</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 19&#x2013;25</conf-name>; <publisher-loc>Virtual</publisher-loc>. p. <fpage>2554</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00258</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xie</surname> <given-names>B</given-names></string-name>, <string-name><surname>Sidulova</surname> <given-names>M</given-names></string-name>, <string-name><surname>Park</surname> <given-names>CH</given-names></string-name></person-group>. <article-title>Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion</article-title>. <source>Sensors</source>. <year>2021</year>;<volume>21</volume>(<issue>14</issue>):<fpage>4913</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s21144913</pub-id>; <pub-id pub-id-type="pmid">34300651</pub-id></mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Han</surname> <given-names>DK</given-names></string-name>, <string-name><surname>Ko</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification</article-title>. <source>IEEE Access</source>. <year>2021</year>;<volume>9</volume>:<fpage>94557</fpage>&#x2013;<lpage>72</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2021.3092735</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis</article-title>. <source>IEEE Signal Process Lett</source>. <year>2021</year>;<volume>28</volume>:<fpage>1898</fpage>&#x2013;<lpage>902</lpage>. doi:<pub-id pub-id-type="doi">10.1109/lsp.2021.3112314</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kuang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Du</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Disentangled representation learning for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the 30th ACM International Conference on Multimedia; 2022 Oct 10&#x2013;14</conf-name>; <publisher-loc>Lisbon, Portugal</publisher-loc>. p. <fpage>10</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ouyang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>HT</given-names></string-name></person-group>. <article-title>Modeling hierarchical uncertainty for multimodal emotion recognition in conversation</article-title>. <source>IEEE Trans Cybern</source>. <year>2022</year>;<volume>54</volume>(<issue>1</issue>):<fpage>187</fpage>&#x2013;<lpage>98</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcyb.2022.3185119</pub-id>; <pub-id pub-id-type="pmid">35820006</pub-id></mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>R</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name></person-group>. <article-title>MEmoBERT: pre-training model with prompt-based learning for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2022</year>. p. <fpage>4703</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Hao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ga</surname> <given-names>M</given-names></string-name>, <string-name><surname>Han</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Wireless sensing technology combined with facial expression to realize multimodal emotion recognition</article-title>. <source>Sensors</source>. <year>2022</year>;<volume>23</volume>(<issue>1</issue>):<fpage>338</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s23010338</pub-id>; <pub-id pub-id-type="pmid">36616935</pub-id></mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kumar</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>K</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>A</given-names></string-name></person-group>. <article-title>MEmoR: a multimodal emotion recognition using affective biomarkers for smart prediction of emotional health for people analytics in smart industries</article-title>. <source>Image Vis Comput</source>. <year>2022</year>;<volume>123</volume>:<fpage>104483</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.imavis.2022.104483</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yoon</surname> <given-names>YC</given-names></string-name></person-group>. <article-title>Can we exploit all datasets? Multimodal emotion recognition using cross-modal translation</article-title>. <source>IEEE Access</source>. <year>2022</year>;<volume>10</volume>:<fpage>64516</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2022.3183587</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>B</given-names></string-name>, <string-name><surname>Fei</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Teng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Chua</surname> <given-names>TS</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29&#x2013;Nov 3</conf-name>; <publisher-loc>Ottawa, ON, Canada</publisher-loc>. p. <fpage>5923</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3581783.3612053</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Tao</surname> <given-names>X</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>W</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Multimodal emotion recognition based on audio and text by using hybrid attention networks</article-title>. <source>Biomed Signal Process Control</source>. <year>2023</year>;<volume>85</volume>:<fpage>105052</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.bspc.2023.105052</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ma</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>B</given-names></string-name></person-group>. <article-title>A transformer-based model with self-distillation for multimodal emotion recognition in conversations</article-title>. <source>IEEE Trans Multim</source>. <year>2023</year>;<volume>26</volume>:<fpage>776</fpage>&#x2013;<lpage>88</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tmm.2023.3271019</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Le</surname> <given-names>HD</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>GS</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>SH</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>HJ</given-names></string-name></person-group>. <article-title>Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning</article-title>. <source>IEEE Access</source>. <year>2023</year>;<volume>11</volume>:<fpage>14742</fpage>&#x2013;<lpage>51</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2023.3244390</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pe&#x00F1;a</surname> <given-names>D</given-names></string-name>, <string-name><surname>Aguilera</surname> <given-names>A</given-names></string-name>, <string-name><surname>Dongo</surname> <given-names>I</given-names></string-name>, <string-name><surname>Heredia</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cardinale</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>A framework to evaluate fusion methods for multimodal emotion recognition</article-title>. <source>IEEE Access</source>. <year>2023</year>;<volume>11</volume>:<fpage>10218</fpage>&#x2013;<lpage>37</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2023.3240420</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cheng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Semi-supervised multimodal emotion recognition with expression MAE</article-title>. In: <conf-name>Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29&#x2013;Nov 3</conf-name>; <publisher-loc>Ottawa, ON, Canada</publisher-loc>. p. <fpage>9436</fpage>&#x2013;<lpage>40</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3581783.3612840</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tellamekala</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Amiriparian</surname> <given-names>S</given-names></string-name>, <string-name><surname>Schuller</surname> <given-names>BW</given-names></string-name>, <string-name><surname>Andr&#x00E9;</surname> <given-names>E</given-names></string-name>, <string-name><surname>Giesbrecht</surname> <given-names>T</given-names></string-name>, <string-name><surname>Valstar</surname> <given-names>M</given-names></string-name></person-group>. <article-title>COLD fusion: calibrated and ordinal latent distribution fusion for uncertainty-aware multimodal emotion recognition</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. <year>2023</year>;<volume>46</volume>(<issue>2</issue>):<fpage>805</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tpami.2023.3325770</pub-id>; <pub-id pub-id-type="pmid">37851557</pub-id></mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zuo</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>G</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4&#x2013;10</conf-name>; <publisher-loc>Rhodes, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition from facial expression and speech based on feature fusion</article-title>. <source>Multim Tools Appl</source>. <year>2023</year>;<volume>82</volume>(<issue>11</issue>):<fpage>16359</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-022-14185-0</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mocanu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Tapu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zaharia</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning</article-title>. <source>Image Vis Comput</source>. <year>2023</year>;<volume>133</volume>:<fpage>104676</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.imavis.2023.104676</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Maji</surname> <given-names>B</given-names></string-name>, <string-name><surname>Swain</surname> <given-names>M</given-names></string-name>, <string-name><surname>Guha</surname> <given-names>R</given-names></string-name>, <string-name><surname>Routray</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition based on deep temporal features using cross-modal transformer and self-attention</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4&#x2013;10</conf-name>; <publisher-loc>Rhodes, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cheng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>ZQ</given-names></string-name>, <string-name><surname>He</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lian</surname> <given-names>Z</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Emotion-LLaMA: multimodal emotion recognition and reasoning with instruction tuning</article-title>. In: <conf-name>Proceedings of Advances in Neural Information Processing Systems</conf-name>. <publisher-loc>London, UK</publisher-loc>; <year>2024</year>. Vol. 37, p. <fpage>110805</fpage>&#x2013;<lpage>53</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>T</given-names></string-name></person-group>. <article-title>DialogueMLLM: transforming multimodal emotion recognition in conversation through instruction-tuned MLLM</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>121048</fpage>&#x2013;<lpage>60</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2025.3591447</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Meng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Shou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>W</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Masked graph learning with recurrent alignment for multimodal emotion recognition in conversation</article-title>. <source>IEEE/ACM Trans Audio Speech Lang Process</source>. <year>2024</year>;<volume>32</volume>:<fpage>4298</fpage>&#x2013;<lpage>312</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taslp.2024.3434495</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>N</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Adversarial alignment and graph fusion via information bottleneck for multimodal emotion recognition in conversations</article-title>. <source>Inf Fusion</source>. <year>2024</year>;<volume>112</volume>:<fpage>102590</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2024.102590</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Dynamic emotion-dependent network with relational subgraph interaction for multimodal emotion recognition</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2025</year>;<volume>16</volume>(<issue>2</issue>):<fpage>712</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2024.3461148</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Makhmudov</surname> <given-names>F</given-names></string-name>, <string-name><surname>Kultimuratov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>YI</given-names></string-name></person-group>. <article-title>Enhancing multimodal emotion recognition through attention mechanisms in BERT and CNN architectures</article-title>. <source>Appl Sci</source>. <year>2024</year>;<volume>14</volume>(<issue>10</issue>):<fpage>4199</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app14104199</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yin</surname> <given-names>G</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>C</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Token-disentangling mutual transformer for multimodal emotion recognition</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>133</volume>:<fpage>108348</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2024.108348</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Kumar</surname> <given-names>P</given-names></string-name>, <string-name><surname>Malik</surname> <given-names>S</given-names></string-name>, <string-name><surname>Raman</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Interpretable multimodal emotion recognition using hybrid fusion of speech and image data</article-title>. <source>Multim Tools Appl</source>. <year>2024</year>;<volume>83</volume>(<issue>10</issue>):<fpage>28373</fpage>&#x2013;<lpage>94</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-023-16443-1</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yi</surname> <given-names>MH</given-names></string-name>, <string-name><surname>Kwak</surname> <given-names>KC</given-names></string-name>, <string-name><surname>Shin</surname> <given-names>JH</given-names></string-name></person-group>. <article-title>KoHMT: a multimodal emotion recognition model integrating KoELECTRA, HuBERT with multimodal transformer</article-title>. <source>Electronics</source>. <year>2024</year>;<volume>13</volume>(<issue>23</issue>):<fpage>4674</fpage>. doi:<pub-id pub-id-type="doi">10.3390/electronics13234674</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>CP</given-names></string-name></person-group>. <article-title>SIA-Net: sparse interactive attention network for multimodal emotion recognition</article-title>. <source>IEEE Trans Computat Soc Syst</source>. <year>2024</year>;<volume>11</volume>(<issue>5</issue>):<fpage>6782</fpage>&#x2013;<lpage>94</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcss.2024.3409715</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>J</given-names></string-name></person-group>. <article-title>End-to-end multimodal emotion recognition based on facial expressions and remote photoplethysmography signals</article-title>. <source>IEEE J Biomed Health Inform</source>. <year>2024</year>;<volume>28</volume>(<issue>10</issue>):<fpage>6054</fpage>&#x2013;<lpage>63</lpage>. doi:<pub-id pub-id-type="doi">10.1109/jbhi.2024.3430310</pub-id>; <pub-id pub-id-type="pmid">39024092</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Kou</surname> <given-names>KI</given-names></string-name>, <string-name><surname>Du</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Comprehensive multisource learning network for cross-subject multimodal emotion recognition</article-title>. <source>IEEE Trans Emerg Topics Computat Intell</source>. <year>2025</year>;<volume>9</volume>(<issue>1</issue>):<fpage>365</fpage>&#x2013;<lpage>80</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tetci.2024.3406422</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hong</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cho</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Cross-modal dynamic transfer learning for multimodal emotion recognition</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>14324</fpage>&#x2013;<lpage>33</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3356185</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>F</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition calibration in conversations</article-title>. In: <conf-name>Proceedings of the 32nd ACM International Conference on Multimedia; 2024 Oct 28&#x2013;Nov 1</conf-name>; <publisher-loc>Melbourne, VIC, Australia</publisher-loc>. p. <fpage>9621</fpage>&#x2013;<lpage>30</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Khan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tran</surname> <given-names>PN</given-names></string-name>, <string-name><surname>Pham</surname> <given-names>NT</given-names></string-name>, <string-name><surname>El Saddik</surname> <given-names>A</given-names></string-name>, <string-name><surname>Othmani</surname> <given-names>A</given-names></string-name></person-group>. <article-title>MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion</article-title>. <source>Sci Rep</source>. <year>2025</year>;<volume>15</volume>(<issue>1</issue>):<fpage>5473</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-89202-x</pub-id>; <pub-id pub-id-type="pmid">39953105</pub-id>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cambria</surname> <given-names>E</given-names></string-name>, <string-name><surname>Rida</surname> <given-names>I</given-names></string-name>, <string-name><surname>L&#x00F3;pez</surname> <given-names>JS</given-names></string-name>, <string-name><surname>Cui</surname> <given-names>L</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>RMER-DT: robust multimodal emotion recognition in conversational contexts based on diffusion and transformers</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>123</volume>:<fpage>103268</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2025.103268</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ai</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Shou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Revisiting multimodal emotion recognition in conversation from the perspective of graph spectrum</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence; 2025 Feb 25&#x2013;Mar 4</conf-name>; <publisher-loc>Philadelphia, PA, USA</publisher-loc>. p. <fpage>11418</fpage>&#x2013;<lpage>26</lpage>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Boitel</surname> <given-names>E</given-names></string-name>, <string-name><surname>Mohasseb</surname> <given-names>A</given-names></string-name>, <string-name><surname>Haig</surname> <given-names>E</given-names></string-name></person-group>. <article-title>MIST: multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis</article-title>. <source>Expert Syst Appl</source>. <year>2025</year>;<volume>270</volume>:<fpage>126236</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.126236</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Khalane</surname> <given-names>A</given-names></string-name>, <string-name><surname>Makwana</surname> <given-names>R</given-names></string-name>, <string-name><surname>Shaikh</surname> <given-names>T</given-names></string-name>, <string-name><surname>Ullah</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Evaluating significant features in context-aware multimodal emotion recognition with XAI methods</article-title>. <source>Expert Syst</source>. <year>2025</year>;<volume>42</volume>(<issue>1</issue>):<fpage>e13403</fpage>. doi:<pub-id pub-id-type="doi">10.1111/exsy.13403</pub-id>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>G&#x00FC;ler</surname> <given-names>SE</given-names></string-name>, <string-name><surname>Akbulut</surname> <given-names>FP</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition: emotion classification through the integration of EEG and facial expressions</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>24587</fpage>&#x2013;<lpage>603</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2025.3538642</pub-id>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Fan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Jing</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Audio-visual learning for multimodal emotion recognition</article-title>. <source>Symmetry</source>. <year>2025</year>;<volume>17</volume>(<issue>3</issue>):<fpage>418</fpage>. doi:<pub-id pub-id-type="doi">10.3390/sym17030418</pub-id>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dutta</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ganapathy</surname> <given-names>S</given-names></string-name></person-group>. <article-title>LLM supervised pre-training for multimodal emotion recognition in conversations</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2025 Apr 6&#x2013;11</conf-name>; <publisher-loc>Hyderabad, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Deng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Bian</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Lai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition</article-title>. <source>Inf Fusion</source>. <year>2025</year>;<volume>114</volume>(<issue>1</issue>):<fpage>102711</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.inffus.2024.102711</pub-id>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Enhancing robustness against adversarial attacks in multimodal emotion recognition with spiking transformers</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>34584</fpage>&#x2013;<lpage>97</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2025.3544086</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Enhancing multimodal emotion recognition through multi-granularity cross-modal alignment</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2025 Apr 6&#x2013;11</conf-name>; <publisher-loc>Hyderabad, India</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mengara Mengara</surname> <given-names>AG</given-names></string-name>, <string-name><surname>Moon</surname> <given-names>Y-K</given-names></string-name></person-group>. <article-title>CAG-MoE: multimodal emotion recognition with cross-attention gated mixture of experts</article-title>. <source>Mathematics</source>. <year>2025</year>;<volume>13</volume>(<issue>12</issue>):<fpage>1907</fpage>. doi:<pub-id pub-id-type="doi">10.3390/math13121907</pub-id>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Cross-modal gated feature enhancement for multimodal emotion recognition in conversations</article-title>. <source>Sci Rep</source>. <year>2025</year>;<volume>15</volume>(<issue>1</issue>):<fpage>30004</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-11989-6</pub-id>; <pub-id pub-id-type="pmid">40819129</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Dixit</surname> <given-names>C</given-names></string-name>, <string-name><surname>Satapathy</surname> <given-names>SM</given-names></string-name></person-group>. <article-title>Deep CNN with late fusion for real time multimodal emotion recognition</article-title>. <source>Expert Syst Appl</source>. <year>2024</year>;<volume>240</volume>(<issue>1</issue>):<fpage>122579</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2023.122579</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Franceschini</surname> <given-names>R</given-names></string-name>, <string-name><surname>Fini</surname> <given-names>E</given-names></string-name>, <string-name><surname>Beyan</surname> <given-names>C</given-names></string-name>, <string-name><surname>Conti</surname> <given-names>A</given-names></string-name>, <string-name><surname>Arrigoni</surname> <given-names>F</given-names></string-name>, <string-name><surname>Ricci</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition with modality-pairwise unsupervised contrastive loss</article-title>. In: <conf-name>Proceedings of the 26th International Conference on Pattern Recognition; 2022 Aug 21&#x2013;25</conf-name>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>2589</fpage>&#x2013;<lpage>96</lpage>.</mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>F</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>R</given-names></string-name></person-group>. <article-title>A persona-infused cross-task graph network for multimodal emotion recognition with emotion shift detection in conversations</article-title>. In: <conf-name>Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2024 Jul 14&#x2013;18</conf-name>; <publisher-loc>Washington, DC, USA</publisher-loc>. p. <fpage>14</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-59"><label>[59]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name></person-group>. <article-title>TDFNet: transformer-based deep-scale fusion network for multimodal emotion recognition</article-title>. <source>IEEE/ACM Trans Audio Speech Lang Process</source>. <year>2023</year>;<volume>31</volume>:<fpage>3771</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taslp.2023.3316458</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>[60]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>D</given-names></string-name>, <string-name><surname>He</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Using auxiliary tasks in multimodal fusion of wav2vec 2.0 and BERT for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4&#x2013;10</conf-name>; <publisher-loc>Rhodes, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-61"><label>[61]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Knowledge-aware Bayesian co-attention for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4&#x2013;10</conf-name>; <publisher-loc>Rhodes, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-62"><label>[62]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Jia</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>W</given-names></string-name></person-group>. <article-title>A multimodal emotion recognition model integrating speech, video and MoCAP</article-title>. <source>Multim Tools Appl</source>. <year>2022</year>;<volume>81</volume>(<issue>22</issue>):<fpage>32265</fpage>&#x2013;<lpage>86</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-022-13091-9</pub-id>.</mixed-citation></ref>
<ref id="ref-63"><label>[63]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Hou</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition with temporal and semantic consistency</article-title>. <source>IEEE/ACM Trans Audio Speech Lang Process</source>. <year>2021</year>;<volume>29</volume>:<fpage>3592</fpage>&#x2013;<lpage>603</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taslp.2021.3129331</pub-id>.</mixed-citation></ref>
<ref id="ref-64"><label>[64]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Salas-C&#x00E1;ceres</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lorenzo-Navarro</surname> <given-names>J</given-names></string-name>, <string-name><surname>Freire-Obreg&#x00F3;n</surname> <given-names>D</given-names></string-name>, <string-name><surname>Castrill&#x00F3;n-Santana</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition based on a fusion of audiovisual information with temporal dynamics</article-title>. <source>Multim Tools Appl</source>. <year>2025</year>;<volume>84</volume>(<issue>23</issue>):<fpage>27327</fpage>&#x2013;<lpage>43</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-024-20227-6</pub-id>.</mixed-citation></ref>
<ref id="ref-65"><label>[65]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bilotti</surname> <given-names>U</given-names></string-name>, <string-name><surname>Bisogni</surname> <given-names>C</given-names></string-name>, <string-name><surname>De Marsico</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tramonte</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition via convolutional neural networks: comparison of different strategies on two multimodal datasets</article-title>. <source>Eng Appl Artif Intell</source>. <year>2024</year>;<volume>130</volume>(<issue>1</issue>):<fpage>107708</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2023.107708</pub-id>.</mixed-citation></ref>
<ref id="ref-66"><label>[66]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chandraumakantham</surname> <given-names>O</given-names></string-name>, <string-name><surname>Gowtham</surname> <given-names>N</given-names></string-name>, <string-name><surname>Zakariah</surname> <given-names>M</given-names></string-name>, <string-name><surname>Almazyad</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition using feature fusion: an LLM-based approach</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>108052</fpage>&#x2013;<lpage>71</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3425953</pub-id>.</mixed-citation></ref>
<ref id="ref-67"><label>[67]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Radoi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Cioroiu</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Uncertainty-based learning of a lightweight model for multimodal emotion recognition</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>:<fpage>120362</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2024.3450674</pub-id>.</mixed-citation></ref>
<ref id="ref-68"><label>[68]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sharafi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yazdchi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Rasti</surname> <given-names>R</given-names></string-name>, <string-name><surname>Nasimi</surname> <given-names>F</given-names></string-name></person-group>. <article-title>A novel spatio-temporal convolutional neural framework for multimodal emotion recognition</article-title>. <source>Biomed Signal Process Control</source>. <year>2022</year>;<volume>78</volume>:<fpage>103970</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.bspc.2022.103970</pub-id>.</mixed-citation></ref>
<ref id="ref-69"><label>[69]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Middya</surname> <given-names>AI</given-names></string-name>, <string-name><surname>Nag</surname> <given-names>B</given-names></string-name>, <string-name><surname>Roy</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities</article-title>. <source>Knowl Based Syst</source>. <year>2022</year>;<volume>244</volume>(<issue>3</issue>):<fpage>108580</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.knosys.2022.108580</pub-id>.</mixed-citation></ref>
<ref id="ref-70"><label>[70]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Luna-Jim&#x00E9;nez</surname> <given-names>C</given-names></string-name>, <string-name><surname>Kleinlein</surname> <given-names>R</given-names></string-name>, <string-name><surname>Griol</surname> <given-names>D</given-names></string-name>, <string-name><surname>Callejas</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Montero</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Fern&#x00E1;ndez-Mart&#x00ED;nez</surname> <given-names>F</given-names></string-name></person-group>. <article-title>A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset</article-title>. <source>Appl Sci</source>. <year>2022</year>;<volume>12</volume>(<issue>1</issue>):<fpage>327</fpage>. doi:<pub-id pub-id-type="doi">10.3390/app12010327</pub-id>.</mixed-citation></ref>
<ref id="ref-71"><label>[71]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Luna-Jim&#x00E9;nez</surname> <given-names>C</given-names></string-name>, <string-name><surname>Griol</surname> <given-names>D</given-names></string-name>, <string-name><surname>Callejas</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Kleinlein</surname> <given-names>R</given-names></string-name>, <string-name><surname>Montero</surname> <given-names>JM</given-names></string-name>, <string-name><surname>Fern&#x00E1;ndez-Mart&#x00ED;nez</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition on RAVDESS dataset using transfer learning</article-title>. <source>Sensors</source>. <year>2021</year>;<volume>21</volume>(<issue>22</issue>):<fpage>7665</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s21227665</pub-id>; <pub-id pub-id-type="pmid">34833739</pub-id></mixed-citation></ref>
<ref id="ref-72"><label>[72]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tan</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Duan</surname> <given-names>F</given-names></string-name>, <string-name><surname>Sol&#x00E9;-Casals</surname> <given-names>J</given-names></string-name>, <string-name><surname>Caiafa</surname> <given-names>CF</given-names></string-name></person-group>. <article-title>A multimodal emotion recognition method based on facial expressions and electroencephalography</article-title>. <source>Biomed Signal Process Control</source>. <year>2021</year>;<volume>70</volume>:<fpage>103029</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.bspc.2021.103029</pub-id>.</mixed-citation></ref>
<ref id="ref-73"><label>[73]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Qu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition from EEG signals and facial expressions</article-title>. <source>IEEE Access</source>. <year>2023</year>;<volume>11</volume>:<fpage>33061</fpage>&#x2013;<lpage>8</lpage>. doi:<pub-id pub-id-type="doi">10.1109/access.2023.3263670</pub-id>.</mixed-citation></ref>
<ref id="ref-74"><label>[74]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cheng</surname> <given-names>C</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>L</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Dense graph convolutional with joint cross-attention network for multimodal emotion recognition</article-title>. <source>IEEE Trans Computat Soc Syst</source>. <year>2024</year>;<volume>11</volume>(<issue>5</issue>):<fpage>6672</fpage>&#x2013;<lpage>83</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcss.2024.3412074</pub-id>.</mixed-citation></ref>
<ref id="ref-75"><label>[75]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Jin</surname> <given-names>L</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>G</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>K</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>CAMEL: capturing metaphorical alignment with context disentangling for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence; 2024 Feb 20&#x2013;27</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>9341</fpage>&#x2013;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="ref-76"><label>[76]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Filali</surname> <given-names>H</given-names></string-name>, <string-name><surname>Boulealam</surname> <given-names>C</given-names></string-name>, <string-name><surname>El Fazazy</surname> <given-names>K</given-names></string-name>, <string-name><surname>Mahraz</surname> <given-names>AM</given-names></string-name>, <string-name><surname>Tairi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Riffi</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Meaningful multimodal emotion recognition based on capsule graph transformer architecture</article-title>. <source>Information</source>. <year>2025</year>;<volume>16</volume>(<issue>1</issue>):<fpage>40</fpage>. doi:<pub-id pub-id-type="doi">10.3390/info16010040</pub-id>.</mixed-citation></ref>
<ref id="ref-77"><label>[77]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Meng</surname> <given-names>T</given-names></string-name>, <string-name><surname>Shou</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>W</given-names></string-name>, <string-name><surname>Yin</surname> <given-names>N</given-names></string-name>, <string-name><surname>Li</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Deep imbalanced learning for multimodal emotion recognition in conversations</article-title>. <source>IEEE Trans Artif Intell</source>. <year>2024</year>;<volume>5</volume>(<issue>12</issue>):<fpage>6472</fpage>&#x2013;<lpage>87</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tai.2024.3445325</pub-id>.</mixed-citation></ref>
<ref id="ref-78"><label>[78]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Fine-grained disentangled representation learning for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2024 Apr 14&#x2013;19</conf-name>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>11051</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-79"><label>[79]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Pan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A twin disentanglement transformer network with hierarchical-level feature reconstruction for robust multimodal emotion recognition</article-title>. <source>Expert Syst Appl</source>. <year>2025</year>;<volume>264</volume>(<issue>5</issue>):<fpage>125822</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2024.125822</pub-id>.</mixed-citation></ref>
<ref id="ref-80"><label>[80]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zheng</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Multi-channel weight-sharing autoencoder based on cascade multi-head attention for multimodal emotion recognition</article-title>. <source>IEEE Trans Multim</source>. <year>2022</year>;<volume>25</volume>:<fpage>2213</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tmm.2022.3144885</pub-id>.</mixed-citation></ref>
<ref id="ref-81"><label>[81]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Schuller</surname> <given-names>BW</given-names></string-name>, <string-name><surname>Li</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Contrastive learning based modality-invariant feature acquisition for robust multimodal emotion recognition with missing modalities</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2024</year>;<volume>15</volume>(<issue>4</issue>):<fpage>1856</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2024.3378570</pub-id>.</mixed-citation></ref>
<ref id="ref-82"><label>[82]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Busso</surname> <given-names>C</given-names></string-name>, <string-name><surname>Bulut</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>CC</given-names></string-name>, <string-name><surname>Kazemzadeh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mower</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>IEMOCAP: interactive emotional dyadic motion capture database</article-title>. <source>Lang Resour Evaluat</source>. <year>2008</year>;<volume>42</volume>(<issue>4</issue>):<fpage>335</fpage>&#x2013;<lpage>59</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10579-008-9076-6</pub-id>.</mixed-citation></ref>
<ref id="ref-83"><label>[83]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Livingstone</surname> <given-names>SR</given-names></string-name>, <string-name><surname>Russo</surname> <given-names>FA</given-names></string-name></person-group>. <article-title>The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English</article-title>. <source>PLoS One</source>. <year>2018</year>;<volume>13</volume>(<issue>5</issue>):<fpage>e0196391</fpage>. doi:<pub-id pub-id-type="doi">10.32920/25412950.v1</pub-id>.</mixed-citation></ref>
<ref id="ref-84"><label>[84]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Jackson</surname> <given-names>P</given-names></string-name>, <string-name><surname>Haq</surname> <given-names>S</given-names></string-name></person-group>. <source>Surrey audio-visual expressed emotion (SAVEE) database</source>. <publisher-loc>Guildford, UK</publisher-loc>: <publisher-name>University of Surrey</publisher-name>; <year>2014</year>.</mixed-citation></ref>
<ref id="ref-85"><label>[85]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Martin</surname> <given-names>O</given-names></string-name>, <string-name><surname>Kotsia</surname> <given-names>I</given-names></string-name>, <string-name><surname>Macq</surname> <given-names>B</given-names></string-name>, <string-name><surname>Pitas</surname> <given-names>I</given-names></string-name></person-group>. <article-title>The eNTERFACE&#x2019;05 audio-visual emotion database</article-title>. In: <conf-name>Proceedings of the 22nd International Conference on Data Engineering Workshops; 2006 Apr 3&#x2013;7</conf-name>; <publisher-loc>Atlanta, GA, USA</publisher-loc>.</mixed-citation></ref>
<ref id="ref-86"><label>[86]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cooper</surname> <given-names>DG</given-names></string-name>, <string-name><surname>Keutmann</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Gur</surname> <given-names>RC</given-names></string-name>, <string-name><surname>Nenkova</surname> <given-names>A</given-names></string-name>, <string-name><surname>Verma</surname> <given-names>R</given-names></string-name></person-group>. <article-title>CREMA-D: crowd-sourced emotional multimodal actors dataset</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2014</year>;<volume>5</volume>(<issue>4</issue>):<fpage>377</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2014.2336244</pub-id>; <pub-id pub-id-type="pmid">25653738</pub-id>.</mixed-citation></ref>
<ref id="ref-87"><label>[87]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Busso</surname> <given-names>C</given-names></string-name>, <string-name><surname>Parthasarathy</surname> <given-names>S</given-names></string-name>, <string-name><surname>Burmania</surname> <given-names>A</given-names></string-name>, <string-name><surname>AbdelWahab</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sadoughi</surname> <given-names>N</given-names></string-name>, <string-name><surname>Provost</surname> <given-names>EM</given-names></string-name></person-group>. <article-title>MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2016</year>;<volume>8</volume>(<issue>1</issue>):<fpage>67</fpage>&#x2013;<lpage>80</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2016.2515617</pub-id>.</mixed-citation></ref>
<ref id="ref-88"><label>[88]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhalehpour</surname> <given-names>S</given-names></string-name>, <string-name><surname>Onder</surname> <given-names>O</given-names></string-name>, <string-name><surname>Akhtar</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Erdem</surname> <given-names>CE</given-names></string-name></person-group>. <article-title>BAUM-1: a spontaneous audio-visual face database of affective and mental states</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2016</year>;<volume>8</volume>(<issue>3</issue>):<fpage>300</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2016.2553038</pub-id>.</mixed-citation></ref>
<ref id="ref-89"><label>[89]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Kollias</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zafeiriou</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Aff-Wild2: extending the Aff-Wild database for affect recognition</article-title>. <comment>arXiv:1811.07770</comment>. <year>2018</year>.</mixed-citation></ref>
<ref id="ref-90"><label>[90]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zadeh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Zellers</surname> <given-names>R</given-names></string-name>, <string-name><surname>Pincus</surname> <given-names>E</given-names></string-name>, <string-name><surname>Morency</surname> <given-names>LP</given-names></string-name></person-group>. <article-title>MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos</article-title>. <comment>arXiv:1606.06259</comment>. <year>2016</year>.</mixed-citation></ref>
<ref id="ref-91"><label>[91]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zadeh</surname> <given-names>AB</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>PP</given-names></string-name>, <string-name><surname>Poria</surname> <given-names>S</given-names></string-name>, <string-name><surname>Cambria</surname> <given-names>E</given-names></string-name>, <string-name><surname>Morency</surname> <given-names>LP</given-names></string-name></person-group>. <article-title>Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph</article-title>. In: <conf-name>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics; 2018 Jul 15&#x2013;20</conf-name>; <publisher-loc>Melbourne, VIC, Australia</publisher-loc>. p. <fpage>2236</fpage>&#x2013;<lpage>46</lpage>.</mixed-citation></ref>
<ref id="ref-92"><label>[92]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Poria</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hazarika</surname> <given-names>D</given-names></string-name>, <string-name><surname>Majumder</surname> <given-names>N</given-names></string-name>, <string-name><surname>Naik</surname> <given-names>G</given-names></string-name>, <string-name><surname>Cambria</surname> <given-names>E</given-names></string-name>, <string-name><surname>Mihalcea</surname> <given-names>R</given-names></string-name></person-group>. <article-title>MELD: a multimodal multi-party dataset for emotion recognition in conversations</article-title>. In: <conf-name>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019 Jul 28&#x2013;Aug 2</conf-name>; <publisher-loc>Florence, Italy</publisher-loc>. p. <fpage>527</fpage>&#x2013;<lpage>36</lpage>.</mixed-citation></ref>
<ref id="ref-93"><label>[93]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Su</surname> <given-names>H</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Li</surname> <given-names>W</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Niu</surname> <given-names>S</given-names></string-name></person-group>. <article-title>DailyDialog: a manually labelled multi-turn dialogue dataset</article-title>. In: <conf-name>Proceedings of the Eighth International Joint Conference on Natural Language Processing; 2017 Nov 27&#x2013;Dec 1</conf-name>; <publisher-loc>Taipei, Taiwan</publisher-loc>. p. <fpage>986</fpage>&#x2013;<lpage>95</lpage>.</mixed-citation></ref>
<ref id="ref-94"><label>[94]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zahiri</surname> <given-names>SM</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>JD</given-names></string-name></person-group>. <article-title>Emotion detection on TV show transcripts with sequence-based convolutional neural networks</article-title>. In: <conf-name>Proceedings of the AAAI Workshops; 2018 Feb 2&#x2013;3</conf-name>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>44</fpage>&#x2013;<lpage>52</lpage>.</mixed-citation></ref>
<ref id="ref-95"><label>[95]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hasan</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Rahman</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zadeh</surname> <given-names>AB</given-names></string-name>, <string-name><surname>Zhong</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tanveer</surname> <given-names>MI</given-names></string-name>, <string-name><surname>Morency</surname> <given-names>LP</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>UR-FUNNY: a multimodal language dataset for understanding humor</article-title>. In: <conf-name>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; 2019 Nov 3&#x2013;7</conf-name>; <publisher-loc>Hong Kong, China</publisher-loc>. p. <fpage>2046</fpage>&#x2013;<lpage>56</lpage>.</mixed-citation></ref>
<ref id="ref-96"><label>[96]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>L</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>MER 2023: multi-label learning, modality robustness, and semi-supervised learning</article-title>. In: <conf-name>Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29&#x2013;Nov 3</conf-name>; <publisher-loc>Ottawa, ON, Canada</publisher-loc>. p. <fpage>9610</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-97"><label>[97]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>MER 2024: semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing; 2024 Oct 28&#x2013;Nov 1</conf-name>; <publisher-loc>Melbourne, VIC, Australia</publisher-loc>. p. <fpage>41</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-98"><label>[98]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>MER 2025: when affective computing meets large language models</article-title>. In: <conf-name>Proceedings of the 33rd ACM International Conference on Multimedia; 2025 Oct 27&#x2013;31</conf-name>; <publisher-loc>San Francisco, CA, USA</publisher-loc>. p. <fpage>13837</fpage>&#x2013;<lpage>42</lpage>.</mixed-citation></ref>
<ref id="ref-99"><label>[99]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Koelstra</surname> <given-names>S</given-names></string-name>, <string-name><surname>Muhl</surname> <given-names>C</given-names></string-name>, <string-name><surname>Soleymani</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>JS</given-names></string-name>, <string-name><surname>Yazdani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ebrahimi</surname> <given-names>T</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>DEAP: a database for emotion analysis; using physiological signals</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2011</year>;<volume>3</volume>(<issue>1</issue>):<fpage>18</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.1109/t-affc.2011.15</pub-id>.</mixed-citation></ref>
<ref id="ref-100"><label>[100]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Soleymani</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lichtenauer</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pun</surname> <given-names>T</given-names></string-name>, <string-name><surname>Pantic</surname> <given-names>M</given-names></string-name></person-group>. <article-title>A multimodal database for affect recognition and implicit tagging</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2011</year>;<volume>3</volume>(<issue>1</issue>):<fpage>42</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1109/t-affc.2011.25</pub-id>.</mixed-citation></ref>
<ref id="ref-101"><label>[101]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Walter</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>X</given-names></string-name>, <string-name><surname>Werner</surname> <given-names>P</given-names></string-name>, <string-name><surname>Al-Hamadi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Traue</surname> <given-names>HC</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>&#x201C;BioVid Emo DB&#x201D;: a multimodal database for emotion analyses validated by subjective ratings</article-title>. In: <conf-name>Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence; 2016 Dec 6&#x2013;9</conf-name>; <publisher-loc>Athens, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-102"><label>[102]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Miranda-Correa</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Abadi</surname> <given-names>MK</given-names></string-name>, <string-name><surname>Sebe</surname> <given-names>N</given-names></string-name>, <string-name><surname>Patras</surname> <given-names>I</given-names></string-name></person-group>. <article-title>AMIGOS: a dataset for affect, personality and mood research on individuals and groups</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2018</year>;<volume>12</volume>(<issue>2</issue>):<fpage>479</fpage>&#x2013;<lpage>93</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2018.2884461</pub-id>.</mixed-citation></ref>
<ref id="ref-103"><label>[103]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Katsigiannis</surname> <given-names>S</given-names></string-name>, <string-name><surname>Ramzan</surname> <given-names>N</given-names></string-name></person-group>. <article-title>DREAMER: a database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices</article-title>. <source>IEEE J Biomed Health Inform</source>. <year>2017</year>;<volume>22</volume>(<issue>1</issue>):<fpage>98</fpage>&#x2013;<lpage>107</lpage>. doi:<pub-id pub-id-type="doi">10.1109/jbhi.2017.2688239</pub-id>; <pub-id pub-id-type="pmid">28368836</pub-id>.</mixed-citation></ref>
<ref id="ref-104"><label>[104]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kollias</surname> <given-names>D</given-names></string-name></person-group>. <article-title>ABAW: valence-arousal estimation, expression recognition, action unit detection &#x0026; multi-task learning challenges</article-title>. In: <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 19&#x2013;24</conf-name>; <publisher-loc>New Orleans, LA, USA</publisher-loc>. p. <fpage>2328</fpage>&#x2013;<lpage>36</lpage>.</mixed-citation></ref>
<ref id="ref-105"><label>[105]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Lin</surname> <given-names>F</given-names></string-name>, <string-name><surname>Crawford</surname> <given-names>S</given-names></string-name>, <string-name><surname>Guillot</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>X</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>MMST-ViT: climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer</article-title>. In: <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 2&#x2013;6</conf-name>; <publisher-loc>Paris, France</publisher-loc>. p. <fpage>5774</fpage>&#x2013;<lpage>84</lpage>.</mixed-citation></ref>
<ref id="ref-106"><label>[106]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mohamed</surname> <given-names>SA</given-names></string-name>, <string-name><surname>Maksoud</surname> <given-names>OOA</given-names></string-name>, <string-name><surname>Fathy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mohamed</surname> <given-names>AS</given-names></string-name>, <string-name><surname>Hosny</surname> <given-names>K</given-names></string-name>, <string-name><surname>Keshk</surname> <given-names>HM</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>A hybrid deep learning and rule-based model for smart weather forecasting and crop recommendation using satellite imagery</article-title>. <source>Sci Rep</source>. <year>2025</year>;<volume>15</volume>(<issue>1</issue>):<fpage>36102</fpage>. doi:<pub-id pub-id-type="doi">10.1038/s41598-025-21506-4</pub-id>; <pub-id pub-id-type="pmid">41094001</pub-id>.</mixed-citation></ref>
<ref id="ref-107"><label>[107]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Feng</surname> <given-names>K</given-names></string-name>, <string-name><surname>Chaspari</surname> <given-names>T</given-names></string-name></person-group>. <article-title>A review of generalizable transfer learning in automatic emotion recognition</article-title>. <source>Front Comput Sci</source>. <year>2020</year>;<volume>2</volume>:<fpage>9</fpage>. doi:<pub-id pub-id-type="doi">10.3389/fcomp.2020.00009</pub-id>.</mixed-citation></ref>
<ref id="ref-108"><label>[108]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name></person-group>. <article-title>GNN-based multi-source domain prototype representation for cross-subject EEG emotion recognition</article-title>. <source>Neurocomputing</source>. <year>2024</year>;<volume>609</volume>(<issue>3</issue>):<fpage>128445</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.neucom.2024.128445</pub-id>.</mixed-citation></ref>
<ref id="ref-109"><label>[109]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>He</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Dual-task contrastive meta-learning for few-shot cross-domain emotion recognition</article-title>. <source>Comput Mater Contin</source>. <year>2025</year>;<volume>82</volume>(<issue>2</issue>):<fpage>2331</fpage>&#x2013;<lpage>52</lpage>. doi:<pub-id pub-id-type="doi">10.32604/cmc.2024.059115</pub-id>.</mixed-citation></ref>
<ref id="ref-110"><label>[110]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gong</surname> <given-names>X</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>CP</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Cross-cultural emotion recognition with EEG and eye movement signals based on multiple stacked broad learning system</article-title>. <source>IEEE Trans Comput Soc Syst</source>. <year>2023</year>;<volume>11</volume>(<issue>2</issue>):<fpage>2014</fpage>&#x2013;<lpage>25</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcss.2023.3298324</pub-id>.</mixed-citation></ref>
<ref id="ref-111"><label>[111]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Shilandari</surname> <given-names>A</given-names></string-name>, <string-name><surname>Marvi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Khosravi</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Speech emotion recognition using data augmentation method by cycle-generative adversarial networks</article-title>. <source>Signal Image Video Process</source>. <year>2022</year>;<volume>16</volume>(<issue>7</issue>):<fpage>1955</fpage>&#x2013;<lpage>62</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11760-022-02156-9</pub-id>.</mixed-citation></ref>
<ref id="ref-112"><label>[112]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Hu</surname> <given-names>EJ</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wallis</surname> <given-names>P</given-names></string-name>, <string-name><surname>Allen-Zhu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>LoRA: low-rank adaptation of large language models</article-title>. <comment>arXiv:2106.09685</comment>. <year>2022</year>.</mixed-citation></ref>
<ref id="ref-113"><label>[113]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition method in complex dynamic scenes</article-title>. <source>J Inform Intell</source>. <year>2025</year>;<volume>3</volume>(<issue>3</issue>):<fpage>257</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jiixd.2025.02.004</pub-id>.</mixed-citation></ref>
<ref id="ref-114"><label>[114]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Richet</surname> <given-names>N</given-names></string-name>, <string-name><surname>Belharbi</surname> <given-names>S</given-names></string-name>, <string-name><surname>Aslam</surname> <given-names>H</given-names></string-name>, <string-name><surname>Schadt</surname> <given-names>ME</given-names></string-name>, <string-name><surname>Gonz&#x00E1;lez-Gonz&#x00E1;lez</surname> <given-names>M</given-names></string-name>, <string-name><surname>Cortal</surname> <given-names>G</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Textualized and feature-based models for compound multimodal emotion recognition in the wild</article-title>. In: <conf-name>Proceedings of the European Conference on Computer Vision; 2024 Sep 29&#x2013;Oct 4</conf-name>; <publisher-loc>Milan, Italy</publisher-loc>. p. <fpage>60</fpage>&#x2013;<lpage>78</lpage>.</mixed-citation></ref>
<ref id="ref-115"><label>[115]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Guo</surname> <given-names>L</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ding</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation</article-title>. <source>Knowl Based Syst</source>. <year>2024</year>;<volume>296</volume>(<issue>3</issue>):<fpage>111969</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.knosys.2024.111969</pub-id>.</mixed-citation></ref>
<ref id="ref-116"><label>[116]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cheng</surname> <given-names>B</given-names></string-name></person-group>. <article-title>RL-EMO: a reinforcement learning framework for multimodal emotion recognition</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2024 Apr 14&#x2013;19</conf-name>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>10246</fpage>&#x2013;<lpage>50</lpage>.</mixed-citation></ref>
<ref id="ref-117"><label>[117]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>B</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition based on facial expressions, speech, and EEG</article-title>. <source>IEEE Open J Eng Med Biol</source>. <year>2023</year>;<volume>5</volume>:<fpage>396</fpage>&#x2013;<lpage>403</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ojemb.2023.3240280</pub-id>; <pub-id pub-id-type="pmid">38899017</pub-id></mixed-citation></ref>
<ref id="ref-118"><label>[118]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>C</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>D</given-names></string-name></person-group>. <article-title>Semi-supervised multimodal emotion recognition with class-balanced pseudo-labeling</article-title>. In: <conf-name>Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29&#x2013;Nov 3</conf-name>; <publisher-loc>Ottawa, ON, Canada</publisher-loc>. p. <fpage>9556</fpage>&#x2013;<lpage>60</lpage>.</mixed-citation></ref>
<ref id="ref-119"><label>[119]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Deschamps-Berger</surname> <given-names>T</given-names></string-name>, <string-name><surname>Lamel</surname> <given-names>L</given-names></string-name>, <string-name><surname>Devillers</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Exploring attention mechanisms for multimodal emotion recognition in an emergency call center corpus</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4&#x2013;10</conf-name>; <publisher-loc>Rhodes, Greece</publisher-loc>. p. <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-120"><label>[120]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Pedrycz</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hirota</surname> <given-names>K</given-names></string-name></person-group>. <article-title>K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human-robot interaction</article-title>. <source>IEEE Trans Ind Electron</source>. <year>2022</year>;<volume>70</volume>(<issue>1</issue>):<fpage>1016</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tie.2022.3150097</pub-id>.</mixed-citation></ref>
<ref id="ref-121"><label>[121]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Tashu</surname> <given-names>TM</given-names></string-name>, <string-name><surname>Hajiyeva</surname> <given-names>S</given-names></string-name>, <string-name><surname>Horvath</surname> <given-names>T</given-names></string-name></person-group>. <article-title>Multimodal emotion recognition from art using sequential co-attention</article-title>. <source>J Imaging</source>. <year>2021</year>;<volume>7</volume>(<issue>8</issue>):<fpage>157</fpage>. doi:<pub-id pub-id-type="doi">10.3390/jimaging7080157</pub-id>; <pub-id pub-id-type="pmid">34460793</pub-id></mixed-citation></ref>
<ref id="ref-122"><label>[122]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>L</given-names></string-name>, <etal>et al.</etal></person-group> <article-title>Multimodal emotion recognition with capsule graph convolutional based representation fusion</article-title>. In: <conf-name>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; 2021 Jun 6&#x2013;11</conf-name>; <publisher-loc>Toronto, ON, Canada</publisher-loc>. p. <fpage>6339</fpage>&#x2013;<lpage>43</lpage>.</mixed-citation></ref>
<ref id="ref-123"><label>[123]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Diao</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Speech expression multimodal emotion recognition based on deep belief network</article-title>. <source>J Grid Comput</source>. <year>2021</year>;<volume>19</volume>(<issue>2</issue>):<fpage>22</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s10723-021-09564-0</pub-id>.</mixed-citation></ref>
<ref id="ref-124"><label>[124]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bhattacharya</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gupta</surname> <given-names>RK</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Exploring the contextual factors affecting multimodal emotion recognition in videos</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2021</year>;<volume>14</volume>(<issue>2</issue>):<fpage>1547</fpage>&#x2013;<lpage>57</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2021.3071503</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>

