<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">75616</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2026.075616</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition</article-title>
<alt-title alt-title-type="left-running-head">SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition</alt-title>
<alt-title alt-title-type="right-running-head">SQSNet: Hybrid CNN-Transformer Fusion with Spatial Quad-Similarity for Robust Facial Expression Recognition</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Ahmed</surname><given-names>Mohammed A.</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Dong</surname><given-names>Jian</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>dongjian@csu.edu.cn</email></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Shi</surname><given-names>Ronghua</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Nassr</surname><given-names>Ammar</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Almaqtari</surname><given-names>Hani</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Alsanabani</surname><given-names>Ala A.</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Computer Science and Engineering, Central South University</institution>, <addr-line>Changsha</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>School of Electronic Information, Central South University</institution>, <addr-line>Changsha</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>School of Artificial Intelligence, Xidian University</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jian Dong. Email: <email>dongjian@csu.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>9</day><month>4</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>3</issue>
<elocation-id>72</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>02</month>
<year>2026</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors. Published by Tech Science Press.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>The Authors</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_75616.pdf"></self-uri>
<abstract>
<p>Facial Expression Recognition (FER) is an essential task in computer vision, with applications in human-computer interaction, emotion assessment, and mental health monitoring. Although Convolutional Neural Networks (CNNs) have proven effective for FER, they have difficulty capturing long-range dependencies and global context. To address these constraints, we propose the Spatial Quad-Similarity Network (SQSNet), a hybrid framework that integrates the local feature extraction capabilities of CNNs with the global contextual modeling of Swin Transformers via a cohesive fusion technique. SQSNet introduces the Spatial Quad-Similarity (SQS) module, a feature refinement approach that amplifies discriminative characteristics and mitigates redundancy. Unlike conventional metric learning approaches that operate on global feature representations, SQS computes fine-grained spatial-level similarity across multiple instances, enforcing H &#x00D7; W independent constraints that preserve spatial correspondence between expression-relevant facial regions. This spatial-level formulation is particularly effective for FER, where expressions manifest as localized muscle movements that are lost in global pooling operations. Moreover, SQSNet employs regularization methods, including Mixup augmentation, label smoothing, and adaptive learning rate scheduling, to enhance generalization. Experimental findings on three benchmark datasets, RAF-DB, FERPlus, and AffectNet, indicate that SQSNet surpasses current FER methods, attaining state-of-the-art accuracies of 91.90%, 91.11%, and 67.15%, respectively. These findings underscore the efficacy of integrating CNNs, Swin Transformers, and spatial similarity-driven feature refinement for facial expression recognition, facilitating the development of more dependable emotion recognition systems.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Facial expression recognition</kwd>
<kwd>convolutional neural networks</kwd>
<kwd>swin transformers</kwd>
<kwd>cross attention</kwd>
<kwd>adaptive learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Facial Expression Recognition (FER) is a fundamental task in computer vision with wide-ranging applications, including human-computer interaction, emotion analysis, mental health monitoring, and autonomous systems. The ability to accurately and robustly recognize facial expressions is crucial for understanding human emotions and behaviors. Despite significant advancements in deep learning, FER remains a challenging problem due to variations in lighting conditions, facial poses, occlusions, and the subtle differences between expressions [<xref ref-type="bibr" rid="ref-1">1</xref>].</p>
<p>Traditional approaches to FER have predominantly relied on Convolutional Neural Networks (CNNs), which excel at capturing local spatial features [<xref ref-type="bibr" rid="ref-2">2</xref>&#x2013;<xref ref-type="bibr" rid="ref-4">4</xref>]. However, CNNs often struggle to model long-range dependencies and global contextual information, which are essential for distinguishing between similar expressions. Recent advancements in vision transformers, particularly the Swin Transformer, have demonstrated remarkable success in capturing global context through self-attention mechanisms. However, transformers typically require large amounts of data and computational resources, and their performance on tasks requiring fine-grained local feature extraction, such as FER, can be suboptimal when used in isolation [<xref ref-type="bibr" rid="ref-5">5</xref>]. Recent advances in FER have explored hybrid CNN-Transformer architectures to leverage complementary strengths. While these approaches show promise, they typically fuse features at the global level through concatenation or standard attention mechanisms, which may not fully exploit the fine-grained spatial relationships critical for distinguishing subtle expression differences. Moreover, existing metric learning approaches for FER operate on holistic feature representations after global pooling, potentially losing spatial correspondence between expression-relevant facial regions. To address these limitations, our work introduces spatial-level similarity learning that preserves and refines fine-grained spatial relationships before any global aggregation, providing richer supervision signals for learning discriminative features.</p>
<p>Based on this motivation, we introduce the Spatial Quad-Similarity Network (SQSNet), a novel hybrid architecture that synergistically combines CNNs and Swin Transformers. SQSNet leverages the local pattern recognition capabilities of CNNs alongside the hierarchical, long-range dependency modeling of Swin Transformers. These two feature streams are fused through a dedicated Spatial Quad-Similarity (SQS) mechanism designed to enhance multi-scale feature integration. This enables the model to better capture subtle facial variations while maintaining awareness of global facial structure. To further enhance robustness and generalization, SQSNet incorporates state-of-the-art regularization and optimization strategies, including Mixup-based data augmentation, label smoothing, and adaptive learning rate scheduling. Evaluations on three benchmark FER datasets demonstrate that SQSNet achieves superior performance and outperforms several state-of-the-art methods in terms of both accuracy and robustness. The contributions of this work are summarized as follows:
<list list-type="simple">
<list-item><label>1.</label><p>We propose a novel hybrid framework that integrates convolutional neural networks and Swin Transformers, leveraging their complementary strengths to enhance facial expression recognition. This design effectively captures both fine-grained local features and broader global contextual information, leading to more robust and accurate expression analysis.</p></list-item>
<list-item><label>2.</label><p>We propose a novel module called Spatial Quad-Similarity, which refines features across four instances by leveraging adaptive spatial cross-similarity attention. Unlike standard cross learning, SQS focuses on fine-grained spatial interactions between samples, promoting greater intra-class compactness and enhancing inter-class separability at the feature level.</p></list-item>
<list-item><label>3.</label><p>We employ Mixup, label smoothing, and adaptive learning rate scheduling to improve generalization and robustness.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<p>Facial Expression Recognition has seen significant advancements in recent years, driven by the development of deep learning techniques. This section reviews state-of-the-art approaches in FER, categorized into CNN-based methods, attention-based methods, and Transformer-based methods.</p>
<sec id="s2_1">
<label>2.1</label>
<title>CNN-Based FER</title>
<p>Convolutional neural networks have been the cornerstone of FER due to their ability to capture local spatial features and hierarchical representations. Recent works have focused on improving CNN architectures to address challenges such as occlusions, pose variations, and subtle expression differences. Wang et al. [<xref ref-type="bibr" rid="ref-1">1</xref>] proposed a CNN-based uncertainty suppression framework for large-scale FER, demonstrating the effectiveness of CNNs in handling noisy and ambiguous data. Minaee et al. [<xref ref-type="bibr" rid="ref-2">2</xref>] introduced a multi-task learning framework using CNNs to jointly learn facial expressions and auxiliary attributes such as age and gender. Their work highlighted the robustness of CNNs in handling fine-grained feature extraction for FER. Li et al. [<xref ref-type="bibr" rid="ref-3">3</xref>] proposed a deep locality-preserving CNN that leverages crowd-sourced data to improve FER in real-world scenarios. Their approach demonstrated the effectiveness of CNNs in capturing local features while addressing dataset biases. Bodapati et al. [<xref ref-type="bibr" rid="ref-6">6</xref>] proposed a novel deep learning-based strategy for FER using a deep convolutional neural network model (FERNet). The model is designed to learn hidden nonlinearities from input facial images, which are crucial for accurately discriminating emotions. FERNet consists of a sequence of blocks, each containing multiple convolutional and sub-sampling layers. Febrian et al. [<xref ref-type="bibr" rid="ref-7">7</xref>] proposed a deep learning architecture to enhance FER performance by introducing a BiLSTM-CNN model, which combines CNN with a Bidirectional Long Short-Term Memory (BiLSTM) network. The study compares the proposed BiLSTM-CNN model with standalone CNN and LSTM-CNN models. Zou et al. 
[<xref ref-type="bibr" rid="ref-8">8</xref>] proposed a lightweight Multi-feature Fusion Based Convolutional Neural Network (MFF-CNN) for FER, designed to explore expression features at distinct abstract levels and regions. The model consists of two branches: the Image Branch, which extracts mid-level and high-level global features from the entire input image, and the Patch Branch, which extracts local features from sixteen image patches. Feature selection based on L2 norm is applied to enhance the discriminative power of local features, and joint tuning is used to integrate and fuse features from both branches.</p>
<p>Despite the significant advances in CNN-based FER, several limitations persist. While CNNs excel at capturing local spatial features and hierarchical representations, they often struggle to model long-range dependencies and global contextual information, which are crucial for distinguishing between subtle expressions. Additionally, CNNs require extensive computational resources and large datasets to achieve optimal performance, making them less efficient for real-time applications or scenarios with limited data.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Attention-Based FER</title>
<p>Attention mechanisms have been integrated into FER frameworks to enhance the model&#x2019;s ability to focus on discriminative facial regions. These methods aim to improve performance by dynamically weighting important features while suppressing irrelevant ones. Minaee et al. [<xref ref-type="bibr" rid="ref-2">2</xref>] introduced an attentional CNN architecture that combines spatial and channel attention to improve FER performance. Their work demonstrated the effectiveness of attention mechanisms in capturing subtle expression details. Zhang et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] proposed a hybrid attention mechanism that integrates spatial and temporal attention for video-based FER. Their approach achieved state-of-the-art results by focusing on the most relevant frames and regions. Chen et al. [<xref ref-type="bibr" rid="ref-10">10</xref>] developed a self-supervised attention framework for FER, which leverages unlabeled data to improve the model&#x2019;s ability to focus on discriminative facial regions. Zhang et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] proposed a cross-fusion dual-attention network for FER in the wild, which integrates a grouped dual-attention mechanism and a novel C2 activation function. Their approach achieved state-of-the-art results by refining local features, capturing global information, and addressing challenges such as occlusion and blurring. Zhou et al. [<xref ref-type="bibr" rid="ref-12">12</xref>] proposed a cross-attention and hybrid feature weighting network for emotion recognition in video clips, which integrates a dual-branch encoding network and a hierarchical-attention encoding network. Their approach achieved state-of-the-art results by capturing complementary information between facial expressions and contextual cues, addressing challenges such as emotion confusion and misunderstanding. Le Ngwe et al. 
[<xref ref-type="bibr" rid="ref-13">13</xref>] proposed a lightweight patch and attention network (PAtt-Lite) for FER under challenging conditions, which integrates a patch extraction block and an attention classifier. Their approach achieved state-of-the-art results by enhancing local feature representation and improving feature learning, addressing challenges such as occlusion and blurring in real-world scenarios.</p>
<p>Despite the advancements in attention-based methods for FER, several limitations remain. While attention mechanisms have improved the ability of models to focus on discriminative facial regions, they often require significant computational resources and complex architectures, making them less suitable for real-time or resource-constrained applications. Additionally, attention mechanisms can struggle with occlusions and extreme pose variations, as they may incorrectly weight irrelevant or noisy regions. Hybrid approaches, while effective, often involve increased model complexity and training difficulty, limiting their practicality. Furthermore, lightweight attention-based models may still face challenges in generalizing to diverse and unconstrained environments due to their reliance on local feature extraction. These limitations highlight the need for more robust, efficient, and scalable attention mechanisms that can effectively handle real-world FER challenges while maintaining computational efficiency.</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Transformer-Based FER</title>
<p>Transformers have gained popularity in computer vision tasks due to their ability to model long-range dependencies and global contextual information. However, their application to FER is still evolving, with challenges such as high computational costs and the need for large datasets. Liu et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a facial muscle movement-aware representation learning framework for FER, which integrates a discriminative feature generation module and a muscle relationship mining module. Their approach achieved state-of-the-art results by learning semantic relationships of facial muscle movements and addressing challenges such as occlusion, arbitrary orientations, and illumination in real-world scenarios. Xue et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] proposed a novel approach for FER in the wild by introducing two attentive pooling (AP) modules, Attentive Patch Pooling (APP) and Attentive Token Pooling (ATP), to address the limitations of Vision Transformers (ViT) in FER tasks. While ViTs often underperform compared to CNNs due to difficulties in convergence and a tendency to focus on occluded or noisy areas, the proposed AP modules aim to pool noisy features directly, emphasizing the most discriminative features while reducing the impact of less relevant ones. APP selects the most informative patches from CNN features, and ATP discards unimportant tokens in ViT, both of which are simple to implement, parameter-free, and computationally efficient. Li et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] proposed a multimodal supervision-steering transformer for FER in the wild, referred to as FER-former, to address the limitations of narrow receptive fields and homogenous supervisory signals in existing methods. 
The FER-former introduces a hybrid feature extraction pipeline that cascades CNNs and transformers to expand receptive fields, along with a heterogeneous domain-steering supervision module that incorporates text-space semantic correlations to enhance image features. Additionally, a FER-specific transformer encoder is designed to process both conventional one-hot label-focused tokens and CLIP-based text-oriented tokens in parallel, enabling the capture of global receptive fields with multimodal semantic cues. Chen et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] proposed a privacy-preserving few-shot FER system using a self-supervised vision transformer (SSF-ViT) to address challenges such as privacy concerns, insufficient labeled data, and class imbalance in real-world FER tasks. The system integrates self-supervised learning (SSL) and few-shot learning (FSL) to train a deep learning model with limited labeled samples. The SSF-ViT framework involves pretraining a ViT encoder using four self-supervised pretext tasks&#x2014;image denoising and reconstruction, image rotation prediction, jigsaw puzzle, and masked patch prediction&#x2014;followed by fine-tuning on a lab-controlled FER dataset to extract spatiotemporal features. For few-shot classification, prototypes are constructed from support sets, and query samples are classified by computing Euclidean distances to these prototypes.</p>
<p>Despite the growing popularity of transformers in FER, several limitations hinder their widespread adoption. Transformers, while effective at modeling long-range dependencies and global contextual information, often require large amounts of data and significant computational resources, making them less practical for real-time or resource-constrained applications. Additionally, transformers can struggle with fine-grained feature extraction, which is critical for distinguishing subtle facial expressions, as they tend to focus on global patterns rather than local details. Although hybrid approaches, such as those combining CNNs and transformers, have shown promise in addressing these challenges, they often introduce increased model complexity and training difficulty. Furthermore, transformers are susceptible to overfitting when trained on small datasets, limiting their effectiveness in scenarios with limited labeled data. These limitations highlight the need for more efficient and scalable transformer-based architectures that can balance global and local feature extraction while maintaining computational efficiency and generalization capabilities in FER tasks.</p>
<p>Recent hybrid approaches have emerged that combine CNNs with Transformers to leverage their complementary strengths. Liang et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed CT-DBN, which uses a dual-branch network with CNN and Transformer, achieving promising results through feature concatenation. Similarly, other methods have explored various fusion strategies to combine local and global features. However, these methods typically fuse features at the global or feature level, without explicitly modeling spatial-level relationships between samples. Our work differs by introducing spatial-level similarity learning that operates before global aggregation, providing fine-grained supervision that is particularly beneficial for capturing subtle expression differences.</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Distinction from Metric Learning Approaches</title>
<p>While our work shares conceptual similarities with metric learning methods, it differs fundamentally in how similarity is computed and applied. Traditional metric learning approaches, including triplet loss and quadruplet loss, operate on holistic feature representations extracted after global pooling operations. Given a feature map <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>F</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, these methods first compute a global descriptor g &#x003D; GlobalPool(F) <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:math></inline-formula>, then enforce distance constraints in this C-dimensional embedding space. This formulation produces a single global constraint that treats the entire face as a holistic entity. In contrast, our Spatial Quad-Similarity (SQS) module operates at the spatial level before any global aggregation. For each spatial location (<italic>i</italic>, <italic>j</italic>), we independently compute similarity constraints across the quad of anchor, positive, and two negative samples, yielding H &#x00D7; W independent local similarity constraints rather than a single global constraint. Why This Matters for FER: Facial expressions are characterized by localized muscle movements. A smile involves the mouth corners (AU12: lip corner puller), while surprise involves the eyebrows and eyes (AU1 &#x002B; 2: brow raiser, AU5: upper lid raiser). 
Global pooling in standard metric learning destroys the spatial correspondence between, for example, mouth regions across different images, which is essential for distinguishing a smile from a neutral expression. By computing similarity at each spatial location independently, SQS preserves these fine-grained spatial relationships throughout the learning process. Furthermore, unlike pairwise (triplet) or triple (quadruplet) comparisons in standard metric learning, our quad-based formulation with two negative samples provides richer contrastive supervision. This is particularly beneficial for FER, where multiple expressions may share a similar global appearance but differ in subtle local details (e.g., fear and surprise both involve wide eyes but differ in mouth shape and eyebrow position).</p>
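<p>The contrast between a single pooled constraint and H &#x00D7; W per-location constraints can be illustrated with a minimal Python sketch. It assumes cosine similarity as the comparison measure and toy nested-list feature maps; both are illustrative choices rather than the SQS formulation itself, and the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def global_similarity(f1, f2):
    """One similarity value, computed after global average pooling.

    f1, f2: feature maps as nested lists indexed [c][i][j] (C x H x W).
    """
    def pool(f):
        # Average each channel over all H x W locations.
        return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in f]
    return cosine(pool(f1), pool(f2))

def spatial_similarity(f1, f2):
    """H x W similarity values, one per spatial location (i, j)."""
    C, H, W = len(f1), len(f1[0]), len(f1[0][0])
    return [
        [cosine([f1[c][i][j] for c in range(C)],
                [f2[c][i][j] for c in range(C)])
         for j in range(W)]
        for i in range(H)
    ]

# Toy maps with C=2, H=1, W=2: map b swaps the two locations of map a.
a = [[[1.0, 0.0]], [[0.0, 1.0]]]
b = [[[0.0, 1.0]], [[1.0, 0.0]]]
g = global_similarity(a, b)   # single scalar
s = spatial_similarity(a, b)  # H x W grid of scalars
```
</preformat>
<p>In this toy case the pooled descriptors of the two maps coincide, so the global similarity is 1, whereas both per-location similarities are 0. Pooling has discarded where each activation occurs, which is the loss of spatial correspondence described above.</p>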
<p>Comparison with Recent Hybrid FER Methods: Recent works have explored CNN-Transformer fusion for FER, such as FER-former [<xref ref-type="bibr" rid="ref-16">16</xref>], MMATrans [<xref ref-type="bibr" rid="ref-14">14</xref>], and CF-DAN [<xref ref-type="bibr" rid="ref-11">11</xref>]. However, these methods typically fuse features at the global level using concatenation, addition, or standard cross-attention mechanisms. For instance, FER-former employs cross-attention between CNN and Transformer features but operates on sequence-level representations after spatial pooling. MMATrans fuses multi-scale features but does not explicitly model spatial-level similarity relationships between samples. Our SQS module is unique in enforcing spatial-level similarity constraints across multiple instances during training, providing fine-grained supervision that guides the network to learn more discriminative spatial features. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> illustrates the fundamental architectural differences between these approaches.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Comparison of feature learning paradigms. (<bold>A</bold>) Traditional metric learning enforces a single global constraint after pooling spatial information. (<bold>B</bold>) Standard cross-attention fuses features at the global level. (<bold>C</bold>) Our SQS module computes similarity independently at each spatial location, yielding H &#x00D7; W fine-grained constraints that preserve spatial correspondence between expression-relevant regions.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_75616-fig-1.tif"/>
</fig>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Method</title>
<sec id="s3_1">
<label>3.1</label>
<title>Architecture Overview</title>
<p>In this section, we present SQSNet, a hybrid framework for robust facial expression recognition. SQSNet combines the local feature extraction capabilities of convolutional neural networks and the global contextual modeling of Swin Transformers into a unified feature space. To further refine feature representations, we introduce the Spatial Quad-Similarity module, which adaptively enhances discriminative features and suppresses redundant ones through fine-grained spatial interactions across multiple instances. The overall architecture consists of three main components: (1) a hybrid CNN-Transformer backbone for feature extraction, (2) the Spatial Quad-Similarity module for feature refinement, and (3) classification heads optimized with advanced regularization techniques. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> illustrates the complete pipeline of SQSNet.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Overview of the SQSNet architecture. SQSNet integrates local feature extraction from EfficientNet-B6 and global context modeling from Swin Transformer into a unified feature space via feature concatenation. During training, four instances (anchor, positive, negative1, negative2) are processed through shared backbone networks. The spatial quad-similarity module computes fine-grained spatial-level similarity matrices between the anchor-positive and negative1-negative2 pairs to adaptively refine features, enhancing intra-class compactness and inter-class separability. Two classifiers are employed: a base classifier operating on backbone features, and a cross-similarity-enhanced classifier. During inference, only the base branch and its classifier are retained, ensuring computational efficiency without additional inference cost.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_75616-fig-2.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Hybrid Backbone Architecture</title>
<p>The backbone network of SQSNet integrates two complementary architectures: EfficientNet-B6 [<xref ref-type="bibr" rid="ref-19">19</xref>] and Swin Transformer [<xref ref-type="bibr" rid="ref-20">20</xref>]. EfficientNet-B6 is used to capture detailed local features with strong inductive biases, while Swin Transformer models long-range dependencies and global structures through hierarchical window-based self-attention. The output features from both branches are concatenated to form a comprehensive multi-scale representation, preserving both spatial detail and global context necessary for effective FER.</p>
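<p>A minimal sketch of the concatenation step follows, assuming each branch has already produced a pooled feature vector. The dimensions 2304 and 768 are those stated for the two branches in Sections 3.2.1 and 3.2.2; the placeholder vectors and the function name are hypothetical and stand in for real backbone outputs.</p>
<preformat preformat-type="code">
```python
D_CNN = 2304   # EfficientNet-B6 feature dimension (d1 in the text)
D_SWIN = 768   # Swin Transformer feature dimension (d2 in the text)

def fuse(f_cnn, f_swin):
    """Concatenate local (CNN) and global (Swin) feature vectors."""
    if len(f_cnn) != D_CNN or len(f_swin) != D_SWIN:
        raise ValueError("unexpected feature dimension")
    return list(f_cnn) + list(f_swin)

# Placeholder vectors; a real pipeline would take these from the backbones.
fused = fuse([0.0] * D_CNN, [1.0] * D_SWIN)
```
</preformat>
<p>Under these assumptions the fused vector has 2304 &#x002B; 768 &#x003D; 3072 entries, with the CNN features first and the Swin features second.</p>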
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Local Feature Module</title>
<p>The local feature module is implemented using EfficientNet-B6 pretrained on ImageNet. Given an input image <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>X</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>, the CNN backbone extracts local spatial features <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>2304</mml:mn></mml:math></inline-formula>. This is represented as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>(&#x22C5;) denotes the EfficientNet-B6 feature extraction function.</p>
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Global Feature Module</title>
<p>The global feature module captures long-range dependencies and global contextual information from the input image. To this end, we employ a Swin Transformer, which models global context efficiently through a shifted window&#x2013;based self-attention mechanism. For the same input image <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>X</mml:mi></mml:math></inline-formula>, the Swin Transformer extracts hierarchical multi-scale features denoted as <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>d</italic><sub>2</sub> &#x003D; 768. This is expressed as:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">w</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>(&#x22C5;) represents the Swin Transformer feature extraction function.</p>
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Cross-Attention Fusion Module</title>
<p>To effectively combine the local and global features from the two backbones, we introduce a cross-attention fusion module. This module dynamically fuses <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>n</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to produce a unified feature representation <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>d</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>1024</mml:mn></mml:math></inline-formula>. The cross-attention mechanism computes attention scores between the CNN and Swin Transformer features. Let <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>n</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>w</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The attention scores are computed using scaled dot-product attention:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi mathvariant="bold-italic">t</mml:mi><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant="bold-italic">Q</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">K</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">k</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi mathvariant="bold-italic">d</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the dimensionality of the key vectors. The fused features <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are then computed as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">V</mml:mi></mml:mrow></mml:math></disp-formula></p>
<p>The fused features are passed through a fully connected layer to produce the final logits <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mi>C</mml:mi></mml:math></inline-formula> is the number of expression classes:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">W</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">F</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mi mathvariant="bold-italic">u</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mi mathvariant="bold-italic">d</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="bold-italic">b</mml:mi></mml:math></disp-formula>where <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>W</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>b</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the weight matrix and bias term, respectively.</p>
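<p>As a rough illustration of Eqs. (3) to (5), the following pure-Python sketch applies scaled dot-product cross-attention followed by a fully connected layer. It is only a sketch under stated assumptions: the CNN and Swin features are taken to be token sequences that have already been linearly projected to a shared dimension <italic>d<sub>k</sub></italic> (the text does not specify such projections, although the raw dimensions 2304 and 768 differ), the fused tokens are mean-pooled before classification, and the helper names <monospace>cross_attention</monospace> and <monospace>linear</monospace> as well as all toy values are hypothetical.</p>
<code language="python">
```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: A = softmax(Q K^T / sqrt(d_k)), output = A V.

    queries: n_q x d_k (assumed: projected CNN tokens)
    keys, values: n_k x d_k (assumed: projected Swin tokens)
    """
    d_k = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, values))
                      for c in range(len(values[0]))])
    return fused

def linear(weight, bias, x):
    """Fully connected layer z = W x + b for a single feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

# Toy example: 2 CNN tokens attend over 3 Swin tokens, with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = K
fused = cross_attention(Q, K, V)

# Mean-pool the fused tokens (an assumption of this sketch) before the classifier.
pooled = [sum(col) / len(fused) for col in zip(*fused)]
logits = linear([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0], pooled)
```
</code>
<p>Because each row of the attention matrix sums to one, every fused token is a convex combination of the value vectors, which is why the toy outputs stay within the range of the Swin tokens.</p>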
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Spatial Quad-Similarity Module</title>
<p>To enhance the discriminative quality of feature representations for facial expression recognition, we introduce the Spatial Quad-Similarity (SQS) module. This module refines features by explicitly modeling spatial-level relationships across four instances: an anchor image <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>A</mml:mi></mml:math></inline-formula>, a positive sample <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mi>P</mml:mi></mml:math></inline-formula> from the same class, and two negative samples <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> from different classes. Unlike conventional cross-attention or metric learning methods, which often use holistic features or pairwise comparisons, SQS operates at spatial granularity, allowing the model to better exploit fine-grained differences and similarities within facial regions.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Feature Extraction</title>
<p>Let the base network extract spatial feature maps <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>F</mml:mi><mml:mi>A</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> for the anchor, positive, and two negative samples, respectively, where <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mi>C</mml:mi></mml:math></inline-formula> here denotes the number of feature channels (not the number of expression classes used in Eq. (5)), and <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:math></inline-formula> is the spatial resolution.</p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Spatial Similarity Matrix</title>
<p>To explicitly model spatial relationships among the features of the four SQS branches, we extract, for each spatial location <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the feature vectors <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msubsup><mml:mi>f</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. We then compute adaptive similarity scores based on the squared Euclidean distance between corresponding spatial locations:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block">
<mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold">exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mo stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mi mathvariant="bold-italic">&#x03C4;</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo>
<mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">N</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold">exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mo stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">N</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mi mathvariant="bold-italic">&#x03C4;</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo>
<mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:msub><mml:mi mathvariant="bold-italic">N</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi mathvariant="bold">exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:msubsup><mml:mrow><mml:mo stretchy="false">&#x2016;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">N</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mi mathvariant="bold-italic">&#x03C4;</mml:mi></mml:mfrac><mml:mo>)</mml:mo></mml:mrow>
</mml:math></disp-formula>where <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msubsup><mml:mrow><mml:mo stretchy="false">&#x2016;</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">&#x2016;</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> denotes the squared Euclidean distance between the corresponding feature vectors, the subscripts <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mtext>&#x00A0;and&#x00A0;</mml:mtext><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> indicate the two distinct negative samples, and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:mi mathvariant="bold-italic">&#x03C4;</mml:mi></mml:math></inline-formula> is a temperature parameter controlling the sharpness of the similarity scores. The exponential function maps distances to similarity scores in (0, 1], where higher values indicate more similar features.</p>
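<p>The per-location similarity of Eq. (6) can be sketched in a few lines of pure Python. This is a minimal illustration rather than the implementation used in the experiments: feature maps are stored as nested lists indexed <monospace>[i][j][channel]</monospace> for readability, the function name <monospace>spatial_similarity</monospace> is hypothetical, and the temperature value is a placeholder because the text does not report one.</p>
<code language="python">
```python
import math

def spatial_similarity(feat_a, feat_b, tau=1.0):
    """Return S[i][j] = exp(-||a_ij - b_ij||_2^2 / tau) for two H x W x C maps.

    feat_a, feat_b: nested lists indexed as [i][j][channel] (layout assumed).
    tau: temperature; the default of 1.0 is a placeholder, not a reported value.
    """
    sim = []
    for row_a, row_b in zip(feat_a, feat_b):
        sim_row = []
        for vec_a, vec_b in zip(row_a, row_b):
            sq_dist = sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
            sim_row.append(math.exp(-sq_dist / tau))
        sim.append(sim_row)
    return sim

# Toy 1 x 2 feature maps with C = 2 channels.
anchor = [[[1.0, 0.0], [0.0, 1.0]]]
positive = [[[1.0, 0.0], [0.0, 0.0]]]
s_ap = spatial_similarity(anchor, positive, tau=1.0)
```
</code>
<p>Identical feature vectors give a similarity of exactly 1, and the score decays toward 0 as the vectors move apart; a smaller temperature makes that decay sharper.</p>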
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Quad-Based Attention Refinement</title>
<p>To enforce intra-class compactness and inter-class separability, we compute a weighted refinement map:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msup><mml:mi mathvariant="bold-italic">R</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-38"><mml:math 
id="mml-ieqn-38"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula> are learnable or fixed scalar weights (e.g., <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:mi>&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mn>1.0</mml:mn><mml:mo>,</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>) that control the relative emphasis on positive and negative similarities. The refined spatial features are then updated using a residual enhancement:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mrow><mml:mover><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B3;</mml:mi></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">R</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">f</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> is a scaling factor or learnable parameter. This step reinforces regions that align well with the positive and suppresses those that resemble the negatives.</p>
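<p>To make the refinement step concrete, the sketch below evaluates Eq. (7) and Eq. (8) at every spatial location. It assumes the similarity maps have already been computed, uses the example weights from the text (&#x03B1; &#x003D; 1.0, &#x03B2; &#x003D; 0.5), and picks &#x03B3; &#x003D; 0.1 purely as a placeholder; the function name <monospace>refine_anchor</monospace> and the <monospace>[i][j][channel]</monospace> layout are likewise assumptions.</p>
<code language="python">
```python
def refine_anchor(f_a, s_ap, s_an1, s_an2, alpha=1.0, beta=0.5, gamma=0.1):
    """Apply Eq. (7) and Eq. (8) at every spatial location.

    f_a: anchor features as [i][j][channel]; s_*: similarity maps as [i][j].
    gamma=0.1 is a placeholder; alpha and beta use the example values in the text.
    """
    refined = []
    for i, row in enumerate(f_a):
        out_row = []
        for j, vec in enumerate(row):
            # Eq. (7): reward similarity to the positive, penalize the negatives.
            r = alpha * s_ap[i][j] - beta * (s_an1[i][j] + s_an2[i][j])
            # Eq. (8): residual rescaling of the anchor feature vector.
            out_row.append([v + gamma * r * v for v in vec])
        refined.append(out_row)
    return refined

# One spatial location with two channels.
f_a = [[[2.0, 4.0]]]
out = refine_anchor(f_a, [[0.9]], [[0.2]], [[0.4]])
```
</code>
<p>In this toy case the refinement weight is 0.9 &#x2212; 0.5 &#x00D7; (0.2 &#x002B; 0.4) &#x003D; 0.6, so each channel is scaled by 1 &#x002B; 0.1 &#x00D7; 0.6 &#x003D; 1.06. A location that resembles the negatives more than the positive would receive a negative weight and be attenuated instead.</p>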
</sec>
<sec id="s3_3_4">
<label>3.3.4</label>
<title>Efficiency During Inference</title>
<p>To maintain inference efficiency, the SQS module is applied only during training, where it refines anchor features by explicitly enforcing similarity and dissimilarity constraints. The supervisory signals introduced by SQS are distilled into the parameters of the base branch through optimization, so the trained base encoder produces more discriminative embeddings at test time without additional computational overhead. At inference, only the base branch (i.e., the anchor pipeline) is retained, ensuring that model complexity and runtime remain unaffected.</p>
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Training Strategy</title>
<p>The model is optimized using a combination of cross-entropy losses across multiple branches and a Kullback-Leibler (KL) divergence loss for residual distillation between the base classifier and the cross-enhanced classifier. Mixup augmentation and label smoothing are applied to improve generalization. A learning rate scheduler based on cosine annealing with warm restarts is employed to stabilize training, and early stopping is triggered if validation performance does not improve over a set number of epochs. Together, these components are intended to give SQSNet stable convergence and strong generalization across different FER benchmarks.</p>
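<p>Two of the ingredients named above, cosine annealing with warm restarts and label smoothing, can be sketched with the standard library alone. The sketch follows the usual formulation of both techniques; every hyperparameter shown (base and minimum learning rate, initial cycle length, cycle multiplier, smoothing factor, and the seven-class setting) is an illustrative placeholder rather than a value reported for SQSNet, and the function names are hypothetical.</p>
<code language="python">
```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-3, min_lr=1e-6, t_0=10, t_mult=2):
    """Learning rate at an integer epoch under cosine annealing with warm restarts.

    The cycle length starts at t_0 epochs and is multiplied by t_mult after each restart.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:  # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur / t_i))

def smooth_labels(label, num_classes, eps=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    return [(1.0 - eps) * (1.0 if c == label else 0.0) + eps / num_classes
            for c in range(num_classes)]

# Learning rates for epochs 0..30; restarts occur at epochs 10 and 30.
lrs = [cosine_warm_restart_lr(e) for e in range(31)]
# Smoothed target for class 2 in an illustrative 7-class problem.
target = smooth_labels(label=2, num_classes=7)
```
</code>
<p>With these placeholder settings the learning rate decays along a cosine curve within each cycle and jumps back to the base value at each restart, while the smoothed target still sums to one.</p>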
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Classification Loss Function</title>
<p>The primary objective is facial expression classification, achieved through the <bold>cross-entropy loss</bold>. Given the predicted logits <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">R</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> from the final classifier and the ground truth label <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:mi>y</mml:mi></mml:math></inline-formula>, the classification loss is defined as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mtext mathvariant="bold">log&#xA0;</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:msup><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">C</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mi mathvariant="bold-italic">e</mml:mi><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">z</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>z</italic><sub><italic>y</italic></sub> is the logit of the ground-truth class and the sum runs over the logits <italic>z</italic><sub><italic>c</italic></sub> of all <italic>C</italic> classes.</p>
<p>This loss encourages the model to assign high probability to the correct class label.</p>
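<p>For reference, the standard softmax cross-entropy that Eq. (9) expresses can be computed in a numerically stable way by subtracting the largest logit before exponentiating. The snippet below is a minimal single-sample sketch; the function name <monospace>cross_entropy</monospace> and the toy logits are illustrative only.</p>
<code language="python">
```python
import math

def cross_entropy(logits, label):
    """Eq. (9): negative log-softmax of the true-class logit, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# Uniform logits over three classes give a loss of log(3).
loss_uniform = cross_entropy([0.0, 0.0, 0.0], label=1)
# A large logit on the correct class drives the loss toward zero.
loss_confident = cross_entropy([10.0, 0.0, 0.0], label=0)
```
</code>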
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>SQS Loss Function</title>
<p>The objective of the SQS module is to refine feature representations by encouraging intra-class compactness and inter-class separability at the spatial level. To achieve this, we define a quad-based contrastive loss that operates on the spatial similarity scores between the anchor and its corresponding positive and negative samples. Let <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denote the similarity between anchor <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>A</mml:mi></mml:math></inline-formula> and positive <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>P</mml:mi></mml:math></inline-formula>, and let <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> denote the similarities between anchor <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>A</mml:mi></mml:math></inline-formula> and the two negatives <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. The margin <italic>m</italic> is a hyperparameter that controls the minimum desired separation between positive and negative similarities; it is set to <italic>m</italic> &#x003D; 0.5 based on empirical validation on the training set, following common practice in margin-based contrastive objectives.
The SQS loss <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>S</mml:mi><mml:mi>Q</mml:mi><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is defined as:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi mathvariant="bold-italic">L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">Q</mml:mi><mml:mi mathvariant="bold-italic">S</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi mathvariant="bold-italic">H</mml:mi><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo mathvariant="bold">=</mml:mo><mml:mn mathvariant="bold">1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi mathvariant="bold-italic">j</mml:mi><mml:mo mathvariant="bold">=</mml:mo><mml:mn mathvariant="bold">1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">W</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>[</mml:mo><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn><mml:mo mathvariant="bold">,</mml:mo><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo mathvariant="bold">&#x2212;</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo mathvariant="bold">,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo mathvariant="bold">+</mml:mo><mml:mo movablelimits="true" 
form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn mathvariant="bold">0</mml:mn><mml:mo mathvariant="bold">,</mml:mo><mml:mi mathvariant="bold-italic">m</mml:mi></mml:mrow><mml:mo mathvariant="bold">+</mml:mo><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo mathvariant="bold">,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>H</italic> &#x00D7; <italic>W</italic> Spatial resolution of the feature map, <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>m</mml:mi></mml:math></inline-formula> margin hyperparameter that defines the minimum desired difference between positive and negative similarities <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>0.5</mml:mn></mml:math></inline-formula>. 
This loss encourages <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to be significantly larger than both <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msubsup><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">A</mml:mi><mml:mi mathvariant="bold-italic">N</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, for every spatial location <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
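<p>As a minimal sketch of Eq. (10), the following plain-Python function averages the two hinge terms over all spatial locations. It assumes the three similarity maps are already available as nested lists of equal size. The function name <monospace>sqs_margin_loss</monospace> and the toy values are illustrative and are not taken from the authors' implementation, which would normally operate on batched tensors.</p>
<code language="python"><![CDATA[
```python
def sqs_margin_loss(s_ap, s_an1, s_an2, margin=0.5):
    """Average the two hinge terms of Eq. (10) over all H x W locations.

    s_ap, s_an1, s_an2 are H x W nested lists holding the anchor-positive and
    the two anchor-negative similarity maps (hypothetical inputs).
    """
    height = len(s_ap)
    width = len(s_ap[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            # Each negative is penalised unless S_AP exceeds it by the margin.
            total += max(0.0, margin + s_an1[i][j] - s_ap[i][j])
            total += max(0.0, margin + s_an2[i][j] - s_ap[i][j])
    return total / (height * width)


# Toy 2 x 2 similarity maps (illustrative values, not results from the paper).
s_ap = [[0.9, 0.8], [0.7, 0.2]]
s_an1 = [[0.1, 0.5], [0.1, 0.4]]
s_an2 = [[0.2, 0.1], [0.3, 0.0]]
loss = sqs_margin_loss(s_ap, s_an1, s_an2, margin=0.5)
print(round(loss, 3))
```
]]></code>
<p>Locations where the positive similarity already exceeds both negatives by the margin contribute zero, so only violating locations drive the gradient.</p>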
</sec>
<sec id="s3_4_3">
<label>3.4.3</label>
<title>Total Loss Function</title>
<p>The overall training objective combines the standard cross-entropy classification loss with the SQS loss:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">T</mml:mi><mml:mi mathvariant="bold-italic">o</mml:mi><mml:mi mathvariant="bold-italic">t</mml:mi><mml:mi mathvariant="bold-italic">a</mml:mi><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">S</mml:mi><mml:mi mathvariant="bold-italic">Q</mml:mi><mml:mi mathvariant="bold-italic">S</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">c</mml:mi><mml:mi mathvariant="bold-italic">l</mml:mi><mml:mi mathvariant="bold-italic">s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the cross-entropy loss from the final classifier and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is a weighting coefficient that balances the influence of the SQS loss. It is tuned on the validation set, typically within the range <italic>&#x03BB;</italic> &#x003D; 0.1 to 1.0.</p>
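<p>The weighting in Eq. (11) can be sketched as follows. This is a plain-Python illustration under stated assumptions: <monospace>cross_entropy</monospace> and <monospace>total_loss</monospace> are hypothetical helper names, the logits and the SQS value are toy numbers, and the default weight of 0.5 is simply one value inside the 0.1 to 1.0 range mentioned above.</p>
<code language="python"><![CDATA[
```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def total_loss(loss_cls, loss_sqs, lam=0.5):
    """Eq. (11): classification loss plus lambda times the SQS loss."""
    return loss_cls + lam * loss_sqs


# Toy values for illustration only.
loss_cls = cross_entropy([2.0, 0.5, -1.0], target_index=0)
combined = total_loss(loss_cls, loss_sqs=0.325, lam=0.5)
print(round(loss_cls, 4), round(combined, 4))
```
]]></code>
<p>Setting the weight to zero recovers plain cross-entropy training, which gives a convenient baseline when tuning the coefficient on a validation set.</p>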
</sec>
<sec id="s3_4_4">
<label>3.4.4</label>
<title>Data Augmentation and Regularization</title>
<p>We use Mixup and CutMix augmentation to improve generalization. For Mixup, given two input samples <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math></inline-formula> the augmented sample <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is computed as:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">m</mml:mi><mml:mi mathvariant="bold-italic">i</mml:mi><mml:mi mathvariant="bold-italic">x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mrow><mml:mi mathvariant="normal">&#x03B2;</mml:mi></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mi>B</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a mixing coefficient.</p>
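<p>Eqs. (12) and (13) can be illustrated with a short standard-library sketch. It assumes images are flattened to lists of floats and labels are one-hot vectors. The helper name <monospace>mixup</monospace> and the toy inputs are hypothetical, and the value 0.4 for the Beta parameter follows the Mixup setting reported in the implementation details.</p>
<code language="python"><![CDATA[
```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.4, rng=random):
    """Blend two samples and their one-hot labels with beta ~ Beta(alpha, alpha)."""
    beta = rng.betavariate(alpha, alpha)
    x_mix = [beta * a + (1.0 - beta) * b for a, b in zip(x_i, x_j)]
    y_mix = [beta * a + (1.0 - beta) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, beta


# Toy inputs: three "pixels" and two classes (illustrative values only).
rng = random.Random(0)  # seeded generator so the example is repeatable
x_mix, y_mix, beta = mixup(
    [1.0, 0.0, 0.5], [1.0, 0.0],
    [0.0, 1.0, 0.5], [0.0, 1.0],
    alpha=0.4, rng=rng,
)
print(round(beta, 3), [round(v, 3) for v in y_mix])
```
]]></code>
<p>Because the same coefficient mixes inputs and labels, the soft label always sums to one. With a Beta parameter below 1, most draws fall near 0 or 1, so the majority of mixed samples stay close to one of the two originals.</p>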
</sec>
<sec id="s3_4_5">
<label>3.4.5</label>
<title>Optimization</title>
<p>The model is optimized using AdamW with a cosine annealing learning rate scheduler. The learning rate at iteration <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi mathvariant="bold-italic">t</mml:mi></mml:math></inline-formula> is given by:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mrow><mml:mtext mathvariant="bold">t</mml:mtext></mml:mrow></mml:msub><mml:mo mathvariant="bold">=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">min</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo mathvariant="bold">+</mml:mo><mml:mfrac><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mrow><mml:mtext mathvariant="bold">2</mml:mtext></mml:mrow></mml:mfrac><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">max</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B7;</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">min</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">1</mml:mtext></mml:mrow><mml:mo mathvariant="bold">+</mml:mo><mml:mi mathvariant="bold">cos</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">t</mml:mtext></mml:mrow><mml:mi>&#x03C0;</mml:mi></mml:mrow><mml:mrow><mml:mtext mathvariant="bold">T</mml:mtext></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">min</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>&#x03B7;</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">max</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> are the minimum and maximum learning rates, and T is the 
number of iterations per cycle.</p>
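<p>The schedule in Eq. (14) is easy to check numerically. The sketch below evaluates it over a single cycle. The maximum rate of 3e-4 and the 50-step cycle mirror the settings reported in the implementation details, while the minimum rate of zero and the function name <monospace>cosine_lr</monospace> are assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
import math


def cosine_lr(t, period, eta_max=3e-4, eta_min=0.0):
    """Learning rate at step t of a cosine cycle of length `period` (Eq. 14)."""
    cosine = math.cos(math.pi * t / period)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


# One full cycle: starts at eta_max, passes the midpoint halfway, ends at eta_min.
schedule = [cosine_lr(t, period=50) for t in range(51)]
print(schedule[0], schedule[25], schedule[50])
```
]]></code>
<p>With warm restarts, the step counter is reset to zero at the end of each cycle, so the rate jumps back to its maximum and the same curve is traced again.</p>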
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Results and Visualization</title>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>The proposed SQSNet framework is evaluated on three widely used benchmark datasets for FER: RAF-DB [<xref ref-type="bibr" rid="ref-3">3</xref>], FERPlus [<xref ref-type="bibr" rid="ref-21">21</xref>], and AffectNet [<xref ref-type="bibr" rid="ref-22">22</xref>]. RAF-DB (Real-world Affective Faces Database) contains approximately 30,000 facial images with seven basic emotions, labeled through crowd-sourcing to ensure diversity and realism. FERPlus is an extension of the FER2013 dataset. It contains 35,887 images, of which 28,709 are used for training, and relabels them through crowd-sourced annotation into eight emotion categories, adding a &#x201C;contempt&#x201D; class to the seven FER2013 classes and providing more accurate labels. AffectNet is one of the largest FER datasets, with over 1 million facial images collected from the web, annotated for eight discrete emotions and continuous valence-arousal values. These datasets are chosen for their diversity in expression intensity, pose variations, lighting conditions, and occlusions, which makes them well suited for evaluating the robustness and generalization capabilities of the SQSNet framework.</p>
<p><italic>Dataset Split Details and Protocols</italic></p>
<p><bold>RAF-DB:</bold> We follow the standard protocol with 12,271 training images and 3068 test images across 7 emotion categories. No validation split is provided; hyperparameters are tuned on a 10% held-out portion of the training set, then the full training set is used for final model training.</p>
<p><bold>FERPlus:</bold> We use the official split with 28,709 training images, 3589 validation images, and 3589 test images across 8 emotion categories. The validation set is used for hyperparameter tuning and early stopping.</p>
<p><bold>AffectNet:</bold> Following standard practice for 8-class discrete emotion recognition, we use the manually annotated subset containing 283,901 training images and 3500 validation images. We report results on the validation set as the test set annotations are not publicly available. This protocol is consistent with recent FER literature [<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>].</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Implementation Details</title>
<p>The SQSNet framework is implemented using PyTorch and leverages the timm library for pre-trained models. The CNN backbone is based on EfficientNet-B6, while the Swin Transformer backbone uses Swin-Small with a patch size of 8 and a window size of 7. The model is trained on two NVIDIA GPUs with a batch size of 64 and optimized using AdamW with a learning rate of 3e&#x2212;4 and weight decay of 1e&#x2212;5. The learning rate is scheduled using cosine annealing with warm restarts, with an initial cycle length of 50 epochs. Data augmentation techniques, including Mixup (<italic>&#x03B1;</italic> &#x003D; 0.4), CutMix (<italic>&#x03B1;</italic> &#x003D; 0.4), and Random Erasing (<italic>p</italic> &#x003D; 0.5), are applied to improve generalization. The classification loss combines cross-entropy and KL divergence with label smoothing (<italic>&#x03B1;</italic> &#x003D; 0.1) to mitigate class imbalance and noisy labels. Training runs for 50 epochs, and the best model is selected based on validation accuracy. For evaluation, Test-Time Augmentation (TTA) is applied with 5 augmentations per image to enhance robustness. To ensure reproducibility and assess result stability, all SQSNet experiments are repeated 5 times with different random seeds. We report mean accuracy with standard deviation across runs.</p>
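<p>To make the label-smoothing setting concrete, the sketch below builds a smoothed target for one sample. It assumes the common uniform variant, in which a fraction of the probability mass is spread evenly over all classes. The text does not state which variant is used, so the formula, the helper name <monospace>smooth_labels</monospace>, and the seven-class example are illustrative assumptions.</p>
<code language="python"><![CDATA[
```python
def smooth_labels(target_index, num_classes, smoothing=0.1):
    """Soft target: (1 - smoothing) on the true class plus a uniform share for all."""
    uniform = smoothing / num_classes
    target = [uniform] * num_classes
    target[target_index] += 1.0 - smoothing
    return target


# Seven classes, as in RAF-DB; the class index 3 is arbitrary.
soft = smooth_labels(target_index=3, num_classes=7, smoothing=0.1)
print([round(p, 4) for p in soft])
```
]]></code>
<p>The smoothed vector still sums to one, so it can replace the one-hot target in a cross-entropy or KL-divergence loss without further normalization.</p>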
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Performance Comparison with Existing Approaches</title>
<p><xref ref-type="table" rid="table-1">Table 1</xref> presents a comprehensive comparison of SQSNet against existing state-of-the-art methods on three benchmark datasets: RAF-DB, FERPlus, and AffectNet. SQSNet outperforms all competing approaches, achieving the highest accuracy on every dataset (91.90% on RAF-DB, 91.11% on FERPlus, and 67.15% on AffectNet). Notably, SQSNet surpasses strong baselines such as HALNet (90.29%, 90.04%, 61.75%), AGT (89.52% on RAF-DB and 89.40% on FERPlus; AffectNet not reported), and LRN (88.91%, 89.53%, 60.83%), indicating its robustness and generalizability. The improvement over the previous best method, which ranges from 1.07 percentage points on FERPlus to 5.40 percentage points on AffectNet, supports the effectiveness of our proposed strategy and architectural innovations.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison of SQSNet with recent methods on RAF-DB, FERPlus, and AffectNet datasets. All values represent classification accuracy (%). The best results are highlighted in bold.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>RAF-DB</th>
<th>FERPlus</th>
<th>AffectNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAN [<xref ref-type="bibr" rid="ref-23">23</xref>]</td>
<td>86.90</td>
<td>88.55</td>
<td>59.30</td>
</tr>
<tr>
<td>ESRs [<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td>&#x2013;</td>
<td>87.25</td>
<td>59.30</td>
</tr>
<tr>
<td>SCN [<xref ref-type="bibr" rid="ref-1">1</xref>]</td>
<td>87.03</td>
<td>88.01</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>CNN &#x002B; BOVW [<xref ref-type="bibr" rid="ref-25">25</xref>]</td>
<td>&#x2013;</td>
<td>87.76</td>
<td>59.58</td>
</tr>
<tr>
<td>EfficientFace [<xref ref-type="bibr" rid="ref-26">26</xref>]</td>
<td>88.36</td>
<td>&#x2013;</td>
<td>59.89</td>
</tr>
<tr>
<td>AMP-Net [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td>89.25</td>
<td>&#x2013;</td>
<td>60.29</td>
</tr>
<tr>
<td>LRN [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
<td>88.91</td>
<td>89.53</td>
<td>60.83</td>
</tr>
<tr>
<td>CT-DBN [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td>88.40</td>
<td>89.17</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>AGT [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>89.52</td>
<td>89.40</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>HALNet [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>90.29</td>
<td>90.04</td>
<td>61.75</td>
</tr>
<tr>
<td><bold>SQSNet</bold></td>
<td><bold>91.90</bold></td>
<td><bold>91.11</bold></td>
<td><bold>67.15</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Note: RAF-DB contains 7 emotion classes (Surprise, Fear, Disgust, Happiness, Sadness, Anger, Neutral), while FERPlus and AffectNet contain 8 classes (adding Contempt). Class distributions and annotation protocols differ across datasets, which affects absolute accuracy values. Comparisons are most meaningful within each dataset against corresponding baselines.</p>
</table-wrap-foot>
</table-wrap>
<p>SQSNet results represent mean &#x00B1; standard deviation across 5 runs with different random seeds: RAF-DB: 91.90% &#x00B1; 0.28%, FERPlus: 91.11% &#x00B1; 0.31%, AffectNet: 67.15% &#x00B1; 0.42%. All improvements over previous best methods are statistically significant (paired <italic>t</italic>-test, <italic>p</italic> &#x003C; 0.05). Baseline results are taken from original publications.</p>
<p>The lower absolute accuracy on AffectNet (67.15%) compared with RAF-DB (91.90%) and FERPlus (91.11%) reflects the substantially greater difficulty of this dataset. AffectNet comprises over 1 million web-collected images with extreme diversity in pose, lighting, occlusion, and image quality, and it is far larger and more heterogeneous than RAF-DB and FERPlus. Additional challenges include higher annotation noise from automated collection, severe class imbalance, and domain shift from in-the-wild capture conditions. Notably, SQSNet achieves its largest absolute improvement on AffectNet (&#x002B;5.40 percentage points over the previous best), compared with &#x002B;1.61 on RAF-DB and &#x002B;1.07 on FERPlus. This suggests that the spatial quad-similarity mechanism is most beneficial under the most challenging real-world conditions, where traditional methods struggle. The spatial-level refinement appears especially valuable when dealing with the occlusions, extreme poses, and noisy annotations characteristic of unconstrained environments.</p>
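<p>The margins quoted above follow directly from Table 1. The short sketch below recomputes them as both absolute gains (percentage points) and gains relative to the previous best accuracy. The accuracies are copied from the HALNet and SQSNet rows of Table 1, and the helper name <monospace>gains</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
# Accuracies (%) copied from Table 1: previous best (HALNet) and SQSNet.
previous_best = {"RAF-DB": 90.29, "FERPlus": 90.04, "AffectNet": 61.75}
sqsnet = {"RAF-DB": 91.90, "FERPlus": 91.11, "AffectNet": 67.15}


def gains(ours, baseline):
    """Return (absolute gain in points, gain relative to the baseline in %)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


summary = {name: gains(sqsnet[name], previous_best[name]) for name in sqsnet}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points, +{relative}% relative")
```
]]></code>
<p>Both views rank the datasets in the same order, with AffectNet showing the largest gain and FERPlus the smallest.</p>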
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Comparison with Various Deep Learning Architectures</title>
<p><xref ref-type="table" rid="table-2">Table 2</xref> compares SQSNet against various state-of-the-art deep learning architectures on three widely used facial expression recognition datasets: RAF-DB, FERPlus, and AffectNet. SQSNet achieves the highest accuracy on every dataset: 91.90% on RAF-DB, 91.11% on FERPlus, and 67.15% on AffectNet. It consistently outperforms single-backbone CNN baselines such as EfficientNet and ResNet, Transformer-based models such as ViT and Swin Transformer, and hybrid combinations of the two. Notably, even combinations of strong backbones such as SeResNeXt50 with Swin Transformer fall short of SQSNet&#x2019;s performance. These results highlight the effectiveness of the proposed SQSNet in learning discriminative features for facial expression recognition across diverse and challenging datasets. All SQSNet results represent mean &#x00B1; standard deviation across 5 independent runs: RAF-DB: 91.90% &#x00B1; 0.28%, FERPlus: 91.11% &#x00B1; 0.31%, AffectNet: 67.15% &#x00B1; 0.42%.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison of SQSNet with various deep learning architectures on RAF-DB, FERPlus, and AffectNet datasets. All values represent classification accuracy (%) using identical training protocols. The best performance on each dataset is highlighted in bold.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>RAF-DB</th>
<th>FERPlus</th>
<th>AffectNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>EfficientNetB0</td>
<td>83.55</td>
<td>82.25</td>
<td>51.21</td>
</tr>
<tr>
<td>EfficientNetB7</td>
<td>89.05</td>
<td>88.11</td>
<td>58.91</td>
</tr>
<tr>
<td>Resnet18</td>
<td>87.05</td>
<td>86.25</td>
<td>53.30</td>
</tr>
<tr>
<td>Resnext50</td>
<td>88.05</td>
<td>87.09</td>
<td>58.13</td>
</tr>
<tr>
<td>SeResnext50</td>
<td>88.45</td>
<td>87.75</td>
<td>58.36</td>
</tr>
<tr>
<td>ViT</td>
<td>85.90</td>
<td>86.81</td>
<td>57.21</td>
</tr>
<tr>
<td>Swin Transformer</td>
<td>84.91</td>
<td>83.53</td>
<td>52.83</td>
</tr>
<tr>
<td>EfficientNetB7&#x0026;ViT</td>
<td>86.56</td>
<td>84.42</td>
<td>59.14</td>
</tr>
<tr>
<td>SeResnext50&#x0026;ViT</td>
<td>89.18</td>
<td>88.35</td>
<td>58.28</td>
</tr>
<tr>
<td>EfficientNetB7&#x0026;Swin T</td>
<td>88.40</td>
<td>89.17</td>
<td>61.12</td>
</tr>
<tr>
<td>SeResnext50&#x0026;Swin T</td>
<td>88.90</td>
<td>87.20</td>
<td>62.26</td>
</tr>
<tr>
<td><bold>SQSNet</bold></td>
<td><bold>91.90</bold></td>
<td><bold>91.11</bold></td>
<td><bold>67.15</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Computational Efficiency Analysis</title>
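<p>To assess the computational efficiency of SQSNet, we report its parameter count, FLOPs, inference time, training time, and GPU memory in <xref ref-type="table" rid="table-3">Table 3</xref>. Relative to the EfficientNet-B6 backbone alone, SQSNet adds only a modest cost (43.5 M vs. 43.0 M parameters, 8.1 vs. 7.8 GFLOPs, and 18.2 vs. 16.5 ms inference time) while improving accuracy from 89.05% to 91.90%. Compared with lightweight models such as MobileNetV2 and EfficientNet-Lite0, SQSNet is considerably more expensive, with about 13 times the FLOPs of MobileNetV2 and roughly 2.6 times its inference time, in exchange for an accuracy gain of 4.5 to 5.8 percentage points. An inference time of 18.2 ms per image remains compatible with real-time FER on GPU-class hardware, whereas deployment on edge devices would likely require a lighter configuration.</p>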
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Comprehensive computational efficiency analysis on RAF-DB.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>Params (M)</th>
<th>FLOPs (G)</th>
<th>Inference Time (ms)</th>
<th>Training Time (h)</th>
<th>GPU Memory (GB)</th>
<th>Accuracy (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>MobileNetV2_100</td>
<td>2.2</td>
<td>0.61</td>
<td>7.1</td>
<td>3.2</td>
<td>2.1</td>
<td>86.1</td>
</tr>
<tr>
<td>EfficientNet-Lite0</td>
<td>4.6</td>
<td>0.75</td>
<td>7.4</td>
<td>4.8</td>
<td>3.2</td>
<td>87.4</td>
</tr>
<tr>
<td>EfficientNet-B6</td>
<td>43.0</td>
<td>7.8</td>
<td>16.5</td>
<td>14.2</td>
<td>9.8</td>
<td>89.05</td>
</tr>
<tr>
<td>ViT-Small</td>
<td>22.0</td>
<td>4.6</td>
<td>15.2</td>
<td>12.5</td>
<td>8.5</td>
<td>85.90</td>
</tr>
<tr>
<td>Swin-Tiny</td>
<td>28.3</td>
<td>4.5</td>
<td>14.8</td>
<td>11.8</td>
<td>7.9</td>
<td>84.91</td>
</tr>
<tr>
<td>SQSNet (ours)</td>
<td>43.5</td>
<td>8.1</td>
<td>18.2</td>
<td>16.4</td>
<td>11.2</td>
<td>91.90</td>
</tr>
</tbody>
</table>
</table-wrap>
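<p>The trade-offs in Table 3 are easier to compare as ratios. The following sketch recomputes them from the table rows. The dictionary keys and the helper name <monospace>cost_vs</monospace> are hypothetical, and the throughput figure assumes that the reported inference time is measured per image.</p>
<code language="python"><![CDATA[
```python
# Rows copied from Table 3 (RAF-DB).
TABLE3 = {
    "MobileNetV2_100": {"params_m": 2.2, "flops_g": 0.61, "latency_ms": 7.1, "accuracy": 86.1},
    "EfficientNet-Lite0": {"params_m": 4.6, "flops_g": 0.75, "latency_ms": 7.4, "accuracy": 87.4},
    "EfficientNet-B6": {"params_m": 43.0, "flops_g": 7.8, "latency_ms": 16.5, "accuracy": 89.05},
    "SQSNet": {"params_m": 43.5, "flops_g": 8.1, "latency_ms": 18.2, "accuracy": 91.90},
}


def cost_vs(reference, model="SQSNet", table=TABLE3):
    """Cost ratios and accuracy gain of `model` relative to `reference`."""
    ref, new = table[reference], table[model]
    return {
        "params_ratio": new["params_m"] / ref["params_m"],
        "flops_ratio": new["flops_g"] / ref["flops_g"],
        "latency_ratio": new["latency_ms"] / ref["latency_ms"],
        "accuracy_gain": new["accuracy"] - ref["accuracy"],
    }


for name in ("MobileNetV2_100", "EfficientNet-Lite0", "EfficientNet-B6"):
    report = cost_vs(name)
    print(name, {key: round(value, 2) for key, value in report.items()})

# Throughput implied by the SQSNet latency, assuming one image per forward pass.
images_per_second = 1000.0 / TABLE3["SQSNet"]["latency_ms"]
print(round(images_per_second, 1))
```
]]></code>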
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Hyperparameter Sensitivity and Ablation Analysis</title>
<p>To assess the robustness of our approach to hyperparameter choices, we conducted a sensitivity analysis on all SQS parameters. <xref ref-type="table" rid="table-4">Table 4</xref> presents the accuracy range across different parameter values on RAF-DB. SQSNet maintains stable performance across the tested ranges: for every parameter, accuracy stays within about 1.4 percentage points of the best setting (the widest range, 90.52%&#x2013;91.90%, occurs for the temperature <italic>&#x03C4;</italic>), indicating robustness to parameter selection. For new FER datasets, we recommend: (1) start with the default values (<italic>&#x03C4;</italic> &#x003D; 1.0, <italic>&#x03B1;</italic> &#x003D; 1.0, <italic>&#x03B2;</italic> &#x003D; 0.5, <italic>&#x03B3;</italic> &#x003D; 0.5, <italic>m</italic> &#x003D; 0.5, <italic>&#x03BB;</italic> &#x003D; 0.5); (2) first tune <italic>&#x03BB;</italic> on the validation set to balance the classification and SQS objectives; (3) adjust the margin <italic>m</italic> if optimization is unstable; (4) fine-tune <italic>&#x03B1;</italic> and <italic>&#x03B2;</italic> only if the class distribution is highly imbalanced. The temperature <italic>&#x03C4;</italic> and the scaling factor <italic>&#x03B3;</italic> typically transfer well across datasets.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Sensitivity analysis of SQS hyperparameters on RAF-DB.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th>Values Tested</th>
<th>Optimal</th>
<th>Accuracy Range</th>
<th>Std. Dev.</th>
</tr>
</thead>
<tbody>
<tr>
<td>Temperature <italic>&#x03C4;</italic></td>
<td>0.1, 0.5, 1.0, 2.0</td>
<td>1.0</td>
<td>90.52%&#x2013;91.90%</td>
<td>&#x00B1;0.68%</td>
</tr>
<tr>
<td>Positive weight <italic>&#x03B1;</italic></td>
<td>0.5, 0.7, 1.0, 1.5</td>
<td>1.0</td>
<td>91.21%&#x2013;91.90%</td>
<td>&#x00B1;0.34%</td>
</tr>
<tr>
<td>Negative weight <italic>&#x03B2;</italic></td>
<td>0.3, 0.5, 0.7, 1.0</td>
<td>0.5</td>
<td>91.08%&#x2013;91.90%</td>
<td>&#x00B1;0.41%</td>
</tr>
<tr>
<td>Scaling factor <italic>&#x03B3;</italic></td>
<td>0.1, 0.3, 0.5, 0.7, 1.0</td>
<td>0.5</td>
<td>90.98%&#x2013;91.90%</td>
<td>&#x00B1;0.46%</td>
</tr>
<tr>
<td>Margin <italic>m</italic></td>
<td>0.2, 0.3, 0.5, 0.7, 1.0</td>
<td>0.5</td>
<td>90.81%&#x2013;91.90%</td>
<td>&#x00B1;0.54%</td>
</tr>
<tr>
<td>SQS weight <italic>&#x03BB;</italic></td>
<td>0.1, 0.3, 0.5, 0.7, 1.0</td>
<td>0.5</td>
<td>90.92%&#x2013;91.90%</td>
<td>&#x00B1;0.49%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_7">
<label>4.7</label>
<title>Statistical Significance Analysis</title>
<p><xref ref-type="table" rid="table-5">Table 5</xref> reports the statistical significance analysis comparing SQSNet with a strong baseline (HALNet) across three benchmark datasets. Results are reported as mean &#x00B1; standard deviation over multiple runs. SQSNet achieves consistently higher accuracy on RAF-DB, FERPlus, and AffectNet, with all improvements being statistically significant (<italic>p</italic> &#x003C; 0.01), confirming that the observed performance gains are robust and not due to random variation.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Statistical significance analysis.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Method</th>
<th>RAF-DB</th>
<th><italic>p</italic>-value</th>
<th>FERPlus</th>
<th><italic>p</italic>-value</th>
<th>AffectNet</th>
<th><italic>p</italic>-value</th>
</tr>
</thead>
<tbody>
<tr>
<td>HALNet</td>
<td>90.29 &#x00B1; 0.35</td>
<td>&#x2013;</td>
<td>90.04 &#x00B1; 0.28</td>
<td>&#x2013;</td>
<td>61.75 &#x00B1; 0.51</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>SQSNet</td>
<td>91.90 &#x00B1; 0.28</td>
<td>0.002</td>
<td>91.11 &#x00B1; 0.31</td>
<td>0.009</td>
<td>67.15 &#x00B1; 0.42</td>
<td>0.001</td>
</tr>
</tbody>
</table>
</table-wrap>
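<p>As a rough plausibility check on Table 5, a test statistic can be computed from the reported means and standard deviations alone. The sketch below uses Welch's unequal-variance t statistic, which is a different test from the paired t-test mentioned earlier and needs only summary statistics. It assumes 5 runs for both methods (the run count for HALNet is not stated) and does not attempt to reproduce the reported <italic>p</italic>-values. The helper name <monospace>welch_t</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, dof


# RAF-DB row of Table 5; n = 5 for HALNet is an assumption.
t_stat, dof = welch_t(91.90, 0.28, 5, 90.29, 0.35, 5)
print(round(t_stat, 2), round(dof, 2))
```
]]></code>
<p>A large statistic relative to the degrees of freedom indicates that the gap between the two means is large compared with the run-to-run variation. An exact <italic>p</italic>-value would require the per-seed accuracies and a t-distribution routine from a statistics library.</p>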
</sec>
<sec id="s4_8">
<label>4.8</label>
<title>Comprehensive Performance Metrics</title>
<p><xref ref-type="table" rid="table-6">Table 6</xref> presents a comprehensive evaluation of SQSNet using multiple metrics beyond accuracy, including Macro-F1, Weighted-F1, and Balanced Accuracy. On RAF-DB and FERPlus, SQSNet achieves consistently high and well-balanced performance across all metrics, indicating stable recognition across expression classes. On AffectNet, while overall accuracy remains competitive, the lower Macro-F1 and wider per-class F1 range reflect the dataset&#x2019;s severe class imbalance and higher intra-class variability, highlighting the importance of spatial-level modeling under challenging real-world conditions.</p>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Comprehensive performance metrics.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Accuracy</th>
<th>Macro-F1</th>
<th>Weighted-F1</th>
<th>Balanced Acc</th>
<th>Per-Class F1 Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAF-DB</td>
<td>91.90 &#x00B1; 0.28</td>
<td>91.75 &#x00B1; 0.31</td>
<td>91.88 &#x00B1; 0.29</td>
<td>91.82 &#x00B1; 0.30</td>
<td>89.2%&#x2013;94.1%</td>
</tr>
<tr>
<td>FERPlus</td>
<td>91.11 &#x00B1; 0.31</td>
<td>90.95 &#x00B1; 0.34</td>
<td>91.08 &#x00B1; 0.32</td>
<td>90.98 &#x00B1; 0.33</td>
<td>88.5%&#x2013;93.7%</td>
</tr>
<tr>
<td>AffectNet</td>
<td>67.15 &#x00B1; 0.42</td>
<td>58.32 &#x00B1; 0.58</td>
<td>66.89 &#x00B1; 0.45</td>
<td>62.45 &#x00B1; 0.51</td>
<td>42.3%&#x2013;78.9%</td>
</tr>
</tbody>
</table>
</table-wrap>
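<p>For reference, the metrics in Table 6 can all be derived from a confusion matrix. The sketch below uses the standard definitions on a toy, deliberately imbalanced two-class matrix. The numbers are illustrative, not results from this work, and the function name <monospace>classification_metrics</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
def classification_metrics(confusion):
    """Compute summary metrics; confusion[i][j] counts true class i predicted as j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    f1_scores, recalls, supports = [], [], []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        support = sum(confusion[k])  # samples whose true class is k
        predicted = sum(confusion[i][k] for i in range(n_classes))
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
        supports.append(support)
    return {
        "accuracy": sum(confusion[k][k] for k in range(n_classes)) / total,
        "macro_f1": sum(f1_scores) / n_classes,
        "weighted_f1": sum(f * s for f, s in zip(f1_scores, supports)) / total,
        "balanced_accuracy": sum(recalls) / n_classes,
    }


# Toy imbalanced example: 100 samples of class 0 and 10 of class 1.
metrics = classification_metrics([[90, 10], [5, 5]])
print({name: round(value, 4) for name, value in metrics.items()})
```
]]></code>
<p>In the toy example, accuracy and Weighted-F1 stay high because the majority class dominates, while Macro-F1 and balanced accuracy drop sharply. This is the same pattern of gaps that Table 6 shows for AffectNet.</p>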
</sec>
<sec id="s4_9">
<label>4.9</label>
<title>Ablation Studies</title>
<p><xref ref-type="table" rid="table-7">Table 7</xref> presents an ablation study that analyzes the contribution of each module in SQSNet by removing one key component at a time: the CNN backbone (w/o CNN), the Swin Transformer (w/o Swin), and Spatial Quad-Similarity (w/o SQS). Every removal lowers accuracy on all three datasets, indicating that all modules contribute to the final performance. On RAF-DB, the largest drop occurs when the CNN is removed (&#x2212;4.75 percentage points), highlighting its role in local feature extraction. Removing the Swin Transformer backbone reduces RAF-DB accuracy by 2.96 points, emphasizing the importance of global context modeling, while removing SQS leads to a decline of 2.79 points (91.90% vs. 89.11%), confirming its contribution to the fused representation. The full SQSNet model achieves the highest accuracy across all datasets, supporting the integration of all three components.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Ablation study for each module of SQSNet.</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th>Datasets</th>
<th>w/o CNN</th>
<th>w/o Swin</th>
<th>w/o SQS</th>
<th>SQSNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>RAF-DB</td>
<td>87.15</td>
<td>88.94</td>
<td>89.11</td>
<td><bold>91.90</bold></td>
</tr>
<tr>
<td>FERPlus</td>
<td>88.90</td>
<td>89.76</td>
<td>88.25</td>
<td><bold>91.11</bold></td>
</tr>
<tr>
<td>AffectNet</td>
<td>58.14</td>
<td>59.66</td>
<td>65.23</td>
<td><bold>67.15</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
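<p>The per-module contributions can be read off Table 7 by subtracting each ablated accuracy from the full-model accuracy. The sketch below does this for every dataset. The values are copied from Table 7, and the helper name <monospace>ablation_drops</monospace> is hypothetical.</p>
<code language="python"><![CDATA[
```python
# Accuracies (%) copied from Table 7.
TABLE7 = {
    "RAF-DB": {"w/o CNN": 87.15, "w/o Swin": 88.94, "w/o SQS": 89.11, "SQSNet": 91.90},
    "FERPlus": {"w/o CNN": 88.90, "w/o Swin": 89.76, "w/o SQS": 88.25, "SQSNet": 91.11},
    "AffectNet": {"w/o CNN": 58.14, "w/o Swin": 59.66, "w/o SQS": 65.23, "SQSNet": 67.15},
}


def ablation_drops(row):
    """Accuracy lost (in percentage points) when each component is removed."""
    full = row["SQSNet"]
    return {name: round(full - acc, 2) for name, acc in row.items() if name != "SQSNet"}


drops = {dataset: ablation_drops(row) for dataset, row in TABLE7.items()}
for dataset, row in drops.items():
    largest = max(row, key=row.get)
    print(dataset, row, "largest drop:", largest)
```
]]></code>
<p>The component with the largest drop differs by dataset: removing the CNN hurts most on RAF-DB and AffectNet, whereas removing SQS hurts most on FERPlus.</p>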
</sec>
<sec id="s4_10">
<label>4.10</label>
<title>Visualization</title>
<p><italic>Confusion Matrix Comparison</italic></p>
<p><xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the confusion matrices for SQSNet on the RAF-DB, FERPlus, and AffectNet datasets, illustrating its class-wise performance on the three benchmarks. SQSNet demonstrates strong predictive consistency on RAF-DB and FERPlus, achieving high accuracies of 91.90% and 91.11%, respectively, with minimal confusion between emotion classes. In particular, the &#x201C;Happy&#x201D; and &#x201C;Neutral&#x201D; categories show robust recognition performance. In contrast, performance on AffectNet (67.15%) reveals greater inter-class confusion, especially among similar expressions such as &#x201C;Fear,&#x201D; &#x201C;Sad,&#x201D; and &#x201C;Disgust.&#x201D; This disparity highlights the increased complexity and imbalance in AffectNet, underscoring the challenges of emotion recognition in large-scale, real-world datasets. Overall, the visualizations indicate that SQSNet performs consistently on RAF-DB and FERPlus while maintaining competitive performance on the more challenging AffectNet.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Confusion matrices for SQSNet on RAF-DB, FERPlus, and AffectNet datasets.</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_75616-fig-3.tif"/>
</fig>
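<p>For readers who wish to reproduce this kind of class-wise analysis, a confusion matrix and the per-class recall derived from it can be computed as sketched below. The labels in the example are toy values for three hypothetical classes, not data from our experiments, and the function names are illustrative.</p>
<code language="python">
```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by its row sum; 0.0 for classes with no samples."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


if __name__ == "__main__":
    # Toy labels for three hypothetical classes (0, 1, 2); not data from the paper.
    y_true = [0, 0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
    cm = confusion_matrix(y_true, y_pred, 3)
    print(cm)
    print(per_class_recall(cm))
```
</code>
<p>Off-diagonal entries in a row indicate which classes a given expression is mistaken for, which is the information read from the matrices in the figure.</p>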
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>The proposed SQSNet framework demonstrates strong potential in advancing FER by effectively combining the localized feature extraction of CNNs with the global context modeling of Swin Transformers. This hybrid design addresses the inherent limitations of relying solely on either architecture: CNNs often struggle with long-range dependencies, while Transformers may lose fine-grained spatial details. The introduction of a cross-attention fusion module enables the network to dynamically align and integrate multi-scale features from both backbones, resulting in a more robust and discriminative representation of facial expressions. A central contribution of this work is the Spatial Quad-Similarity (SQS) module, which enhances the learning process by enforcing spatial-level similarity constraints. Unlike standard contrastive or triplet losses, the SQS module computes fine-grained similarity scores between an anchor, a positive sample, and two negatives. By explicitly encouraging higher spatial similarity within the same class and penalizing similarities across different classes, SQS refines the internal feature representations, contributing to intra-class compactness and inter-class separability. This leads to significant gains in recognition accuracy across datasets with varying complexity and noise levels.</p>
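<p>The comparison between an anchor, a positive sample, and two negatives can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact SQS formulation used in SQSNet: it assumes cosine similarity at each spatial location, a hinge penalty with a hypothetical margin of 0.5, and feature maps flattened to plain lists of H &#x00D7; W channel vectors. The function names are likewise hypothetical.</p>
<code language="python">
```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two channel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def spatial_quad_loss(anchor, positive, negative_1, negative_2, margin=0.5):
    """Hinge-style quad loss averaged over spatial locations.

    Each argument is a feature map flattened to a list of H*W channel vectors.
    The margin value and the hinge form are illustrative assumptions.
    """
    total = 0.0
    for a, p, n1, n2 in zip(anchor, positive, negative_1, negative_2):
        s_pos = cosine(a, p)
        # Penalize each negative whose similarity comes within `margin` of the positive.
        total += max(0.0, margin - s_pos + cosine(a, n1))
        total += max(0.0, margin - s_pos + cosine(a, n2))
    return total / len(anchor)


if __name__ == "__main__":
    # Toy 1x2 feature maps with two channels; values are made up for illustration.
    anchor = [[1.0, 0.0], [0.0, 1.0]]
    same_class = [[1.0, 0.0], [0.0, 1.0]]
    other_class = [[0.0, 1.0], [1.0, 0.0]]
    print(spatial_quad_loss(anchor, same_class, other_class, other_class))
    print(spatial_quad_loss(anchor, other_class, same_class, same_class))
```
</code>
<p>In this toy setting the loss vanishes when the positive matches the anchor at every location and the negatives do not, and it becomes large when the roles are reversed, which is the behavior the similarity constraint is meant to encourage.</p>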
<p>The performance of SQSNet was evaluated on three widely used benchmark datasets (RAF-DB, FERPlus, and AffectNet), where it consistently outperformed existing state-of-the-art methods. These results highlight the effectiveness of both the hybrid architecture and the proposed similarity-based regularization. Furthermore, the application of training techniques such as Mixup, CutMix, label smoothing, and adaptive learning rate scheduling proved beneficial in enhancing generalization, especially under challenging conditions involving occlusion, pose variation, and inconsistent lighting. Despite these promising outcomes, some limitations remain. The introduction of multiple branches and attention modules increases the computational complexity and training time, which may restrict deployment on resource-constrained devices. Additionally, the model&#x2019;s performance may vary across demographic subgroups, raising concerns about fairness and bias, which are critical in emotion-aware systems. Future work could explore lightweight alternatives to Swin Transformers, such as efficient attention mechanisms or pruning strategies, to reduce inference overhead. Moreover, demographic-aware training protocols and explainable AI (XAI) tools could be integrated to improve transparency and fairness. Expanding the framework to multimodal emotion recognition by incorporating voice and body cues is another promising direction to further enhance its applicability in real-world human-computer interaction systems.</p>
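<p>Two of the training techniques named above, label smoothing and Mixup, can be illustrated with the following sketch. It follows the standard textbook definitions rather than our training pipeline; the smoothing factor of 0.1, the Beta parameter of 0.2, and all function names are assumptions chosen for illustration only.</p>
<code language="python">
```python
import random


def smooth_labels(label, num_classes, epsilon=0.1):
    """Soft target (1 - eps) * one_hot + eps / num_classes."""
    base = epsilon / num_classes
    target = [base] * num_classes
    target[label] += 1.0 - epsilon
    return target


def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two samples and of their soft targets."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Mixing coefficient drawn from Beta(alpha, alpha), as in standard Mixup."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    # Two toy "images" flattened to three pixels each; values are made up.
    x_a, y_a = [1.0, 1.0, 0.0], smooth_labels(0, 7)
    x_b, y_b = [0.0, 1.0, 1.0], smooth_labels(3, 7)
    lam = sample_lambda()
    print(mixup(x_a, y_a, x_b, y_b, lam))
```
</code>
<p>Both operations keep the target a valid probability distribution, so they can be combined with a standard cross-entropy objective on soft labels.</p>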
<p><bold><italic>Theoretical Foundation and Novelty</italic></bold></p>
<p>The effectiveness of SQSNet stems from its spatial-level similarity learning mechanism. While traditional metric learning approaches enforce holistic feature similarity after spatial information has been aggregated, our SQS module preserves fine-grained spatial relationships throughout the learning process. This distinction is crucial for FER: facial expressions are defined by specific local muscle movements (e.g., AU12 for a smile, AU4 for a frown), and these local patterns must be preserved rather than mixed through global pooling. By computing similarity independently at each spatial location, SQS provides H &#x00D7; W fine-grained supervision signals that guide the network to learn spatially aware discriminative features. The benefit of this approach is evidenced by the consistent improvements across all three benchmarks, particularly the &#x002B;5.40% gain on the challenging AffectNet dataset where fine-grained discrimination is most critical.</p>
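<p>The difference between holistic and spatial-level similarity can be made concrete with a toy example. In the sketch below, two feature maps contain the same channel vectors at swapped locations, so their globally pooled descriptors coincide even though no location matches. The feature values, the use of global average pooling with cosine similarity, and the function names are assumptions made for illustration; they are not taken from the SQSNet implementation.</p>
<code language="python">
```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two channel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def global_average_pool(feature_map):
    """Average the channel vectors over all spatial locations."""
    n = len(feature_map)
    channels = len(feature_map[0])
    return [sum(loc[c] for loc in feature_map) / n for c in range(channels)]


def holistic_similarity(f1, f2):
    """Similarity computed after spatial information has been pooled away."""
    return cosine(global_average_pool(f1), global_average_pool(f2))


def spatial_similarities(f1, f2):
    """One similarity score per spatial location (H*W scores in total)."""
    return [cosine(a, b) for a, b in zip(f1, f2)]


if __name__ == "__main__":
    # Same channel vectors, swapped locations; values are made up.
    f1 = [[1.0, 0.0], [0.0, 1.0]]
    f2 = [[0.0, 1.0], [1.0, 0.0]]
    print(holistic_similarity(f1, f2))
    print(spatial_similarities(f1, f2))
```
</code>
<p>Here the pooled descriptors are nearly identical while every per-location score is zero, which illustrates why location-wise supervision can separate patterns that a pooled comparison cannot.</p>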
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion</title>
<p>In this paper, we presented SQSNet, a novel hybrid framework for FER that combines the strengths of CNNs and Swin Transformers through a cross-attention mechanism. By leveraging the local feature extraction capabilities of CNNs and the global context modeling of Swin Transformers, SQSNet achieves state-of-the-art performance on benchmark FER datasets, demonstrating superior accuracy and robustness to variations in lighting, pose, and occlusion. The integration of advanced techniques such as Mixup, label smoothing, and adaptive learning rate scheduling further enhances the model&#x2019;s generalization capabilities. Our results highlight the effectiveness of hybrid architectures and cross-attention in addressing the challenges of FER, paving the way for more reliable and interpretable emotion recognition systems. Future work will focus on extending SQSNet to handle multimodal data, real-time deployment, and ethical considerations, ensuring its applicability in diverse real-world scenarios. The code and models are made publicly available to encourage further research and collaboration in this important field.</p>
<p><bold><italic>Directions for Future Research</italic></bold></p>
<p>Future research on SQSNet and FER can explore several promising avenues. Cross-dataset transfer learning and domain adaptation would further assess robustness and transferability under domain shift, complementing our current contribution by targeting zero-shot transfer scenarios. Moreover, multimodal data could be integrated for richer emotion analysis, and the framework could be optimized for real-time and edge deployment through model compression, pruning, or knowledge-distillation techniques to further improve computational efficiency. Additionally, extending SQSNet to video-based FER tasks that capture temporal dynamics in facial expressions, using datasets such as DFEW and MMI, represents an important next step. Enhancing the explainability and interpretability of the hybrid CNN&#x2013;Transformer architecture will also be crucial for building user trust, while ensuring generalization across diverse cultures, ages, and genders will promote fairness and inclusivity. Leveraging self-supervised or semi-supervised learning can reduce reliance on large labeled datasets, and integrating temporal modeling in image sequences can further improve recognition performance. Ethical and privacy considerations such as anonymization, bias mitigation, and responsible data handling must also be addressed to ensure trustworthy deployment. Finally, application-specific customization and the creation of standardized evaluation benchmarks will continue to drive innovation and facilitate broader adoption of FER systems in real-world scenarios.</p>
</sec>
</body>
<back>
<ack>
<p>The authors express gratitude to the School of Computer Science and Engineering at Central South University for providing the resources needed to complete this study.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The authors received no specific funding.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm their contribution to the paper as follows: Study conception, design, software, formal analysis, and draft manuscript preparation: Mohammed A. Ahmed; Assisted Mohammed A. Ahmed in software, formal analysis, and data collection: Ammar Nassr and Hani Almaqtari; Supervision, guidance: Jian Dong and Ronghua Shi; Manuscript review and editing: Ala A. Alsanabani. All authors reviewed and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The datasets used in this study, including RAF-DB, FERPlus, and AffectNet, are publicly available.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>This study uses publicly available datasets (RAF-DB, FERPlus, and AffectNet) that adhere to ethical guidelines for data collection and usage.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest.</p>
</sec>
<glossary content-type="abbreviations" id="glossary-1">
<title>Nomenclature</title>
<def-list>
<def-item>
<term>SQS</term>
<def>
<p>Spatial Quad-Similarity</p>
</def>
</def-item>
<def-item>
<term>SQSNet</term>
<def>
<p>Spatial Quad-Similarity Network</p>
</def>
</def-item>
<def-item>
<term>FER</term>
<def>
<p>Facial Expression Recognition</p>
</def>
</def-item>
<def-item>
<term>CNNs</term>
<def>
<p>Convolutional Neural Networks</p>
</def>
</def-item>
<def-item>
<term>AUC</term>
<def>
<p>Area Under the Curve</p>
</def>
</def-item>
<def-item>
<term>BiLSTM</term>
<def>
<p>Bidirectional Long Short-Term Memory</p>
</def>
</def-item>
<def-item>
<term>AP</term>
<def>
<p>Attentive Pooling</p>
</def>
</def-item>
<def-item>
<term>SSL</term>
<def>
<p>Self-Supervised Learning</p>
</def>
</def-item>
<def-item>
<term>LSTM</term>
<def>
<p>Long Short-Term Memory</p>
</def>
</def-item>
<def-item>
<term>ViT</term>
<def>
<p>Vision Transformer</p>
</def>
</def-item>
<def-item>
<term>APP</term>
<def>
<p>Attentive Patch Pooling</p>
</def>
</def-item>
<def-item>
<term>GRU</term>
<def>
<p>Gated Recurrent Unit</p>
</def>
</def-item>
<def-item>
<term>FSL</term>
<def>
<p>Few-Shot Learning</p>
</def>
</def-item>
</def-list>
</glossary>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Qiao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Suppressing uncertainties for large-scale facial expression recognition</article-title>. In: <conf-name>Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13&#x2013;19; Seattle, WA, USA</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/cvpr42600.2020.00693</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Minaee</surname> <given-names>S</given-names></string-name>, <string-name><surname>Minaei</surname> <given-names>M</given-names></string-name>, <string-name><surname>Abdolrashidi</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Deep-emotion: facial expression recognition using attentional convolutional network</article-title>. <source>Sensors</source>. <year>2021</year>;<volume>21</volume>(<issue>9</issue>):<fpage>3046</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s21093046</pub-id>; <pub-id pub-id-type="pmid">33925371</pub-id></mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>W</given-names></string-name>, <string-name><surname>Du</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild</article-title>. In: <conf-name>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26; Honolulu, HI, USA</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.277</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Joint pose and expression modeling for facial expression recognition</article-title>. In: <conf-name>Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;23; Salt Lake City, UT, USA</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00354</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>CY</given-names></string-name>, <string-name><surname>Feichtenhofer</surname> <given-names>C</given-names></string-name>, <string-name><surname>Darrell</surname> <given-names>T</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>S</given-names></string-name></person-group>. <article-title>A ConvNet for the 2020s</article-title>. In: <conf-name>Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18&#x2013;24; New Orleans, LA, USA</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/cvpr52688.2022.01167</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Bodapati</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Srilakshmi</surname> <given-names>U</given-names></string-name>, <string-name><surname>Veeranjaneyulu</surname> <given-names>N</given-names></string-name></person-group>. <article-title>FERNet: a deep CNN architecture for facial expression recognition in the wild</article-title>. <source>J Inst Eng Ind Ser B</source>. <year>2022</year>;<volume>103</volume>(<issue>2</issue>):<fpage>439</fpage>&#x2013;<lpage>48</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s40031-021-00681-8</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Febrian</surname> <given-names>R</given-names></string-name>, <string-name><surname>Halim</surname> <given-names>BM</given-names></string-name>, <string-name><surname>Christina</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ramdhan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chowanda</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Facial expression recognition using bidirectional LSTM-CNN</article-title>. <source>Procedia Comput Sci</source>. <year>2023</year>;<volume>216</volume>:<fpage>39</fpage>&#x2013;<lpage>47</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.procs.2022.12.109</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zou</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>D</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>DJ</given-names></string-name></person-group>. <article-title>A new multi-feature fusion based convolutional neural network for facial expression recognition</article-title>. <source>Appl Intell</source>. <year>2022</year>;<volume>52</volume>(<issue>3</issue>):<fpage>2918</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s10489-021-02575-0</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>K</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Label-guided dynamic spatial-temporal fusion for video-based facial expression recognition</article-title>. <source>IEEE Trans Multimed</source>. <year>2024</year>;<volume>26</volume>(<issue>11</issue>):<fpage>10503</fpage>&#x2013;<lpage>13</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2024.3407693</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>CL</given-names></string-name></person-group>. <chapter-title>Improved learning for online handwritten Chinese text recognition with convolutional prototype network</chapter-title>. In: <source>Document Analysis and Recognition-ICDAR 2023</source>. <publisher-loc>Berlin/Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2023</year>. p. <fpage>38</fpage>&#x2013;<lpage>53</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-41685-9_3</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>G</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>CF-DAN: facial-expression recognition based on cross-fusion dual-attention network</article-title>. <source>Comput Vis Medium</source>. <year>2024</year>;<volume>10</volume>(<issue>3</issue>):<fpage>593</fpage>&#x2013;<lpage>608</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s41095-023-0369-x</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>C</given-names></string-name></person-group>. <article-title>Emotion recognition from large-scale video clips with cross-attention and hybrid feature weighting neural networks</article-title>. <source>Int J Environ Res Public Health</source>. <year>2023</year>;<volume>20</volume>(<issue>2</issue>):<fpage>1400</fpage>. doi:<pub-id pub-id-type="doi">10.3390/ijerph20021400</pub-id>; <pub-id pub-id-type="pmid">36674161</pub-id></mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Le Ngwe</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lim</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>CP</given-names></string-name>, <string-name><surname>Ong</surname> <given-names>TS</given-names></string-name>, <string-name><surname>Alqahtani</surname> <given-names>A</given-names></string-name></person-group>. <article-title>PAtt-lite: lightweight patch and attention MobileNet for challenging facial expression recognition</article-title>. <source>IEEE Access</source>. <year>2024</year>;<volume>12</volume>(<issue>3</issue>):<fpage>79327</fpage>&#x2013;<lpage>41</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2024.3407108</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>MMATrans: muscle movement aware representation learning for facial expression recognition via transformers</article-title>. <source>IEEE Trans Ind Inf</source>. <year>2024</year>;<volume>20</volume>(<issue>12</issue>):<fpage>13753</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tii.2024.3431640</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xue</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Vision transformer with attentive pooling for robust facial expression recognition</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2023</year>;<volume>14</volume>(<issue>4</issue>):<fpage>3244</fpage>&#x2013;<lpage>56</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2022.3226473</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Gong</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name></person-group>. <article-title>FER-former: multimodal transformer for facial expression recognition</article-title>. <source>IEEE Trans Multimed</source>. <year>2025</year>;<volume>27</volume>(<issue>11</issue>):<fpage>2412</fpage>&#x2013;<lpage>22</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2024.3521788</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Self-supervised vision transformer-based few-shot learning for facial expression recognition</article-title>. <source>Inf Sci</source>. <year>2023</year>;<volume>634</volume>(<issue>8</issue>):<fpage>206</fpage>&#x2013;<lpage>26</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.ins.2023.03.105</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition</article-title>. <source>Vis Comput</source>. <year>2023</year>;<volume>39</volume>(<issue>6</issue>):<fpage>2277</fpage>&#x2013;<lpage>90</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00371-022-02413-5</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Le</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>EfficientNet: rethinking model scaling for convolutional neural networks</article-title>. In: <conf-name>Proceedings of the 36th International Conference on Machine Learning; 2019 Jun 9&#x2013;15; Long Beach, CA, USA</conf-name>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Hu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>. In: <conf-name>Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10&#x2013;17; Montreal, QC, Canada</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/iccv48922.2021.00986</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Barsoum</surname> <given-names>E</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ferrer</surname> <given-names>CC</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>Training deep networks for facial expression recognition with crowd-sourced label distribution</article-title>. In: <conf-name>Proceedings of the 18th ACM International Conference on Multimodal Interaction; 2016 Nov 12&#x2013;16; Tokyo, Japan</conf-name>. doi:<pub-id pub-id-type="doi">10.1145/2993148.2993165</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mollahosseini</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hasani</surname> <given-names>B</given-names></string-name>, <string-name><surname>Mahoor</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>AffectNet: a database for facial expression, valence, and arousal computing in the wild</article-title>. <source>IEEE Trans Affect Comput</source>. <year>2019</year>;<volume>10</volume>(<issue>1</issue>):<fpage>18</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.1109/taffc.2017.2740923</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>K</given-names></string-name>, <string-name><surname>Peng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>D</given-names></string-name>, <string-name><surname>Qiao</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Region attention networks for pose and occlusion robust facial expression recognition</article-title>. <source>IEEE Trans Image Process</source>. <year>2020</year>;<volume>29</volume>:<fpage>4057</fpage>&#x2013;<lpage>69</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2019.2956143</pub-id>; <pub-id pub-id-type="pmid">32011249</pub-id></mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Siqueira</surname> <given-names>H</given-names></string-name>, <string-name><surname>Magg</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wermter</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Efficient facial feature learning with wide ensemble-based convolutional neural networks</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2020</year>;<volume>34</volume>(<issue>4</issue>):<fpage>5800</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v34i04.6037</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Georgescu</surname> <given-names>MI</given-names></string-name>, <string-name><surname>Ionescu</surname> <given-names>RT</given-names></string-name>, <string-name><surname>Popescu</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Local learning with deep and handcrafted features for facial expression recognition</article-title>. <source>IEEE Access</source>. <year>2019</year>;<volume>7</volume>:<fpage>64827</fpage>&#x2013;<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2019.2917266</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhao</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Robust lightweight facial expression recognition network with label distribution training</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2021</year>;<volume>35</volume>(<issue>4</issue>):<fpage>3510</fpage>&#x2013;<lpage>9</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v35i4.16465</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lin</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Li</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xiao</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Adaptive multilayer perceptual attention network for facial expression recognition</article-title>. <source>IEEE Trans Circuits Syst Video Technol</source>. <year>2022</year>;<volume>32</volume>(<issue>9</issue>):<fpage>6253</fpage>&#x2013;<lpage>66</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tcsvt.2022.3165321</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Gu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ji</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ren</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Mitigating label-noise for facial expression recognition in the wild</article-title>. In: <conf-name>Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME); 2022 Jul 18&#x2013;22; Taipei, Taiwan</conf-name>. doi:<pub-id pub-id-type="doi">10.1109/ICME52920.2022.9859818</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Sun</surname> <given-names>N</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chai</surname> <given-names>L</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>H</given-names></string-name></person-group>. <article-title>Appearance and geometry transformer for facial expression recognition in the wild</article-title>. <source>Comput Electr Eng</source>. <year>2023</year>;<volume>107</volume>(<issue>2</issue>):<fpage>108583</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compeleceng.2023.108583</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gong</surname> <given-names>W</given-names></string-name>, <string-name><surname>La</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Qian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Hybrid attention-aware learning network for facial expression recognition in the wild</article-title>. <source>Arab J Sci Eng</source>. <year>2024</year>;<volume>49</volume>(<issue>9</issue>):<fpage>12203</fpage>&#x2013;<lpage>17</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s13369-023-08538-6</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>