<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">JAI</journal-id>
<journal-id journal-id-type="nlm-ta">JAI</journal-id>
<journal-id journal-id-type="publisher-id">JAI</journal-id>
<journal-title-group>
<journal-title>Journal on Artificial Intelligence</journal-title>
</journal-title-group>
<issn pub-type="epub">2579-003X</issn>
<issn pub-type="ppub">2579-0021</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">48911</article-id>
<article-id pub-id-type="doi">10.32604/jai.2024.048911</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Detection of Student Engagement in E-Learning Environments Using EfficientnetV2-L Together with RNN-Based Models</article-title>
<alt-title alt-title-type="left-running-head">Detection of Student Engagement in E-Learning Environments Using EfficientnetV2-L Together with RNN-Based Models</alt-title>
<alt-title alt-title-type="right-running-head">Detection of Student Engagement in E-Learning Environments Using EfficientnetV2-L Together with RNN-Based Models</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Shiri</surname><given-names>Farhad Mortezapour</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>GS63904@student.upm.edu.my</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Ahmadi</surname><given-names>Ehsan</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Rezaee</surname><given-names>Mohammadreza</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Perumal</surname><given-names>Thinagaran</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<aff id="aff-1"><label>1</label><institution>Faculty of Computer Science and Information Technology, University Putra Malaysia (UPM)</institution>, <addr-line>Serdang</addr-line>, <country>Malaysia</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Electrical and Computer Engineering, University of Wisconsin</institution>, <addr-line>Madison</addr-line>, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Farhad Mortezapour Shiri. Email: <email>GS63904@student.upm.edu.my</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>24</day><month>4</month><year>2024</year></pub-date>
<volume>6</volume>
<issue>0</issue>
<fpage>85</fpage>
<lpage>103</lpage>
<history>
<date date-type="received">
<day>21</day>
<month>12</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>3</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Shiri et al.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Shiri et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_JAI_48911.pdf"></self-uri>
<abstract>
<p>Automatic detection of student engagement levels from videos, which is a spatio-temporal classification problem, is crucial for enhancing the quality of online education. This paper addresses this challenge by proposing four novel hybrid end-to-end deep learning models designed for the automatic detection of student engagement levels in e-learning videos. The evaluation of these models utilizes the DAiSEE dataset, a public repository capturing student affective states in e-learning scenarios. The initial model integrates EfficientNetV2-L with Gated Recurrent Unit (GRU) and attains an accuracy of 61.45%. Subsequently, the second model combines EfficientNetV2-L with bidirectional GRU (Bi-GRU), yielding an accuracy of 61.56%. The third and fourth models leverage a fusion of EfficientNetV2-L with Long Short-Term Memory (LSTM) and bidirectional LSTM (Bi-LSTM), achieving accuracies of 62.11% and 61.67%, respectively. Our findings demonstrate the viability of these models in effectively discerning student engagement levels, with the EfficientNetV2-L&#x002B;LSTM model emerging as the most proficient, reaching an accuracy of 62.11%. This study underscores the potential of hybrid spatio-temporal networks in automating the detection of student engagement, thereby contributing to advancements in online education quality.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Student engagement detection</kwd>
<kwd>hybrid deep learning models</kwd>
<kwd>computer vision</kwd>
<kwd>EfficientNetV2-L</kwd>
<kwd>online learning environments</kwd>
<kwd>spatio-temporal classification</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>The rise of online education, propelled by advancements in Internet technology, has garnered widespread popularity among students [<xref ref-type="bibr" rid="ref-1">1</xref>]. In contrast to traditional teaching methods, online learning streamlines and enhances the accessibility of educational resources [<xref ref-type="bibr" rid="ref-2">2</xref>]. The global accessibility and affordability of education owe much to the transformative impact of online learning. However, amidst the benefits and growing interest in distance education, a pressing concern revolves around students&#x2019; performance and active engagement in online learning environments [<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<p>Central to effective learning is the concept of student engagement [<xref ref-type="bibr" rid="ref-4">4</xref>], denoting active involvement in situations conducive to high-quality learning outcomes [<xref ref-type="bibr" rid="ref-5">5</xref>]. Actively engaged students generally exhibit better conceptual understanding and learning outcomes [<xref ref-type="bibr" rid="ref-6">6</xref>]. Student engagement encompasses behavioral, cognitive, and emotional states within the learning environment [<xref ref-type="bibr" rid="ref-7">7</xref>]. Behavioral engagement requires active participation in class activities, emphasizing effort and perseverance [<xref ref-type="bibr" rid="ref-8">8</xref>], while cognitive engagement involves learning skills such as perception, storage, processing, and retrieval [<xref ref-type="bibr" rid="ref-9">9</xref>]. Emotional engagement reflects a student&#x2019;s active participation influenced by affective states [<xref ref-type="bibr" rid="ref-10">10</xref>], where positive emotions like happiness and interest enhance focus and engagement, while negative emotions such as boredom and frustration lead to disengagement [<xref ref-type="bibr" rid="ref-11">11</xref>].</p>
<p>In the realm of online learning, challenges such as lack of motivation and focus often arise, directly impacting engagement [<xref ref-type="bibr" rid="ref-12">12</xref>]. Unlike physical classrooms where teachers gauge engagement through facial expressions and social cues such as yawning, body posture, and glued eyes, assessing engagement in online environments proves significantly more intricate. Diverse electronic devices and varied backgrounds further complicate tracking students&#x2019; engagement [<xref ref-type="bibr" rid="ref-13">13</xref>]. A pivotal aspect of enhancing the quality of online learning is the automated prediction of students&#x2019; engagement levels [<xref ref-type="bibr" rid="ref-14">14</xref>]. This holds across various learning environments, encompassing traditional classrooms, massive open online courses (MOOCs), intelligent tutoring systems (ITS), and educational games.</p>
<p>Several methods exist for automating the determination of students&#x2019; engagement in online education, broadly categorized into sensor-based and computer-vision-based approaches. Notably, computer-vision-based approaches, further divided into image-based and video-based methods, have garnered substantial interest. Image-based approaches rely solely on spatial information from a single image or frame, which is a significant limitation: engagement is an affective behavior that is not stable over time and is therefore inherently spatio-temporal. Consequently, video-based methods have emerged as the more efficient and popular choice for detecting students&#x2019; engagement [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
<p>Video-based methods predominantly fall into two categories: Machine learning-based and deep learning-based approaches. Machine learning-based methods extract features and employ handcrafted patterns for engagement estimation [<xref ref-type="bibr" rid="ref-16">16</xref>], while deep learning techniques dynamically learn features from training data, enabling the algorithm to discern subtle variations [<xref ref-type="bibr" rid="ref-17">17</xref>]. Deep learning methods surpass traditional machine learning in tasks requiring affective state prediction. Moreover, deep learning-based facial expression analysis in video data is non-intrusive, automated, and easily implementable [<xref ref-type="bibr" rid="ref-18">18</xref>].</p>
<p>This study aims to propose a new spatio-temporal hybrid deep learning model for detecting and classifying students&#x2019; engagement from video data in online learning environments by combining the advantages of EfficientNetV2-L with four different RNN-based models. The rest of the paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> reviews recent studies on student engagement detection. <xref ref-type="sec" rid="s3">Section 3</xref> delves into the proposed deep learning approach, followed by experimental findings in <xref ref-type="sec" rid="s4">Section 4</xref> and concluding remarks in <xref ref-type="sec" rid="s5">Section 5</xref>.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>In the realm of automatic student engagement detection, two primary methods have emerged: Sensor-based approaches and video-based methods. Sensor-based methods rely on physiological signals, encompassing heart rate variability, skin temperature, blood volume pulse, electrodermal activity (EDA), electrocardiogram (ECG), electromyogram (EMG), and electroencephalogram (EEG) [<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>].</p>
<p>Authors in [<xref ref-type="bibr" rid="ref-21">21</xref>] demonstrated the feasibility of distinguishing engaged and non-engaged students during lectures using wearable electrodermal activity sensors. Employing the Empatica E4 wristband [<xref ref-type="bibr" rid="ref-22">22</xref>], which integrates blood volume pulse, acceleration, peripheral skin temperature, and electrodermal activity sensors, they recorded physiological data to achieve this distinction.</p>
<p>Kerdawy et al. [<xref ref-type="bibr" rid="ref-9">9</xref>] proposed a method for predicting students&#x2019; cognitive states, engagement, and spontaneous attention by combining facial expression modalities and electroencephalography (EEG). They observed strong agreement between EEG and face-based models in engaged classes, with less agreement in non-engaged scenarios.</p>
<p>While some works explore sensor-based methods for detecting student engagement [<xref ref-type="bibr" rid="ref-23">23</xref>&#x2013;<xref ref-type="bibr" rid="ref-25">25</xref>], challenges such as cost, wearability, portability, and mental privacy constraints hinder the implementation of brain-computer interface (BCI) modules in physical or online classrooms [<xref ref-type="bibr" rid="ref-26">26</xref>]. In contrast, video-based methods have gained prominence for their ease of data collection and unobtrusive evaluation processes [<xref ref-type="bibr" rid="ref-27">27</xref>].</p>
<p>Pise et al. [<xref ref-type="bibr" rid="ref-28">28</xref>] suggested a model that combined SqueezeNet [<xref ref-type="bibr" rid="ref-29">29</xref>] for feature extraction and temporal relational network (TRN) for connecting significant transformations between extracted spatio-temporal frames. This model achieved an accuracy of 91.30% on the DISFA&#x002B; dataset [<xref ref-type="bibr" rid="ref-30">30</xref>].</p>
<p>Gupta et al. [<xref ref-type="bibr" rid="ref-31">31</xref>] introduced the DAiSEE dataset, including affective states and engagement levels. They provided baseline results for four-class classification using CNN-based video classification techniques, such as InceptionNet frame level, InceptionNet video level [<xref ref-type="bibr" rid="ref-32">32</xref>], C3D training, C3D fine-tuning [<xref ref-type="bibr" rid="ref-33">33</xref>], and long-term recurrent convolutional networks (LRCN) [<xref ref-type="bibr" rid="ref-34">34</xref>], achieving accuracies of 47.1%, 46.4%, 48.6%, 56.1%, and 57.9%, respectively.</p>
<p>In [<xref ref-type="bibr" rid="ref-35">35</xref>], an inflated 3D convolutional network (I3D) was proposed for predicting students&#x2019; engagement levels, utilizing OpenFace and AlphaPose for feature extraction, with an accuracy of 52.35% on the DAiSEE dataset.</p>
<p>Liao et al. [<xref ref-type="bibr" rid="ref-27">27</xref>] introduced the DFSTN model, combining long short-term memory (LSTM) with global attention (GALN) and pretrained SE-ResNet-50 (SENet) [<xref ref-type="bibr" rid="ref-36">36</xref>] for student engagement prediction. They tested the proposed method on the DAiSEE dataset and achieved an accuracy of 58.84%.</p>
<p>Abedi et al. [<xref ref-type="bibr" rid="ref-37">37</xref>] proposed a new end-to-end spatio-temporal hybrid method based on residual network (ResNet) [<xref ref-type="bibr" rid="ref-38">38</xref>] and temporal convolutional network (TCN) [<xref ref-type="bibr" rid="ref-39">39</xref>] for assessing student engagement in an online learning environment. While the ResNet extracts spatial features from successive video frames, the TCN analyzes the temporal changes across frames to determine the degree of engagement. This model achieved an accuracy of 63.9% on the DAiSEE dataset.</p>
<p>Bajaj et al. [<xref ref-type="bibr" rid="ref-40">40</xref>] utilized a hybrid neural network architecture based on ResNet and temporal convolutional network (TCN) for classifying student engagement, achieving a recognition accuracy of 53.6% on the DAiSEE dataset.</p>
<p>Mehta et al. [<xref ref-type="bibr" rid="ref-41">41</xref>] introduced a three-dimensional DenseNet Self-Attention neural network (3D DenseAttNet) for automatically detecting students&#x2019; engagement in online learning environments. This model is designed to selectively extract relevant high-level intra-frame and inter-frame features from video data using the 3D DenseNet block. The proposed model surpassed the previous state-of-the-art, achieving a recognition accuracy of 63.59% on the DAiSEE dataset.</p>
<p>Gupta et al. [<xref ref-type="bibr" rid="ref-11">11</xref>] presented a deep learning approach centered on analyzing facial emotions to assess the engagement levels of students in real time during online learning. This system employs the faster region-based convolutional neural network (R-CNN) [<xref ref-type="bibr" rid="ref-42">42</xref>] for identifying faces and a modified face-points extractor (MFACXTOR) for pinpointing key facial features. The system was tested using various deep learning architectures including Inception-V3 [<xref ref-type="bibr" rid="ref-32">32</xref>], VGG19 [<xref ref-type="bibr" rid="ref-43">43</xref>], and ResNet-50 [<xref ref-type="bibr" rid="ref-38">38</xref>] to determine the most effective model for accurately classifying real-time student engagement. The results from their experiments indicate that the system attained accuracies of 89.11% with Inception-V3, 90.14% with VGG19, and 92.32% with ResNet-50 on the dataset they developed.</p>
<p>Chen et al. [<xref ref-type="bibr" rid="ref-44">44</xref>] integrated gaze directions and facial expressions as separate elements in a multi-modal deep neural network (MDNN) for predicting student engagement in collaborative learning settings. This multi-faceted approach was tested in an actual collaborative learning context. The findings demonstrate that the model is effective in precisely forecasting student performance within these environments.</p>
<p>Ahmad et al. [<xref ref-type="bibr" rid="ref-45">45</xref>] employed the lightweight MobileNetv2 model for automatic assessment of student engagement. The MobileNetv2 architecture&#x2019;s layers have all been fine-tuned to enhance learning efficiency and adaptability. The model&#x2019;s final layer was modified to classify three distinct output classes, instead of the original 1000 classes used in ImageNet. Their experimental analysis utilized an open-source dataset comprising individuals watching videos in online courses. The performance of lightweight MobileNetv2 was benchmarked against two other established pre-trained networks, ResNet-50 and Inception-V4, with MobileNetv2 achieving a superior average accuracy of 74.55%.</p>
<p>The authors in [<xref ref-type="bibr" rid="ref-46">46</xref>] developed a real-time system to monitor the engagement of student groups by analyzing their facial expressions and identifying affective states such as &#x2018;boredom,&#x2019; &#x2018;confusion,&#x2019; &#x2018;focus,&#x2019; &#x2018;frustration,&#x2019; &#x2018;yawning,&#x2019; and &#x2018;sleepiness,&#x2019; which are crucial in educational settings. This approach involves pre-processing steps like face detection, utilizing a convolutional neural network (CNN) for facial expression recognition, and post-processing for estimating group engagement frame by frame. To train the model, a dataset was compiled featuring the mentioned facial expressions from classroom lectures. This dataset was augmented with samples from three other datasets: BAUM-1 [<xref ref-type="bibr" rid="ref-47">47</xref>], DAiSEE [<xref ref-type="bibr" rid="ref-31">31</xref>], and YawDD [<xref ref-type="bibr" rid="ref-48">48</xref>], to enhance the model&#x2019;s predictive accuracy across various scenarios.</p>
<p>Sharma et al. [<xref ref-type="bibr" rid="ref-49">49</xref>] devised a method that amalgamates data on eye and head movements with facial emotional cues to create an engagement index categorized into three levels: &#x201C;highly engaged,&#x201D; &#x201C;moderately engaged,&#x201D; and &#x201C;completely disengaged.&#x201D; They employed convolutional neural network (CNN) models for classification. Implemented in a standard e-learning context, the system demonstrated its efficacy by accurately determining the engagement level of students, classifying them into one of the three aforementioned categories for each analyzed time segment.</p>
<p>Ikram et al. [<xref ref-type="bibr" rid="ref-50">50</xref>] developed a refined transfer learning approach using a modified VGG16 model, enhanced with an additional layer and meticulously calibrated hyperparameters. This model was designed to assess student engagement in a minimally controlled, real-world classroom setting with 45 students. In evaluating the level of student engagement, the model demonstrated impressive results, achieving 90% accuracy and a computation time of only 0.5 N seconds for distinguishing between engaged and non-engaged students.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology and Proposed Model</title>
<p>The majority of available datasets for detecting student engagement are either privately held or small in scale, making it challenging to benchmark our research. Consequently, we opted to use the public DAiSEE dataset [<xref ref-type="bibr" rid="ref-31">31</xref>] for our evaluation and comparisons. One key limitation of current models for four-level classification on the DAiSEE dataset is their subpar accuracy. To address this issue, we leveraged EfficientNetV2-L [<xref ref-type="bibr" rid="ref-51">51</xref>] for extracting spatial features from video frames and employed four distinct RNN-based models to capture temporal information, thereby enhancing accuracy. Notably, among the various model families, EfficientNetV2 stands out as the top performer, surpassing EfficientNet [<xref ref-type="bibr" rid="ref-52">52</xref>], ResNet [<xref ref-type="bibr" rid="ref-38">38</xref>], DenseNet [<xref ref-type="bibr" rid="ref-53">53</xref>], and Inception [<xref ref-type="bibr" rid="ref-32">32</xref>] models, which contributes to the overall improvement in accuracy. Additionally, the adoption of EfficientNetV2 substantially accelerates the training process.</p>
<p><xref ref-type="fig" rid="fig-1">Fig. 1</xref> illustrates the block diagram of our methodology for automated prediction of student engagement in an online learning environment using video data. The proposed pipeline comprises several essential stages. The first is dataset selection, which involves the careful choice of an appropriate dataset for analysis; the second is the pre-processing stage, which encompasses critical data preparation steps such as data reduction and data normalization.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Block diagram of the proposed methodology for student engagement detection</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-1.tif"/>
</fig>
<p>The third stage, feature extraction and classification, utilizes our proposed hybrid deep learning model to extract features and classify the relevant engagement levels. The fourth, model evaluation, includes the assessment and validation of the proposed model&#x2019;s performance. The final stage, experimental result analysis, examines the outcomes of our experiments to gain insights into student engagement patterns and behavior. This comprehensive methodology is designed to enhance our understanding of student engagement in online learning by leveraging advanced deep learning techniques and rigorous data analysis procedures.</p>
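<p>As an illustration, the pre-processing stage (data reduction and data normalization) can be sketched as follows. This is a minimal example assuming uniform frame sub-sampling and pixel scaling to [0, 1]; the frame count, resolution, and clip length are illustrative placeholders rather than the exact settings of our pipeline.</p>
<preformat>
```python
import numpy as np

def preprocess_video(frames: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Reduce a clip to a fixed number of frames and normalize pixel values.

    frames: array of shape (T, H, W, 3) with uint8 pixels.
    Returns an array of shape (num_frames, H, W, 3) scaled to [0, 1].
    """
    total = frames.shape[0]
    # Data reduction: sample frames uniformly across the whole clip.
    idx = np.linspace(0, total - 1, num_frames).round().astype(int)
    sampled = frames[idx]
    # Data normalization: scale pixel intensities from [0, 255] to [0, 1].
    return sampled.astype(np.float32) / 255.0

# A dummy 300-frame clip (e.g., 10 s at 30 fps) of 64 x 64 RGB frames.
clip = np.random.randint(0, 256, size=(300, 64, 64, 3), dtype=np.uint8)
out = preprocess_video(clip)
print(out.shape)  # (16, 64, 64, 3)
```
</preformat>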
<p>The architecture of the proposed hybrid deep learning model for detecting student engagement levels is depicted in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. Raw video frames serve as input data to the model, generating output across four distinct classes reflecting students&#x2019; levels of engagement. Given the spatio-temporal nature of student engagement detection manifested in a sequence of video frames over time, a comprehensive analysis demands both spatial and temporal considerations. This analysis typically entails the monitoring and evaluation of students&#x2019; conduct within an online learning environment by examining video footage. Regarding the spatial dimension, it involves monitoring the positions of students within the virtual classroom or e-learning platform. This encompasses identifying their screen location and observing visual cues linked to their engagement, such as eye movement and facial expressions. On the other hand, the temporal dimension concentrates on how student engagement evolves throughout an e-learning session over time. This involves tracing fluctuations in engagement levels during lectures, interactive activities, or discussions. Various features are derived from the video data to define students&#x2019; behavior and involvement, encompassing aspects like facial expressions, body language, and interactions with e-learning materials. The extraction and classification of these features employ machine learning and computer vision techniques. This study employs EfficientNetV2-L to extract spatial features from video frames, while four distinct RNN-based models capture temporal information and model the sequence of frames.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The structure of the proposed hybrid model for determining student engagement</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-2.tif"/>
</fig>
<p>The proposed hybrid models include (1) EfficientNetV2-L with gated recurrent unit (GRU), (2) EfficientNetV2-L with bidirectional GRU (Bi-GRU), (3) EfficientNetV2-L with long short-term memory (LSTM), and (4) EfficientNetV2-L with bidirectional LSTM (Bi-LSTM).</p>
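<p>To make the spatio-temporal flow concrete, the following minimal numpy sketch traces the tensor shapes through a hybrid of this kind: a fixed random projection stands in for the EfficientNetV2-L backbone (whose final feature dimension is 1280), a single-layer GRU processes the resulting frame sequence, and a softmax layer maps the last hidden state to the four engagement classes. All weights are random and the hidden size is an illustrative choice; this is a shape-level sketch, not the trained model.</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

T, FEAT, HID, CLASSES = 16, 1280, 256, 4  # frames, feature dim, GRU units, classes

# Stand-in for EfficientNetV2-L: a fixed projection giving one 1280-dim
# spatial feature vector per (flattened) frame.
frames = rng.standard_normal((T, 3 * 64 * 64))
W_cnn = rng.standard_normal((3 * 64 * 64, FEAT)) * 0.01
features = frames @ W_cnn                       # shape (T, FEAT)

# Single-layer GRU over the frame sequence (standard GRU update equations).
Wz, Wr, Wh = (rng.standard_normal((FEAT, HID)) * 0.01 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((HID, HID)) * 0.01 for _ in range(3))
h = np.zeros(HID)
for x in features:
    z = sigmoid(x @ Wz + h @ Uz)                # update gate
    r = sigmoid(x @ Wr + h @ Ur)                # reset gate
    h_new = np.tanh(x @ Wh + (r * h) @ Uh)      # candidate state
    h = (1 - z) * h + z * h_new

# Classification head: last hidden state -> 4 engagement levels.
W_out = rng.standard_normal((HID, CLASSES)) * 0.01
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (4,)
```
</preformat>
<p>The Bi-GRU, LSTM, and Bi-LSTM variants differ only in this temporal head; the bidirectional variants additionally run a second pass over the reversed frame sequence and concatenate the two final states before classification.</p>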
<sec id="s3_1">
<label>3.1</label>
<title>EfficientNetV2</title>
<p>EfficientNetV2 represents an advancement over previous models like DenseNet [<xref ref-type="bibr" rid="ref-53">53</xref>] and EfficientNet [<xref ref-type="bibr" rid="ref-52">52</xref>], demonstrating superior training speed and parameter efficiency. The architecture incorporates mobile inverted bottleneck (MBConv) [<xref ref-type="bibr" rid="ref-54">54</xref>] and fused-MBConv [<xref ref-type="bibr" rid="ref-55">55</xref>] as fundamental building blocks. Pre-training is performed on the ImageNet dataset [<xref ref-type="bibr" rid="ref-56">56</xref>]. The architecture of EfficientNetV2, illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, distinguishes itself from the EfficientNet backbone in four key aspects: (1) increased use of both MBConv and fused-MBConv in the initial layers; (2) a preference for smaller expansion ratios for MBConv; (3) a preference for a smaller kernel size (3 &#x00D7; 3), compensated by an increased number of layers; and (4) elimination of the final stride-1 stage present in the original EfficientNet, likely to reduce memory access overhead and parameter size.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Architecture of EfficientNetV2</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-3.tif"/>
</fig>
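<p>The restriction of fused-MBConv to the early stages can be motivated by a simple parameter count. Ignoring squeeze-and-excitation, normalization, and bias terms for simplicity, a fused-MBConv block replaces the 1 &#x00D7; 1 expansion and depthwise k &#x00D7; k convolution of MBConv with a single regular k &#x00D7; k convolution. As the sketch below shows, the fused variant actually carries more weights; its benefit lies in hardware utilization, since regular convolutions are better optimized on modern accelerators than depthwise ones, so it pays off only in early stages where channel counts, and hence the parameter penalty, are small. The channel counts used here are illustrative.</p>
<preformat>
```python
def mbconv_params(c: int, e: int = 4, k: int = 3) -> int:
    """Weights in an MBConv block: 1x1 expand + depthwise kxk + 1x1 project."""
    return c * (e * c) + (e * c) * k * k + (e * c) * c

def fused_mbconv_params(c: int, e: int = 4, k: int = 3) -> int:
    """Weights in a fused-MBConv block: regular kxk expand + 1x1 project."""
    return c * (e * c) * k * k + (e * c) * c

for c in (24, 96, 192):  # illustrative channel counts
    print(c, mbconv_params(c), fused_mbconv_params(c))
# At c = 24: MBConv has 5472 weights vs. 23040 for fused-MBConv, and the
# gap widens as the channel count grows.
```
</preformat>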
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Long Short-Term Memory (LSTM)</title>
<p>The long short-term memory (LSTM) network, introduced in the seminal work [<xref ref-type="bibr" rid="ref-57">57</xref>], is a sophisticated variant of the recurrent neural network (RNN), designed to tackle the pervasive issue of long-term dependency [<xref ref-type="bibr" rid="ref-58">58</xref>]. Proven to excel at retaining information over extended sequences, LSTM effectively addresses the vanishing gradient problem [<xref ref-type="bibr" rid="ref-59">59</xref>]. At each time step, the LSTM network processes the hidden state from the previous time step together with the current input, producing an output that is passed to the subsequent time step. The final hidden state of the last time step is commonly utilized for classification [<xref ref-type="bibr" rid="ref-60">60</xref>].</p>
<p>The LSTM architecture includes a memory unit denoted as <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>c</mml:mi></mml:math></inline-formula>, a hidden state represented by <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>h</mml:mi></mml:math></inline-formula>, and three distinct gates: The input gate (<inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>i</mml:mi></mml:math></inline-formula>), the forget gate (<inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mi>f</mml:mi></mml:math></inline-formula>), and the output gate (<inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>o</mml:mi></mml:math></inline-formula>). These gates play a crucial role in controlling the flow of information in and out of the memory unit, effectively managing reading and writing operations within the LSTM framework. Specifically, the input gate determines the manner in which the internal state is updated based on the current input and the preceding internal state. Conversely, the forget gate governs the degree to which the previous internal state is retained. Lastly, the output gate modulates the impact of the internal state on the overall system [<xref ref-type="bibr" rid="ref-61">61</xref>]. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> demonstrates how the update process functions within the internal framework of an LSTM. More concretely, at each time step <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:mi>t</mml:mi></mml:math></inline-formula>, the LSTM initially receives an input <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> along with the previous hidden state <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. 
Subsequently, it calculates activations for the gates and proceeds to update both the memory unit to <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and the hidden state to <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. This computational process can be outlined as follows [<xref ref-type="bibr" rid="ref-62">62</xref>]:<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>h</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>h</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The inner structure of an LSTM unit</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-4.tif"/>
</fig>
<p>Here, the symbol <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the logistic sigmoid function defined as <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thinmathspace" /><mml:mo>+</mml:mo><mml:mspace width="thinmathspace" /><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The symbol <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes the point-wise product operation. The parameters <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mi>W</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>b</mml:mi></mml:math></inline-formula> correspond to the weights and biases associated with the three gates and the memory unit.</p>
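For concreteness, Eqs. (1)–(5) can be sketched as a single NumPy time step. This is an illustrative sketch, not the implementation used in this study: the toy dimensions and random weights are assumptions, and the paper's combined weight matrices (e.g., W_xi and W_hi) are split here into an input-weight dictionary `W` and a recurrent-weight dictionary `U`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Input, forget, and output gates, Eqs. (1), (2), and (4).
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    # Memory-cell update, Eq. (3): keep part of the old cell, add new content.
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    # Hidden state, Eq. (5).
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3  # toy sizes chosen purely for illustration
W = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in "ifco"}
U = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in "ifco"}
b = {g: np.zeros(n_hid) for g in "ifco"}
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```

Because the output gate lies in (0, 1) and tanh lies in (−1, 1), the hidden state is always bounded in magnitude by 1, which is one reason stacked LSTM layers remain numerically stable.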
<p>A version of LSTM known as bidirectional long short-term memory (Bi-LSTM) [<xref ref-type="bibr" rid="ref-63">63</xref>] addresses the drawbacks of traditional LSTM architectures by incorporating both preceding and succeeding contexts in tasks involving sequence modeling. Unlike LSTM models, which solely handle input data in a forward direction, Bi-LSTM operates in both forward and backward directions [<xref ref-type="bibr" rid="ref-64">64</xref>].</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Gated Recurrent Unit (GRU)</title>
<p>The gated recurrent unit (GRU) serves as an alternative variant of the traditional recurrent neural network (RNN), aimed at resolving issues related to short-term memory through a design that is less complex than the long short-term memory (LSTM) [<xref ref-type="bibr" rid="ref-65">65</xref>]. By consolidating the input and forget gates found in the LSTM into a single update gate, the GRU improves overall efficiency. Comprising an update gate, a reset gate, and the current memory content, the GRU captures long-term dependencies in sequences: the gates allow data from previous time steps to be selectively modified and utilized [<xref ref-type="bibr" rid="ref-66">66</xref>]. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> provides a visual representation of the GRU unit&#x2019;s architecture.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>The inner structure of a GRU unit [<xref ref-type="bibr" rid="ref-66">66</xref>]</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-5.tif"/>
</fig>
<p>At time t, the GRU cell&#x2019;s activation, denoted as <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, is determined through a weighted mix of its previous activation (<inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>), and a candidate activation (<inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>), as follows [<xref ref-type="bibr" rid="ref-67">67</xref>]:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p>Here, an update gate, denoted as <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, determines the extent to which the unit updates its activation or content. The formulation for this gate is given by:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
<p>This process involves calculating a linear combination of the current state and a newly generated state, a technique reminiscent of LSTM units. However, unlike the LSTM, the GRU lacks a mechanism for regulating how much of its state is revealed, instead fully disclosing its entire state at each update.</p>
<p>The candidate activation, denoted as <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, is calculated in a manner akin to that of a conventional recurrent unit.
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula>where <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents a collection of reset gates and &#x2299; denotes element-wise multiplication. When <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> approaches 0, indicating &#x201C;off,&#x201D; the reset gate essentially causes the unit to behave as if it were processing the initial symbol of an input sequence, allowing it to forget the previously computed state. The calculation of the reset gate, denoted as <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, follows a process similar to that of the update gate.
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula></p>
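Eqs. (6)–(9) can likewise be sketched as one NumPy time step. As with the LSTM sketch, this is illustrative only: the toy dimensions and random weights are assumptions, and the biases are omitted because the formulation above omits them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U):
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev)             # update gate, Eq. (7)
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev)             # reset gate, Eq. (9)
    h_cand = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev))  # candidate, Eq. (8)
    # Eq. (6): linear interpolation between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3  # toy sizes chosen purely for illustration
W = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in "zrh"}
U = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in "zrh"}
h = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), W, U)
```

Note that the GRU maintains only the hidden state `h`, whereas the LSTM carries a separate memory cell `c`, which is the structural source of the GRU's lower operation count discussed below.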
<p>GRU models, which require fewer tensor operations, provide a simpler option compared to LSTM, leading to quicker training times. Nonetheless, whether to use GRU or LSTM is contingent on the particular use case and the nature of the problem being addressed [<xref ref-type="bibr" rid="ref-58">58</xref>].</p>
<p>A notable improvement to the GRU architecture is the Bi-GRU [<xref ref-type="bibr" rid="ref-68">68</xref>], which successfully addresses specific limitations of the standard GRU by integrating information from both past and future contexts in sequential modeling tasks. In contrast to the GRU, which handles input sequences exclusively in a forward direction, the Bi-GRU operates in both forward and backward directions. In a Bi-GRU model, two parallel GRU layers are employed, with one processing the input data in the forward direction and the other handling it in reverse [<xref ref-type="bibr" rid="ref-69">69</xref>].</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Results</title>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset</title>
<p>The prevalent datasets for student engagement detection are largely private or limited in size, posing challenges in benchmarking our research against existing work. Therefore, in this study, we conducted experiments and evaluated the proposed models using the DAiSEE dataset (Dataset for Affective States in E-Environments) [<xref ref-type="bibr" rid="ref-31">31</xref>]. The dataset comprises 112 currently enrolled students, aged between 18 and 30, with a predominantly Asian demographic of 32 females and 80 males. A total of 9068 video clips, each lasting 10 seconds, were captured in six distinct locations, such as dorm rooms, labs, and libraries, under three lighting conditions: bright, dark, and mild. Depending on the lighting conditions and on whether indoor or outdoor light sources are used, the captured images and videos take on light properties that are inextricably linked to the original image [<xref ref-type="bibr" rid="ref-70">70</xref>]. The DAiSEE dataset covers four affective states, namely confusion, boredom, engagement, and frustration, each with four levels: &#x201C;very low,&#x201D; &#x201C;low,&#x201D; &#x201C;high,&#x201D; and &#x201C;very high.&#x201D; This paper focuses predominantly on assessing student engagement levels during online learning. <xref ref-type="table" rid="table-1">Table 1</xref> presents the detailed distribution of engagement levels.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Data distribution on the DAiSEE dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Level</th>
<th>Train</th>
<th>Validation</th>
<th>Test</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Very high</td>
<td>2494</td>
<td>450</td>
<td>814</td>
<td>3758</td>
</tr>
<tr>
<td>High</td>
<td>2617</td>
<td>813</td>
<td>882</td>
<td>4312</td>
</tr>
<tr>
<td>Low</td>
<td>213</td>
<td>143</td>
<td>84</td>
<td>440</td>
</tr>
<tr>
<td>Very low</td>
<td>34</td>
<td>23</td>
<td>4</td>
<td>61</td>
</tr>
<tr>
<td>Total</td>
<td>5358</td>
<td>1429</td>
<td>1784</td>
<td>8571</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Result</title>
<p>We evaluated the four proposed deep learning models, namely EfficientNetV2-L&#x002B;GRU, EfficientNetV2-L&#x002B;Bi-GRU, EfficientNetV2-L&#x002B;LSTM, and EfficientNetV2-L&#x002B;Bi-LSTM, on the DAiSEE dataset to investigate the effectiveness of each model. Before experimentation, we determined how many frames from each video to feed into the model: representing the spatial features of a video by a sequence of k frames, we aimed to balance temporal information against training time. In this study, we selected 50 frames per video and resized them to 224 &#x00D7; 224, generating 50 &#x00D7; 3 &#x00D7; 224 &#x00D7; 224 (L &#x00D7; C &#x00D7; H &#x00D7; W) tensors as inputs to the model. The EfficientNetV2-L model extracts a feature vector of dimension 1280 from each successive frame and feeds the resulting sequence to the RNN-based module. The parameter values used are provided in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
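The tensor shapes in this pipeline can be traced with a minimal sketch. Here `extract_features` is a hypothetical stand-in for the pretrained EfficientNetV2-L backbone: any per-frame extractor mapping a 3 &#x00D7; 224 &#x00D7; 224 frame to a 1280-dimensional vector fits the description above.

```python
import numpy as np

def extract_features(frame):
    # Hypothetical placeholder for EfficientNetV2-L inference on one frame;
    # the real backbone would return learned features, not zeros.
    return np.zeros(1280, dtype=np.float32)

n_frames = 50
# One video as an L x C x H x W tensor, as described in the text.
video = np.zeros((n_frames, 3, 224, 224), dtype=np.float32)
# Per-frame spatial features stacked into the sequence fed to the RNN module.
features = np.stack([extract_features(frame) for frame in video])
```

The resulting `(50, 1280)` array matches the RNN-module input shapes listed in Table 2.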
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>The parameter values used for experiments</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>EfficientNetV2L</th>
<th>GRU</th>
<th>Bi-GRU</th>
<th>LSTM</th>
<th>Bi-LSTM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Input</td>
<td>50 &#x00D7; 224 &#x00D7; 224 &#x00D7; 3</td>
<td>50 &#x00D7; 1280</td>
<td>50 &#x00D7; 1280</td>
<td>50 &#x00D7; 1280</td>
<td>50 &#x00D7; 1280</td>
</tr>
<tr>
<td>Layer 1</td>
<td></td>
<td>Unit &#x003D; 100, return_sequences &#x003D; True</td>
<td>Unit &#x003D; 100, return_sequences &#x003D; True</td>
<td>Unit &#x003D; 100, return_sequences &#x003D; True</td>
<td>Unit &#x003D; 100, return_sequences &#x003D; True</td>
</tr>
<tr>
<td>Layer 2</td>
<td></td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
</tr>
<tr>
<td>Layer 3</td>
<td></td>
<td>Unit &#x003D; 50</td>
<td>Unit &#x003D; 50</td>
<td>Unit &#x003D; 50</td>
<td>Unit &#x003D; 50</td>
</tr>
<tr>
<td>Layer 4</td>
<td></td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
<td>Dropout: 0.4</td>
</tr>
<tr>
<td>Layer 5</td>
<td></td>
<td>Dense: 16 units, Activation: ReLU</td>
<td>Dense: 16 units, Activation: ReLU</td>
<td>Dense: 16 units, Activation: ReLU</td>
<td>Dense: 16 units, Activation: ReLU</td>
</tr>
<tr>
<td>Output</td>
<td>50 &#x00D7; 1280</td>
<td>Dense: 4 units, Activation: softmax</td>
<td>Dense: 4 units, Activation: softmax</td>
<td>Dense: 4 units, Activation: softmax</td>
<td>Dense: 4 units, Activation: softmax</td>
</tr>
<tr>
<td></td>
<td>weights &#x003D; imagenet</td>
<td colspan="4">Loss function: &#x201C;binary_crossentropy&#x201D;, Optimizer function: &#x201C;Adam&#x201D;</td>
</tr>
<tr>
<td/>
<td>pooling &#x003D; &#x201C;avg&#x201D;</td>
<td colspan="4">Batch size &#x003D; 32, Epochs &#x003D; 10</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The performance of the proposed models is summarized in <xref ref-type="table" rid="table-3">Table 3</xref>. The results highlight the EfficientNetV2-L&#x002B;LSTM model as the top performer among the proposed models, achieving an accuracy of 62.11%. Accuracy is measured as the ratio of correct predictions to the total number of predictions [<xref ref-type="bibr" rid="ref-71">71</xref>].</p>
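The accuracy metric just defined can be computed as follows; this is a generic sketch, not the evaluation script used in the study, and the label encoding is an assumption for illustration.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the reference labels.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example with the four engagement levels encoded as 0-3.
acc = accuracy([0, 1, 2, 3, 1], [0, 1, 2, 0, 1])  # 4 of 5 correct -> 0.8
```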
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Accuracies of the four proposed models</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>EfficientNetV2-L&#x002B;GRU</td>
<td>61.45%</td>
</tr>
<tr>
<td>EfficientNetV2-L&#x002B;Bi-GRU</td>
<td>61.56%</td>
</tr>
<tr>
<td>EfficientNetV2-L&#x002B;LSTM</td>
<td>62.11%</td>
</tr>
<tr>
<td>EfficientNetV2-L&#x002B;Bi-LSTM</td>
<td>61.67%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> illustrates accuracy and validation-accuracy diagrams, offering a visual representation of how the proposed models perform during training and testing across multiple epochs. A consistent trend emerges across the charts: the training accuracy starts at approximately 50% in the first epoch and gradually rises to around 55% by the eighth epoch, after which it stabilizes. In graph (d), however, training accuracy declines after the eighth epoch. The validation accuracy fluctuates noticeably in all the graphs. These fluctuations stem from the dataset&#x2019;s inherent imbalance in engagement-level distribution: the number of samples with low engagement levels is considerably smaller than the number with high engagement levels. With such a skewed distribution, many minority-level samples are likely to be misclassified as belonging to the majority engagement levels. Nonetheless, these fluctuations are less pronounced in graphs (b) and (d), which suggests that bidirectional RNN models are more stable than unidirectional RNN models when dealing with imbalanced datasets.</p>
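The skew behind these fluctuations can be quantified directly from the training-split counts in Table 1:

```python
# Training-split counts per engagement level, taken from Table 1.
train_counts = {"very high": 2494, "high": 2617, "low": 213, "very low": 34}
total = sum(train_counts.values())  # 5358 training clips
shares = {level: 100.0 * count / total for level, count in train_counts.items()}
# The "low" and "very low" classes together account for under 5% of the
# training data, which is the imbalance discussed above.
```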
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>The accuracy diagram of the four proposed models on training and testing</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-6.tif"/>
</fig>
<p>Additionally, <xref ref-type="fig" rid="fig-7">Fig. 7</xref> presents loss and validation-loss diagrams, visually representing how the loss values change during training and evaluation for the different models. The loss function quantifies the dissimilarity between predicted and actual labels. The graphs show that the training loss diminishes consistently across all four models, with graph (c) exhibiting the lowest training loss, hovering around 0.51. Moreover, the validation loss in graphs (a) and (b) is more stable than in graphs (c) and (d). However, the final validation loss values in graphs (c) and (d), both approximately 0.51, are lower than those in the other two graphs, which are approximately 0.52. This observation indicates that, in this specific context, the LSTM models outperform the GRU models.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>The loss diagram of the four proposed models on training and testing</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="JAI_48911-fig-7.tif"/>
</fig>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison Performance</title>
<p><xref ref-type="table" rid="table-4">Table 4</xref> compares the outcomes of our four proposed hybrid models with previous studies utilizing the DAiSEE dataset. In the benchmark study [<xref ref-type="bibr" rid="ref-31">31</xref>], diverse deep learning models, including video-level InceptionNet, C3D fine-tuning, and the long-term recurrent convolutional network (LRCN), were tested; LRCN led with an accuracy of 57.90%. Several other models, such as the inflated 3D convolutional network (I3D) [<xref ref-type="bibr" rid="ref-35">35</xref>], convolutional 3D (C3D) neural networks with focal loss [<xref ref-type="bibr" rid="ref-72">72</xref>], ResNet&#x002B;TCN with weighted loss [<xref ref-type="bibr" rid="ref-37">37</xref>], and ResNet&#x002B;TCN [<xref ref-type="bibr" rid="ref-40">40</xref>], were introduced in subsequent works, yet none of them surpassed the LRCN baseline. DFSTN [<xref ref-type="bibr" rid="ref-27">27</xref>] then surpassed LRCN with an accuracy of 58.84%, while the deep engagement recognition network (DERN) [<xref ref-type="bibr" rid="ref-73">73</xref>], which combines temporal convolution, bidirectional LSTM, and an attention mechanism, achieved 60%, a 1.16% improvement over DFSTN. The Neural Turing Machine [<xref ref-type="bibr" rid="ref-74">74</xref>] reached an accuracy of 61.3%, exceeding DERN. Notably, the proposed EfficientNetV2-L&#x002B;LSTM model, achieving an accuracy of 62.11%, outperformed LRCN and the majority of contemporary models. However, DenseAttNet [<xref ref-type="bibr" rid="ref-41">41</xref>], at 63.59%, and ResNet&#x002B;TCN [<xref ref-type="bibr" rid="ref-37">37</xref>], at 63.90%, remain ahead of our best model. This comparative analysis underscores that our proposed models achieve competitive accuracy in detecting student engagement on the DAiSEE dataset.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Comparison of proposed models and previous works on DAiSEE</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>InceptionNet [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>46.40%</td>
</tr>
<tr>
<td>C3D fine-tuning [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>56.10%</td>
</tr>
<tr>
<td>LRCN [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>57.90%</td>
</tr>
<tr>
<td>I3D [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>52.35%</td>
</tr>
<tr>
<td>C3D (FL) [<xref ref-type="bibr" rid="ref-72">72</xref>]</td>
<td>56.20%</td>
</tr>
<tr>
<td>ResNet&#x002B;TCN [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>53.60%</td>
</tr>
<tr>
<td>DFSTN [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td>58.84%</td>
</tr>
<tr>
<td>DERN [<xref ref-type="bibr" rid="ref-73">73</xref>]</td>
<td>60.00%</td>
</tr>
<tr>
<td>Neural turing machine [<xref ref-type="bibr" rid="ref-74">74</xref>]</td>
<td>61.30%</td>
</tr>
<tr>
<td>ResNet&#x002B;TCN with weighted loss [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>53.70%</td>
</tr>
<tr>
<td>C3D&#x002B;TCN [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>59.97%</td>
</tr>
<tr>
<td>ResNet&#x002B;LSTM [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>61.15%</td>
</tr>
<tr>
<td>ResNet&#x002B;TCN [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>63.90%</td>
</tr>
<tr>
<td>DenseAttNet [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>63.59%</td>
</tr>
<tr>
<td><bold>EfficientNetV2-L&#x002B;GRU (proposed)</bold></td>
<td><bold>61.45%</bold></td>
</tr>
<tr>
<td><bold>EfficientNetV2-L&#x002B;Bi-GRU (proposed)</bold></td>
<td><bold>61.56%</bold></td>
</tr>
<tr>
<td><bold>EfficientNetV2-L&#x002B;LSTM (proposed)</bold></td>
<td><bold>62.11%</bold></td>
</tr>
<tr>
<td><bold>EfficientNetV2-L&#x002B;Bi-LSTM (proposed)</bold></td>
<td><bold>61.67%</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, our primary objective was to address the challenge teachers face in accurately and promptly detecting their students&#x2019; engagement in online learning. To achieve this, we introduced four hybrid spatio-temporal models for detecting student engagement from video in online learning environments: EfficientNetV2-L combined with a gated recurrent unit (GRU), with a bidirectional GRU (Bi-GRU), with a long short-term memory (LSTM), and with a bidirectional LSTM (Bi-LSTM).</p>
<p>The EfficientNetV2-L played a pivotal role in spatial feature extraction, while GRU, Bidirectional GRU, LSTM, and Bidirectional LSTM were employed to capture temporal information from sequential data. Our experimentation, conducted on the DAiSEE dataset featuring four levels of student engagement, demonstrated that the proposed models exhibited superior accuracy compared to the majority of previous works utilizing the same dataset. Notably, the EfficientNetV2-L&#x002B;LSTM model emerged as the top performer, achieving an accuracy of 62.11%.</p>
<p>Despite these promising results, certain limitations exist in the current study. To address these, future research will refine the automatic recognition of learning engagement by implementing a robust face detector to crop face regions from each frame during pre-processing. Additionally, the incorporation of attention mechanisms in the proposed models will be explored to further enhance accuracy. Furthermore, our commitment to advancing research in this domain involves testing the suggested models on diverse datasets, ensuring broader applicability and generalizability.</p>
<p>In essence, this study contributes valuable insights into automating the detection of student engagement in online learning environments. The demonstrated effectiveness of our hybrid models highlights their potential to provide teachers with accurate assessments of student engagement, thus contributing to the ongoing efforts to enhance the quality of online education.</p>
</sec>
</body>
<back>
<ack>
<p>The authors would like to express sincere gratitude to all the individuals who have contributed to the completion of this research paper. Their unwavering support, valuable insights, and encouragement have been instrumental in making this endeavor a success.</p>
</ack>
<sec><title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design: F. M. Shiri, E. Ahmadi, M. Rezaee; data collection: F. M. Shiri; analysis and interpretation of results: F. M. Shiri, E. Ahmadi; draft manuscript preparation: F. M. Shiri, E. Ahmadi, M. Rezaee, T. Perumal. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The code used and/or analyzed during this research are available from the corresponding author upon reasonable request. Data used in this study can be accessed via the following link: <ext-link ext-link-type="uri" xlink:href="https://people.iith.ac.in/vineethnb/resources/daisee/index.html">https://people.iith.ac.in/vineethnb/resources/daisee/index.html</ext-link>.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Haleem</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Javaid</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Qadri</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Suman</surname></string-name></person-group>, &#x201C;<article-title>Understanding the role of digital technologies in education: A review</article-title>,&#x201D; <source>Sustain. Oper. Comput.</source>, vol. <volume>3</volume>, no. <issue>4</issue>, pp. <fpage>275</fpage>&#x2013;<lpage>285</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1016/j.susoc.2022.05.004</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Blagoev</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Vassileva</surname></string-name>, and <string-name><given-names>V.</given-names> <surname>Monov</surname></string-name></person-group>, &#x201C;<article-title>A model for e-learning based on the knowledge of learners</article-title>,&#x201D; <source>Cybernet. Inf. Technol.</source>, vol. <volume>21</volume>, no. <issue>2</issue>, pp. <fpage>121</fpage>&#x2013;<lpage>135</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.2478/cait-2021-0023</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Alhothali</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Albsisi</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Assalahi</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Aldosemani</surname></string-name></person-group>, &#x201C;<article-title>Predicting student outcomes in online courses using machine learning techniques: A review</article-title>,&#x201D; <source>Sustain.</source>, vol. <volume>14</volume>, no. <issue>10</issue>, pp. <fpage>6199</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.3390/su14106199</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Redmond</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Abawi</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Brown</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Henderson</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Heffernan</surname></string-name></person-group>, &#x201C;<article-title>An online engagement framework for higher education</article-title>,&#x201D; <source>Online Learn. J.</source>, vol. <volume>22</volume>, no. <issue>1</issue>, pp. <fpage>183</fpage>&#x2013;<lpage>204</lpage>, <year>2018</year>. doi: <pub-id pub-id-type="doi">10.24059/olj.v22i1.1175</pub-id></mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N. A.</given-names> <surname>Johar</surname></string-name>, <string-name><given-names>S. N.</given-names> <surname>Kew</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Tasir</surname></string-name>, and <string-name><given-names>E.</given-names> <surname>Koh</surname></string-name></person-group>, &#x201C;<article-title>Learning analytics on student engagement to enhance students&#x2019; learning performance: A systematic review</article-title>,&#x201D; <source>Sustain.</source>, vol. <volume>15</volume>, no. <issue>10</issue>, pp. <fpage>7849</fpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.3390/su15107849</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. N.</given-names> <surname>Kew</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Tasir</surname></string-name></person-group>, &#x201C;<article-title>Analysing students&#x2019; cognitive engagement in e-learning discussion forums through content analysis</article-title>,&#x201D; <source>Knowl. Manage. E-Learn.</source>, vol. <volume>13</volume>, no. <issue>1</issue>, pp. <fpage>39</fpage>&#x2013;<lpage>57</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Pilotti</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Hardy</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Murphy</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Vincent</surname></string-name></person-group>, &#x201C;<article-title>Factors related to cognitive, emotional, and behavioral engagement in the online asynchronous classroom</article-title>,&#x201D; <source>Int. J. Teach. Learn. High. Edu.</source>, vol. <volume>29</volume>, no. <issue>1</issue>, pp. <fpage>145</fpage>&#x2013;<lpage>153</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Okur</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Alyuz</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Aslan</surname></string-name>, <string-name><given-names>U.</given-names> <surname>Genc</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Tanriover</surname></string-name>, and <string-name><given-names>A. A.</given-names> <surname>Esme</surname></string-name></person-group>, &#x201C;<article-title>Behavioral engagement detection of students in the wild</article-title>,&#x201D; in <conf-name>Artif. Intell. Edu.: 18th Int. Conf.</conf-name>, <publisher-loc>Wuhan, China</publisher-loc>, <year>Jun. 28&#x2013;Jul. 01, 2017</year>, pp. <fpage>250</fpage>&#x2013;<lpage>261</lpage>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>El Kerdawy</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>The automatic detection of cognition using EEG and facial expressions</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>20</volume>, no. <issue>12</issue>, pp. <fpage>3516</fpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.3390/s20123516</pub-id>; <pub-id pub-id-type="pmid">32575909</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Mukhopadhyay</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Pal</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Nayyar</surname></string-name>, <string-name><given-names>P. K. D.</given-names> <surname>Pramanik</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Dasgupta</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Choudhury</surname></string-name></person-group>, &#x201C;<article-title>Facial emotion detection to assess learner&#x2019;s state of mind in an online learning system</article-title>,&#x201D; in <conf-name>Proc. 2020 5th Int. Conf. Intell. Inf. Technol.</conf-name>, <year>2020</year>, pp. <fpage>107</fpage>&#x2013;<lpage>115</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kumar</surname></string-name>, and <string-name><given-names>R. K.</given-names> <surname>Tekchandani</surname></string-name></person-group>, &#x201C;<article-title>Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>82</volume>, no. <issue>8</issue>, pp. <fpage>11365</fpage>&#x2013;<lpage>11394</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s11042-022-13558-9</pub-id>; <pub-id pub-id-type="pmid">36105662</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M. A.</given-names> <surname>Al Mamun</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Lawrie</surname></string-name></person-group>, &#x201C;<article-title>Factors affecting student behavioural engagement in an inquiry-based online learning environment</article-title>,&#x201D; <year>2021</year>. doi: <pub-id pub-id-type="doi">10.21203/rs.3.rs-249144/v1</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Lan</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>K. E.</given-names> <surname>Barner</surname></string-name>, and <string-name><given-names>C.</given-names> <surname>Boncelet</surname></string-name></person-group>, &#x201C;<article-title>Multi-rate attention based GRU model for engagement prediction</article-title>,&#x201D; in <conf-name>Proc. 2020 Int. Conf. Multi. Interact.</conf-name>, <year>2020</year>, pp. <fpage>841</fpage>&#x2013;<lpage>848</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Whitehill</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Serpell</surname></string-name>, <string-name><given-names>Y. C.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Foster</surname></string-name>, and <string-name><given-names>J. R.</given-names> <surname>Movellan</surname></string-name></person-group>, &#x201C;<article-title>The faces of engagement: Automatic recognition of student engagement from facial expressions</article-title>,&#x201D; <source>IEEE Trans. Affect. Comput.</source>, vol. <volume>5</volume>, no. <issue>1</issue>, pp. <fpage>86</fpage>&#x2013;<lpage>98</lpage>, <year>2014</year>. doi: <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2316163</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Selim</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Elkabani</surname></string-name>, and <string-name><given-names>M. A.</given-names> <surname>Abdou</surname></string-name></person-group>, &#x201C;<article-title>Students engagement level detection in online e-learning using hybrid EfficientNetB7 together with TCN, LSTM, and Bi-LSTM</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>10</volume>, pp. <fpage>99573</fpage>&#x2013;<lpage>99583</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2022.3206779</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Bosch</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Automatic detection of learning-centered affective states in the wild</article-title>,&#x201D; in <conf-name>Proc. 20th Int. Conf. Intell. User Interfaces</conf-name>, <year>2015</year>, pp. <fpage>379</fpage>&#x2013;<lpage>388</lpage>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Learning affective video features for facial expression recognition via hybrid deep learning</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>32297</fpage>&#x2013;<lpage>32304</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2901521</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Dewan</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Murshed</surname></string-name>, and <string-name><given-names>F.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>Engagement detection in online learning: A review</article-title>,&#x201D; <source>Smart Learn. Environ.</source>, vol. <volume>6</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>20</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1186/s40561-018-0080-z</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Monkaresi</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Bosch</surname></string-name>, <string-name><given-names>R. A.</given-names> <surname>Calvo</surname></string-name>, and <string-name><given-names>S. K.</given-names> <surname>D&#x2019;Mello</surname></string-name></person-group>, &#x201C;<article-title>Automated detection of engagement using video-based estimation of facial expressions and heart rate</article-title>,&#x201D; <source>IEEE Trans. Affect. Comput.</source>, vol. <volume>8</volume>, no. <issue>1</issue>, pp. <fpage>15</fpage>&#x2013;<lpage>28</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.1109/TAFFC.2016.2515084</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Bustos-L&#x00F3;pez</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Cruz-Ram&#x00ED;rez</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Guerra-Hern&#x00E1;ndez</surname></string-name>, <string-name><given-names>L. N.</given-names> <surname>S&#x00E1;nchez-Morales</surname></string-name>, <string-name><given-names>N. A.</given-names> <surname>Cruz-Ramos</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Alor-Hern&#x00E1;ndez</surname></string-name></person-group>, &#x201C;<article-title>Wearables for engagement detection in learning environments: A review</article-title>,&#x201D; <source>Biosens.</source>, vol. <volume>12</volume>, no. <issue>7</issue>, pp. <fpage>509</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.3390/bios12070509</pub-id>; <pub-id pub-id-type="pmid">35884312</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Di Lascio</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Gashi</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Santini</surname></string-name></person-group>, &#x201C;<article-title>Unobtrusive assessment of students&#x2019; emotional engagement during lectures using electrodermal activity sensors</article-title>,&#x201D; <source>Proc. ACM Interact., Mobile, Wear. Ubiquit. Technol.</source>, vol. <volume>2</volume>, no. <issue>3</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>21</lpage>, <year>2018</year>. doi: <pub-id pub-id-type="doi">10.1145/3264913</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Garbarino</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lai</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Bender</surname></string-name>, <string-name><given-names>R. W.</given-names> <surname>Picard</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Tognetti</surname></string-name></person-group>, &#x201C;<article-title>Empatica E3&#x2014;A wearable wireless multi-sensor device for real-time computerized biofeedback and data acquisition</article-title>,&#x201D; in <conf-name>4th Int. Conf. Wirel. Mobile Commun. Healthcare-Transf. Healthcare Innov. Mobile Wirel. Technol. (MOBIHEALTH), IEEE</conf-name>, <year>2014</year>, pp. <fpage>39</fpage>&#x2013;<lpage>42</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Al-Alwani</surname></string-name></person-group>, &#x201C;<article-title>A combined approach to improve supervised e-learning using multi-sensor student engagement analysis</article-title>,&#x201D; <source>Am. J. Appl. Sci.</source>, vol. <volume>13</volume>, no. <issue>12</issue>, pp. <fpage>1377</fpage>&#x2013;<lpage>1384</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.3844/ajassp.2016.1377.1384</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Apicella</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Arpaia</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Frosolone</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Improta</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Moccaldi</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Pollastro</surname></string-name></person-group>, &#x201C;<article-title>EEG-based measurement system for monitoring student engagement in learning 4.0</article-title>,&#x201D; <source>Sci. Rep.</source>, vol. <volume>12</volume>, no. <issue>1</issue>, pp. <fpage>5857</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1038/s41598-022-09578-y</pub-id>; <pub-id pub-id-type="pmid">35393470</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. K.</given-names> <surname>Darnell</surname></string-name> and <string-name><given-names>P. A.</given-names> <surname>Krieg</surname></string-name></person-group>, &#x201C;<article-title>Student engagement, assessed using heart rate, shows no reset following active learning sessions in lectures</article-title>,&#x201D; <source>PLoS One</source>, vol. <volume>14</volume>, no. <issue>12</issue>, pp. <fpage>e0225709</fpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0225709</pub-id>; <pub-id pub-id-type="pmid">31790461</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wang</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Ten challenges for EEG-based affective computing</article-title>,&#x201D; <source>Brain Sci. Adv.</source>, vol. <volume>5</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>20</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1177/2096595819896200</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Liao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liang</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Pan</surname></string-name></person-group>, &#x201C;<article-title>Deep facial spatiotemporal network for engagement prediction in online learning</article-title>,&#x201D; <source>Appl. Intell.</source>, vol. <volume>51</volume>, no. <issue>10</issue>, pp. <fpage>6609</fpage>&#x2013;<lpage>6621</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1007/s10489-020-02139-8</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Pise</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Vadapalli</surname></string-name>, and <string-name><given-names>I.</given-names> <surname>Sanders</surname></string-name></person-group>, &#x201C;<article-title>Facial emotion recognition using temporal relational network: An application to e-learning</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>81</volume>, no. <issue>19</issue>, pp. <fpage>26633</fpage>&#x2013;<lpage>26653</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>F. N.</given-names> <surname>Iandola</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>M. W.</given-names> <surname>Moskewicz</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Ashraf</surname></string-name>, <string-name><given-names>W. J.</given-names> <surname>Dally</surname></string-name>, and <string-name><given-names>K.</given-names> <surname>Keutzer</surname></string-name></person-group>, &#x201C;<article-title>SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &#x003C;0.5 MB model size</article-title>,&#x201D; <comment>arXiv preprint arXiv:1602.07360</comment>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Mavadati</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Sanger</surname></string-name>, and <string-name><given-names>M. H.</given-names> <surname>Mahoor</surname></string-name></person-group>, &#x201C;<article-title>Extended DISFA dataset: Investigating posed and spontaneous facial expressions</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn. Workshops</conf-name>, <year>2016</year>, pp. <fpage>1</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>A.</given-names> <surname>D&#x2019;Cunha</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Awasthi</surname></string-name>, and <string-name><given-names>V.</given-names> <surname>Balasubramanian</surname></string-name></person-group>, &#x201C;<article-title>DAiSEE: Towards user engagement recognition in the wild</article-title>,&#x201D; <comment>arXiv preprint arXiv:1609.01885</comment>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Vanhoucke</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Shlens</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Wojna</surname></string-name></person-group>, &#x201C;<article-title>Rethinking the inception architecture for computer vision</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2016</year>, pp. <fpage>2818</fpage>&#x2013;<lpage>2826</lpage>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Tran</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Bourdev</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Fergus</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Torresani</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Paluri</surname></string-name></person-group>, &#x201C;<article-title>Learning spatiotemporal features with 3D convolutional networks</article-title>,&#x201D; in <conf-name>Proc. IEEE Int. Conf. Comput. Vis.</conf-name>, <year>2015</year>, pp. <fpage>4489</fpage>&#x2013;<lpage>4497</lpage>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Long-term recurrent convolutional networks for visual recognition and description</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2015</year>, pp. <fpage>2625</fpage>&#x2013;<lpage>2634</lpage>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Xia</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>An novel end-to-end network for automatic student engagement recognition</article-title>,&#x201D; in <conf-name>IEEE 9th Int. Conf. Electron. Inf. Emerg. Commun. (ICEIEC)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2019</year>, pp. <fpage>342</fpage>&#x2013;<lpage>345</lpage>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Shen</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Squeeze-and-excitation networks</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2018</year>, pp. <fpage>7132</fpage>&#x2013;<lpage>7141</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Abedi</surname></string-name> and <string-name><given-names>S. S.</given-names> <surname>Khan</surname></string-name></person-group>, &#x201C;<article-title>Improving state-of-the-art in detecting student engagement with ResNet and TCN hybrid network</article-title>,&#x201D; in <conf-name>2021 18th Conf. Robots Vis. (CRV)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2021</year>, pp. <fpage>151</fpage>&#x2013;<lpage>157</lpage>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Deep residual learning for image recognition</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2016</year>, pp. <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>J. Z.</given-names> <surname>Kolter</surname></string-name>, and <string-name><given-names>V.</given-names> <surname>Koltun</surname></string-name></person-group>, &#x201C;<article-title>An empirical evaluation of generic convolutional and recurrent networks for sequence modeling</article-title>,&#x201D; <comment>arXiv preprint arXiv:1803.01271</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K. K.</given-names> <surname>Bajaj</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Ghergulescu</surname></string-name>, and <string-name><given-names>A. N.</given-names> <surname>Moldovan</surname></string-name></person-group>, &#x201C;<article-title>Classification of student affective states in online learning using neural networks</article-title>,&#x201D; in <conf-name>2022 17th Int. Workshop Semantic Soc. Media Adapt. Personal. (SMAP)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2022</year>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N. K.</given-names> <surname>Mehta</surname></string-name>, <string-name><given-names>S. S.</given-names> <surname>Prasad</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Saurav</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Saini</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Three-dimensional DenseNet self-attention neural network for automatic detection of student&#x2019;s engagement</article-title>,&#x201D; <source>Appl. Intell.</source>, vol. <volume>52</volume>, no. <issue>12</issue>, pp. <fpage>13803</fpage>&#x2013;<lpage>13823</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1007/s10489-022-03200-4</pub-id>; <pub-id pub-id-type="pmid">35340984</pub-id>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Hai</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Guo</surname></string-name></person-group>, &#x201C;<article-title>Face detection with improved face R-CNN training method</article-title>,&#x201D; in <conf-name>Proc. 3rd Int. Conf. Control Comput. Vis.</conf-name>, <year>2020</year>, pp. <fpage>22</fpage>&#x2013;<lpage>25</lpage>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Simonyan</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Zisserman</surname></string-name></person-group>, &#x201C;<article-title>Very deep convolutional networks for large-scale image recognition</article-title>,&#x201D; <comment>arXiv preprint arXiv:1409.1556</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>MDNN: Predicting student engagement via gaze direction and facial expression in collaborative learning</article-title>,&#x201D; <source>Comput. Model. Eng. Sci.</source>, vol. <volume>136</volume>, no. <issue>1</issue>, pp. <fpage>381</fpage>&#x2013;<lpage>401</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.32604/cmes.2023.023234</pub-id>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Ahmad</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Khan</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Student engagement prediction in MOOCs using deep learning</article-title>,&#x201D; in <conf-name>2023 Int. Conf. Emerg. Smart Comput. Inf. (ESCI)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2023</year>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Pabba</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Kumar</surname></string-name></person-group>, &#x201C;<article-title>An intelligent system for monitoring students&#x2019; engagement in large classroom teaching through facial expression recognition</article-title>,&#x201D; <source>Expert Syst.</source>, vol. <volume>39</volume>, no. <issue>1</issue>, pp. <fpage>e12839</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1111/exsy.12839</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhalehpour</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Onder</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Akhtar</surname></string-name>, and <string-name><given-names>C. E.</given-names> <surname>Erdem</surname></string-name></person-group>, &#x201C;<article-title>BAUM-1: A spontaneous audio-visual face database of affective and mental states</article-title>,&#x201D; <source>IEEE Trans. Affect. Comput.</source>, vol. <volume>8</volume>, no. <issue>3</issue>, pp. <fpage>300</fpage>&#x2013;<lpage>313</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.1109/TAFFC.2016.2553038</pub-id>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Abtahi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Omidyeganeh</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Shirmohammadi</surname></string-name>, and <string-name><given-names>B.</given-names> <surname>Hariri</surname></string-name></person-group>, &#x201C;<article-title>YawDD: A yawning detection dataset</article-title>,&#x201D; in <conf-name>Proc. 5th ACM Multimed. Syst. Conf.</conf-name>, <year>2014</year>, pp. <fpage>24</fpage>&#x2013;<lpage>28</lpage>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Sharma</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Student engagement detection using emotion analysis, eye tracking and head movement with machine learning</article-title>,&#x201D; in <conf-name>Int. Conf. Technol. Innov. Learn., Teach. Edu.</conf-name>, <publisher-name>Springer</publisher-name>, <year>2022</year>, pp. <fpage>52</fpage>&#x2013;<lpage>68</lpage>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ikram</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Recognition of student engagement state in a classroom environment using deep and efficient transfer learning algorithm</article-title>,&#x201D; <source>Appl. Sci.</source>, vol. <volume>13</volume>, no. <issue>15</issue>, pp. <fpage>8637</fpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.3390/app13158637</pub-id>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Tan</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>EfficientNetV2: Smaller models and faster training</article-title>,&#x201D; in <conf-name>Int. Conf. Mach. Learn.</conf-name>, <publisher-name>PMLR</publisher-name>, <year>2021</year>, pp. <fpage>10096</fpage>&#x2013;<lpage>10106</lpage>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Tan</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Le</surname></string-name></person-group>, &#x201C;<article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>,&#x201D; in <conf-name>Int. Conf. Mach. Learn.</conf-name>, <publisher-name>PMLR</publisher-name>, <year>2019</year>, pp. <fpage>6105</fpage>&#x2013;<lpage>6114</lpage>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>van der Maaten</surname></string-name>, and <string-name><given-names>K. Q.</given-names> <surname>Weinberger</surname></string-name></person-group>, &#x201C;<article-title>Densely connected convolutional networks</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2017</year>, pp. <fpage>4700</fpage>&#x2013;<lpage>4708</lpage>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Sandler</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Howard</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Zhmoginov</surname></string-name>, and <string-name><given-names>L. C.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>MobileNetV2: Inverted residuals and linear bottlenecks</article-title>,&#x201D; in <conf-name>Proc. IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <year>2018</year>, pp. <fpage>4510</fpage>&#x2013;<lpage>4520</lpage>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Gupta</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Akin</surname></string-name></person-group>, &#x201C;<article-title>Accelerator-aware neural network design using AutoML</article-title>,&#x201D; <comment>arXiv preprint arXiv:2003.02838</comment>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Deng</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name>, <string-name><given-names>L. J.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Fei-Fei</surname></string-name></person-group>, &#x201C;<article-title>ImageNet: A large-scale hierarchical image database</article-title>,&#x201D; in <conf-name>IEEE Conf. Comput. Vis. Pattern Recogn.</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2009</year>, pp. <fpage>248</fpage>&#x2013;<lpage>255</lpage>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>Long short-term memory</article-title>,&#x201D; <source>Neural Comput.</source>, vol. <volume>9</volume>, no. <issue>8</issue>, pp. <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>, <year>1997</year>. doi: <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>; <pub-id pub-id-type="pmid">9377276</pub-id></mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>F. M.</given-names> <surname>Shiri</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Perumal</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mustapha</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Mohamed</surname></string-name></person-group>, &#x201C;<article-title>A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU</article-title>,&#x201D; <comment>arXiv preprint arXiv:2305.17473</comment>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-59"><label>[59]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Barot</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Kapadia</surname></string-name></person-group>, &#x201C;<article-title>Long short term memory neural network-based model construction and fine-tuning for air quality parameters prediction</article-title>,&#x201D; <source>Cybernet. Inf. Technol.</source>, vol. <volume>22</volume>, no. <issue>1</issue>, pp. <fpage>171</fpage>&#x2013;<lpage>189</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.2478/cait-2022-0011</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>[60]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Minaee</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Azimi</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Abdolrashidi</surname></string-name></person-group>, &#x201C;<article-title>Deep-sentiment: Sentiment analysis using ensemble of CNN and Bi-LSTM models</article-title>,&#x201D; <comment>arXiv preprint arXiv:1904.04206</comment>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-61"><label>[61]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Fang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, and <string-name><given-names>Q.</given-names> <surname>Xue</surname></string-name></person-group>, &#x201C;<article-title>Survey on research of RNN-based spatio-temporal sequence prediction algorithms</article-title>,&#x201D; <source>J. Big Data</source>, vol. <volume>3</volume>, no. <issue>3</issue>, pp. <fpage>97</fpage>&#x2013;<lpage>110</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.32604/jbd.2021.016993</pub-id>.</mixed-citation></ref>
<ref id="ref-62"><label>[62]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Yang</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Yan</surname></string-name></person-group>, &#x201C;<article-title>Robust LSTM-autoencoders for face de-occlusion in the wild</article-title>,&#x201D; <source>IEEE Trans. Image Process.</source>, vol. <volume>27</volume>, no. <issue>2</issue>, pp. <fpage>778</fpage>&#x2013;<lpage>790</lpage>, <year>2017</year>. doi: <pub-id pub-id-type="doi">10.1109/TIP.2017.2771408</pub-id>; <pub-id pub-id-type="pmid">29757731</pub-id></mixed-citation></ref>
<ref id="ref-63"><label>[63]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Liwicki</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Fern&#x00E1;ndez</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Bertolami</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Bunke</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>A novel connectionist system for unconstrained handwriting recognition</article-title>,&#x201D; <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, vol. <volume>31</volume>, no. <issue>5</issue>, pp. <fpage>855</fpage>&#x2013;<lpage>868</lpage>, <year>2008</year>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2008.137</pub-id>; <pub-id pub-id-type="pmid">19299860</pub-id></mixed-citation></ref>
<ref id="ref-64"><label>[64]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T. H.</given-names> <surname>Aldhyani</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Alkahtani</surname></string-name></person-group>, &#x201C;<article-title>A bidirectional long short-term memory model algorithm for predicting COVID-19 in gulf countries</article-title>,&#x201D; <source>Life</source>, vol. <volume>11</volume>, no. <issue>11</issue>, pp. <fpage>1118</fpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.3390/life11111118</pub-id>; <pub-id pub-id-type="pmid">34832994</pub-id></mixed-citation></ref>
<ref id="ref-65"><label>[65]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>F. M.</given-names> <surname>Shiri</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Perumal</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mustapha</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>M. A. B.</given-names> <surname>Ahmadon</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Yamaguchi</surname></string-name></person-group>, &#x201C;<article-title>A survey on multi-resident activity recognition in smart environments</article-title>,&#x201D; <comment>arXiv preprint arXiv:2304.12304</comment>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-66"><label>[66]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Dutta</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kumar</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Basu</surname></string-name></person-group>, &#x201C;<article-title>A gated recurrent unit approach to bitcoin price prediction</article-title>,&#x201D; <source>J. Risk Finan. Manag.</source>, vol. <volume>13</volume>, no. <issue>2</issue>, pp. <fpage>23</fpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.3390/jrfm13020023</pub-id>.</mixed-citation></ref>
<ref id="ref-67"><label>[67]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>,&#x201D; <comment>arXiv preprint arXiv:1412.3555</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-68"><label>[68]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Bahdanau</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>, and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Neural machine translation by jointly learning to align and translate</article-title>,&#x201D; <comment>arXiv preprint arXiv:1409.0473</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-69"><label>[69]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hu</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>The bidirectional gate recurrent unit based attention mechanism network for state of charge estimation</article-title>,&#x201D; <source>J. Electrochem. Soc.</source>, vol. <volume>169</volume>, no. <issue>11</issue>, pp. <fpage>110503</fpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-70"><label>[70]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Kumar</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Srivastava</surname></string-name></person-group>, &#x201C;<article-title>Image authentication by assessing manipulations using illumination</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>78</volume>, no. <issue>9</issue>, pp. <fpage>12451</fpage>&#x2013;<lpage>12463</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1007/s11042-018-6775-x</pub-id>.</mixed-citation></ref>
<ref id="ref-71"><label>[71]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Madhu</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kautish</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Nagachandrika</surname></string-name>, <string-name><given-names>S. M.</given-names> <surname>Biju</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Kumar</surname></string-name></person-group>, &#x201C;<article-title>XCovNet: An optimized Xception convolutional neural network for classification of COVID-19 from point-of-care lung ultrasound images</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>83</volume>, no. <issue>11</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>22</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s11042-023-16944-z</pub-id>.</mixed-citation></ref>
<ref id="ref-72"><label>[72]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Geng</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wei</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name></person-group>, &#x201C;<article-title>Learning deep spatiotemporal feature for engagement recognition of online courses</article-title>,&#x201D; in <conf-name>2019 IEEE Symp. Series Comput. Intell. (SSCI)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2019</year>, pp. <fpage>442</fpage>&#x2013;<lpage>447</lpage>.</mixed-citation></ref>
<ref id="ref-73"><label>[73]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Mei</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, and <string-name><given-names>H.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Fine-grained engagement recognition in online learning environment</article-title>,&#x201D; in <conf-name>2019 IEEE 9th Int. Conf. Electr. Inf. Emerg. Commun. (ICEIEC)</conf-name>, <publisher-name>IEEE</publisher-name>, <year>2019</year>, pp. <fpage>338</fpage>&#x2013;<lpage>341</lpage>.</mixed-citation></ref>
<ref id="ref-74"><label>[74]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Dong</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Automatic student engagement in online learning environment based on neural turing machine</article-title>,&#x201D; <source>Int. J. Inf. Edu. Technol.</source>, vol. <volume>11</volume>, no. <issue>3</issue>, pp. <fpage>107</fpage>&#x2013;<lpage>111</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.18178/ijiet.2021.11.3.1497</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>