<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">IASC</journal-id>
<journal-id journal-id-type="nlm-ta">IASC</journal-id>
<journal-id journal-id-type="publisher-id">IASC</journal-id>
<journal-title-group>
<journal-title>Intelligent Automation &#x0026; Soft Computing</journal-title>
</journal-title-group>
<issn pub-type="epub">2326-005X</issn>
<issn pub-type="ppub">1079-8587</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">33082</article-id>
<article-id pub-id-type="doi">10.32604/iasc.2023.033082</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN</article-title><alt-title alt-title-type="left-running-head">Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN</alt-title><alt-title alt-title-type="right-running-head">Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Shin</surname><given-names>Hyun-Sam</given-names></name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Hong</surname><given-names>Jun-Ki</given-names></name>
<xref ref-type="aff" rid="aff-2">2</xref><email>jkhong@pcu.ac.kr</email>
</contrib>
<aff id="aff-1"><label>1</label><institution>Division of Software Convergence, Hanshin University</institution>, <addr-line>Osan-si, 18101</addr-line>, <country>Korea</country></aff>
<aff id="aff-2"><label>2</label><institution>Division of AI Software Engineering, Pai Chai University</institution>, <addr-line>Daejeon, 35345</addr-line>, <country>Korea</country></aff>
</contrib-group><author-notes><corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jun-Ki Hong. Email: <email>jkhong@pcu.ac.kr</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-24"><day>24</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>36</volume>
<issue>1</issue>
<fpage>235</fpage>
<lpage>248</lpage>
<history>
<date date-type="received"><day>07</day><month>6</month><year>2022</year></date>
<date date-type="accepted"><day>12</day><month>7</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Shin and Hong</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Shin and Hong</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_IASC_33082.pdf"></self-uri>
<abstract>
<p>Recently, artificial-intelligence-based automatic customer response systems have been widely used in place of customer service representatives. It is therefore important for an automatic customer service system to promptly recognize emotions in a customer&#x2019;s voice and provide the appropriate service accordingly. Accordingly, we analyzed the emotion recognition (ER) accuracy as a function of simulation time using the proposed chunk-based speech ER (CSER) model. The proposed CSER model divides voice signals into 3-s-long chunks to efficiently recognize the emotions inherent in the customer&#x2019;s voice. We evaluated the ER performance on voice signal chunks by applying four RNN techniques&#x2014;long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), gated recurrent units (GRU), and bidirectional GRU (Bi-GRU)&#x2014;to the proposed CSER model individually and assessing the ER accuracy and time efficiency of each. The results reveal that GRU exhibits the best time efficiency in recognizing emotions from speech signals in terms of accuracy as a function of simulation time.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>RNN</kwd>
<kwd>speech emotion recognition</kwd>
<kwd>attention mechanism</kwd>
<kwd>time efficiency</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Artificial intelligence speakers based on voice recognition are widely used for customer reception services. Robots and other devices assist humans in most industries, typically through simple voice-recognition and display interfaces. Moreover, various studies are being conducted to improve the emotion recognition (ER) rate from voice signals to understand the exact intention of customers. ER technology analyzes the emotional states of humans by collecting and analyzing information from their voices or gestures. Emotional states determined from voice signals, however, tend to be more accurate than those determined from gestures, because gestural expressions of emotion vary across cultures. Recently, research has been conducted on analyzing emotions using various deep learning techniques, such as the artificial neural network (ANN), convolutional neural network (CNN), and recurrent neural network (RNN), and on recognizing human emotions by extracting and analyzing the characteristics of voice signals in various ways. ANN, CNN, and RNN have been employed in various fields, such as imaging [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-5">5</xref>], relation extraction [<xref ref-type="bibr" rid="ref-6">6</xref>], natural language processing [<xref ref-type="bibr" rid="ref-7">7</xref>], and speech emotion recognition (SER) [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<p>SER using RNN-based long short-term memory (LSTM) and gated recurrent unit (GRU) techniques has demonstrated improved performance in various studies because both techniques model the time-series properties of the audio signal during training. However, ER accuracy has not yet been analyzed with respect to simulation time. Therefore, in this study, we propose a chunk-based SER (CSER) model that divides voice signals into 3-s-long chunks and recognizes the emotion in each chunk using the LSTM, bidirectional LSTM (Bi-LSTM), GRU, and bidirectional GRU (Bi-GRU) techniques, and we analyze the SER accuracy of the four RNN techniques with respect to simulation time.</p>
<p>The remainder of this paper is organized as follows. Section 2 presents the related literature, and Section 3 describes the proposed CSER model. Section 4 presents the performance analysis of the SER accuracy and time efficiency of LSTM, Bi-LSTM, GRU, and Bi-GRU with the proposed CSER model. Finally, Section 5 presents the conclusions and further research directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<p>Many studies have been conducted to improve SER performance by combining the advantages of RNNs and attention mechanisms (AMs). The AM is a technique inspired by the way humans focus on characteristic parts of an object, rather than processing all available information including the background, when recognizing it [<xref ref-type="bibr" rid="ref-9">9</xref>]. The AM was initially used in neural network (NN)-based image processing to analyze images effectively by assigning larger weights to the parts containing relatively important information. It has since been applied in the SER field to improve ER performance [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-20">20</xref>]. <xref ref-type="table" rid="table-1">Tab. 1</xref> lists previous SER studies that use RNNs with the AM, together with their methods.</p>
<table-wrap id="table-1"><label>Table 1</label>
<caption>
<title>SER literature using RNN with AM</title></caption>
<table><colgroup><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Studies</th>
<th align="left">Methods</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Trigeorgis et al., 2016 [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td align="left">CNN and bidirectional LSTM (Bi-LSTM)</td>
</tr>
<tr>
<td align="left">Huang et al., 2016 [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td align="left">Bi-LSTM</td>
</tr>
<tr>
<td align="left">Mirsamadi et al., 2017 [<xref ref-type="bibr" rid="ref-12">12</xref>]</td>
<td align="left">Bi-LSTM</td>
</tr>
<tr>
<td align="left">Tao et al., 2018 [<xref ref-type="bibr" rid="ref-13">13</xref>]</td>
<td align="left">LSTM</td>
</tr>
<tr>
<td align="left">Sarma et al., 2018 [<xref ref-type="bibr" rid="ref-14">14</xref>]</td>
<td align="left">LSTM</td>
</tr>
<tr>
<td align="left">Chen et al., 2018 [<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td align="left">Convolutional RNN (CRNN)</td>
</tr>
<tr>
<td align="left">Zhao et al., 2018 [<xref ref-type="bibr" rid="ref-16">16</xref>]</td>
<td align="left">Bi-LSTM</td>
</tr>
<tr>
<td align="left">Xie et al., 2019 [<xref ref-type="bibr" rid="ref-17">17</xref>]</td>
<td align="left">LSTM</td>
</tr>
<tr>
<td align="left">Xie et al., 2019 [<xref ref-type="bibr" rid="ref-18">18</xref>]</td>
<td align="left">LSTM</td>
</tr>
<tr>
<td align="left">Li et al., 2019 [<xref ref-type="bibr" rid="ref-19">19</xref>]</td>
<td align="left">CNN and Bi-LSTM</td>
</tr>
<tr>
<td align="left">Zheng et al., 2020 [<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td align="left">CNN, GRU and Bi-LSTM</td>
</tr>
<tr>
<td align="left">Present study</td>
<td align="left">LSTM, Bi-LSTM, GRU, and bidirectional-GRU (Bi-GRU)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Studies listed in <xref ref-type="table" rid="table-1">Tab. 1</xref> analyzed emotions using the interactive emotional dyadic motion capture (IEMOCAP) dataset [<xref ref-type="bibr" rid="ref-21">21</xref>]. As shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>, SER studies using LSTM, Bi-LSTM, and GRU have been conducted [<xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>]. Furthermore, studies combining RNN and CNN were conducted to improve the SER accuracy [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-19">19</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. However, the SER accuracy of the four RNN techniques&#x2014;namely LSTM, Bi-LSTM, GRU, and Bi-GRU&#x2014;has not yet been analyzed as a function of simulation time.</p>

</sec>
<sec id="s3">
<label>3</label>
<title>Proposed CSER Model</title>
<p>In this section, we describe the overall structure of the proposed CSER model; the LSTM, Bi-LSTM, GRU, and Bi-GRU techniques used to evaluate its performance; and the feature extraction and AM of the proposed CSER model. The proposed CSER model divides the received voice signal into 3-s-long chunks, recognizes an emotion in each chunk using RNN techniques, and combines the chunk-level results with hard and soft voting to determine the final emotion. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows the flowchart of the proposed CSER model.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Flowchart of the proposed chunk-based SER (CSER) model</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-1.png"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, feature extraction is performed on each chunk to evaluate the ER accuracy and simulation speed of the proposed CSER model. Four RNN techniques (LSTM, Bi-LSTM, GRU, and Bi-GRU) are used to recognize emotions from each chunked audio file. Subsequently, the AM is used to calculate the importance of each part of the voice signal according to the context of the speech and to weight it accordingly. Finally, the emotions recognized in the chunks are subjected to hard and soft voting to determine the final emotion. The hard voting classifier in the proposed CSER model predicts the emotional state with the largest number of votes from the chunks, whereas the soft voting classifier predicts the emotion with the largest summed probability over the chunks.</p>
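<p>As an illustration of the voting stage, the following sketch (not the authors' code) applies hard and soft voting to hypothetical per-chunk emotion probabilities; the label set and numbers are invented for the example.</p>

```python
# Illustrative sketch (not the authors' code) of the hard/soft voting stage of
# the CSER model. The emotion label set and probabilities are hypothetical.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed label set

def hard_vote(chunk_probs):
    # each chunk casts one vote for its most probable emotion;
    # ties are broken by the lowest class index (np.argmax picks the first max)
    votes = np.argmax(chunk_probs, axis=1)
    counts = np.bincount(votes, minlength=len(EMOTIONS))
    return EMOTIONS[int(np.argmax(counts))]

def soft_vote(chunk_probs):
    # sum the per-chunk probabilities and pick the emotion with the largest total
    return EMOTIONS[int(np.argmax(np.sum(chunk_probs, axis=0)))]

# three 3-s chunks, one probability vector per chunk
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.5, 0.3, 0.1, 0.1],
                  [0.4, 0.3, 0.2, 0.1]])
print(hard_vote(probs), soft_vote(probs))
```

<p>Note that the two rules can disagree: here two of the three chunks vote for the first class, but the summed probability mass favors the second.</p>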

<p>The detailed descriptions of the chunking process of audio signals, AM, LSTM, Bi-LSTM, GRU, and Bi-GRU techniques used in the proposed CSER model are presented in the following subsections.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Chunks of Audio Signals</title>
<p>In the proposed CSER model, an entire voice signal is divided into 3-s-long chunks to accurately recognize emotions from all input audio signals. The minimum number of chunks of an entire voice signal can be expressed as<disp-formula id="eqn-1"><label>(1)</label>
<mml:math id="mml-eqn-1" display="block"><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mi>l</mml:mi><mml:mi>c</mml:mi></mml:mfrac></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:mstyle></mml:math>
</disp-formula>where <italic>n</italic> is the number of chunks, <italic>c</italic> is the chunk size, and <italic>l</italic> is the audio length. After obtaining the number of chunks, the size <italic>h</italic> of the overlap can be calculated using <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> to ensure that the chunks overlap at regular intervals.<disp-formula id="eqn-2"><label>(2)</label>
<mml:math id="mml-eqn-2" display="block"><mml:mi>h</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>c</mml:mi><mml:mspace width="thickmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mspace width="thickmathspace" /><mml:mo>&#x2212;</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
</disp-formula></p>
<p>After dividing audio signals of different lengths into chunks of the same length, acoustic feature values are extracted from each chunk using the openSMILE toolkit [<xref ref-type="bibr" rid="ref-22">22</xref>]. The feature extraction is described in the following section. Low-level descriptors (LLDs) are extracted from short frames of 20&#x2013;50 ms in each chunk, and statistical functions are applied to the extracted LLDs to calculate high-level statistical functions, which serve as the features of pronunciation units processed by the RNN techniques.</p>
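<p>The chunking arithmetic of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' code: the 10-s utterance length is an invented example, and the division in Eq. (1) is read as floor division so that the number of chunks is a whole number.</p>

```python
# Sketch of the chunk-count and overlap arithmetic of Eqs. (1) and (2):
# an audio signal of length l is covered by n overlapping chunks of size c
# (3 s in the paper), with overlap h chosen so the chunks are evenly spaced.
import math

def chunk_layout(l, c=3.0):
    n = math.floor(l / c) + 1                      # Eq. (1): minimum number of chunks
    h = (n * c - l) / (n - 1) if n > 1 else 0.0    # Eq. (2): overlap size
    return n, h

def chunk_starts(l, c=3.0):
    # consecutive chunks advance by c - h, so n chunks exactly cover [0, l]
    n, h = chunk_layout(l, c)
    step = c - h
    return [i * step for i in range(n)]

n, h = chunk_layout(10.0)   # a 10-s utterance with 3-s chunks
print(n, h)                 # 4 chunks overlapping by (4*3 - 10)/3 = 2/3 s
```

<p>With these values, the last chunk starts at 7 s and ends exactly at the 10-s mark, so no part of the signal is left uncovered.</p>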
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Feature Extraction</title>
<p>In the proposed CSER model, feature extraction is performed to extract and recognize voice features from each chunk. Zero-crossing rate (ZCR), root mean square (RMS), Mel vector, chroma, Mel-frequency cepstral coefficient (MFCC), and spectral features are extracted from each chunk for its ER.</p>
<p>ZCR is the rate at which the sign of the amplitude of each chunk changes, i.e., the rate at which the signal value crosses zero. It is used as the most primitive pitch-detection feature.<disp-formula id="eqn-3"><label>(3)</label>
<mml:math id="mml-eqn-3" display="block"><mml:msub><mml:mi>Z</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>s</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mstyle></mml:math>
</disp-formula></p>
<p><inline-formula id="ieqn-1">
<mml:math id="mml-ieqn-1"><mml:mi>s</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math>
</inline-formula> is the sign function given by<disp-formula id="eqn-4"><label>(4)</label>
<mml:math id="mml-eqn-4" display="block"><mml:mi>s</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mn>1</mml:mn><mml:mspace width="thickmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mspace width="thickmathspace" /><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mspace width="thickmathspace" /><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mspace width="thickmathspace" /><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula>where <italic>x</italic>, <italic>N,</italic> and <italic>i</italic> are the amplitude, length, and index of the frame, respectively.</p>
<p>RMS is obtained by calculating the energy of each chunk frame (i.e., the sound intensity of each frame). It is one of the most basic features used for ER and is given by<disp-formula id="eqn-5"><label>(5)</label>
<mml:math id="mml-eqn-5" display="block"><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:msqrt><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mo>&#x2061;</mml:mo><mml:mi>y</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mi>N</mml:mi></mml:mfrac></mml:mrow></mml:mstyle></mml:msqrt><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula>where <inline-formula id="ieqn-2">
<mml:math id="mml-ieqn-2"><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math>
</inline-formula> is the signal amplitude of the <italic>i</italic>-th chunk.</p>
<p>The Mel-scale spectrogram is a feature vector of the energy (dB) of each chunk over time and frequency and is often used as a basic feature in ER. The chroma feature, computed via the short-time Fourier transform, is a feature vector representing the change of the 12 distinct pitch classes over time, obtained by extracting the energy of each pitch class from the chunks. MFCC is a feature that represents the unique timbral characteristics of a sound and can be extracted from the audio signal. Spectral features are statistical values computed in the frequency domain; the frequency-band spectrum is used as an additional statistical feature for ER.</p>
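<p>The two closed-form features above, ZCR and RMS, can be sketched in NumPy as follows. This is a minimal illustration on a synthetic 5-Hz tone (an invented example); in the paper, the full feature set is extracted with the openSMILE toolkit.</p>

```python
# Sketch of the ZCR (Eqs. 3-4) and RMS (Eq. 5) computations in NumPy.
# The 5 Hz test tone is an invented example, not data from the paper.
import numpy as np

def zcr(frame):
    s = np.where(frame >= 0, 1.0, -1.0)              # sign function, Eq. (4)
    # each zero crossing contributes |1 - (-1)| = 2 to the sum, Eq. (3)
    return np.sum(np.abs(s[1:] - s[:-1])) / (2 * len(frame))

def rms(frame):
    return np.sqrt(np.mean(frame ** 2))              # Eq. (5)

t = np.linspace(0, 1, 1000, endpoint=False)          # 1-s frame at 1 kHz
frame = np.sin(2 * np.pi * 5 * t + 0.1)              # 5 Hz tone, small phase offset
print(zcr(frame))   # 10 zero crossings over 1000 samples -> 0.01
print(rms(frame))   # RMS of a sine -> 1/sqrt(2)
```

<p>A higher-pitched or noisier chunk yields a larger ZCR, while a louder chunk yields a larger RMS, which is why both serve as primitive emotion cues.</p>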
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Recurrent Neural Network (RNN) Techniques</title>
<p>In this section, the LSTM, Bi-LSTM, GRU, and Bi-GRU techniques and self-AM for feature extraction used in the proposed CSER model are described.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Long Short-Term Memory (LSTM)</title>
<p>LSTM is a technique that addresses a shortcoming of conventional RNNs: when training continues over long sequences, information learned early is forgotten [<xref ref-type="bibr" rid="ref-23">23</xref>]. The flow of values is regulated by attaching cells called gates to the input, forget, and output layers of an RNN. The input gate decides whether to store new information, the forget gate decides whether to retain previous state information, and the output gate controls the output value of the updated cell. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows the structure of LSTM.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Structure of long short-term memory (LSTM)</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-2.png"/>
</fig>
<p>In <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, <inline-formula id="ieqn-3">
<mml:math id="mml-ieqn-3"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> and <inline-formula id="ieqn-4">
<mml:math id="mml-ieqn-4"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> represent the input and hidden states at time <italic>t</italic>, respectively. Moreover, <italic>i</italic>, <italic>f,</italic> and <italic>o</italic> denote the input, forget, and output gates, respectively. First, LSTM uses a sigmoid function to determine the information to be eliminated. Then, it uses another sigmoid function and a <italic>tanh</italic> function to determine if new information should be stored in the cell state. The cell state is updated in the third step, and the output value is determined using the final sigmoid and <italic>tanh</italic> functions through which the output from the cell state is passed.</p>

<p>The LSTM used in this study consists of units connected sequentially through time. At each step, the LSTM receives the hidden and cell states of the previous time step together with the input value of the current step, performs computations through its gates, updates the hidden and cell states, and transmits them to the next time step. Forget gate <inline-formula id="ieqn-5">
<mml:math id="mml-ieqn-5"><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> decides what information needs to be removed from the LSTM memory. The forget gate can be expressed as<disp-formula id="eqn-6"><label>(6)</label>
<mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula>where <italic>W</italic> is the weight matrix, and <italic>b</italic> is the bias vector, which are used to connect the input layer, memory block, and output layer. The forget gate applies a sigmoid function on the previous hidden state <inline-formula id="ieqn-6">
<mml:math id="mml-ieqn-6"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
</inline-formula> and current input value <inline-formula id="ieqn-7">
<mml:math id="mml-ieqn-7"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>. An output value of zero indicates that the value is completely discarded, whereas a value of one indicates that it is completely retained.</p>
<p>Input gate <inline-formula id="ieqn-8">
<mml:math id="mml-ieqn-8"><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> decides whether the new information is to be added to the LSTM memory. This gate consists of a sigmoid layer and a <italic>tanh</italic> layer: the sigmoid layer determines which values need to be updated, and the <italic>tanh</italic> layer creates a vector of new candidate values to be added to the cell state.</p>
<p><disp-formula id="eqn-7"><label>(7)</label>
<mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-8"><label>(8)</label>
<mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>C</mml:mi></mml:msub><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-9">
<mml:math id="mml-ieqn-9"><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> decides whether the value should be updated, and <inline-formula id="ieqn-10">
<mml:math id="mml-ieqn-10"><mml:msub><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> is the vector of new candidate values to be added to the cell state.</p>
<p>Subsequently, the previous cell state <inline-formula id="ieqn-11">
<mml:math id="mml-ieqn-11"><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
</inline-formula> is updated to the current cell state <inline-formula id="ieqn-12">
<mml:math id="mml-ieqn-12"><mml:msub><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>, which can be expressed as<disp-formula id="eqn-9"><label>(9)</label>
<mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mrow><mml:mtext>&#xA0;</mml:mtext><mml:mo>&#x2217;</mml:mo><mml:mtext>&#xA0;</mml:mtext></mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mi>c</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula>where <inline-formula id="ieqn-13">
<mml:math id="mml-ieqn-13"><mml:msub><mml:mi>f</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> is the output of the forget gate, a value between 0 and 1. Finally, the hidden state is obtained by multiplying the <italic>tanh</italic> of the cell state by the output of a sigmoid layer. Output gate <inline-formula id="ieqn-14">
<mml:math id="mml-ieqn-14"><mml:msub><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> is calculated by applying a sigmoid layer to <inline-formula id="ieqn-15">
<mml:math id="mml-ieqn-15"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
</inline-formula> and <inline-formula id="ieqn-16">
<mml:math id="mml-ieqn-16"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>.</p>
<p><disp-formula id="eqn-10"><label>(10)</label>
<mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>o</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>o</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-11"><label>(11)</label>
<mml:math id="mml-eqn-11" display="block"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula>where <inline-formula id="ieqn-17">
<mml:math id="mml-ieqn-17"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> is a value between &#x2212;1 and 1. In this study, <inline-formula id="ieqn-18">
<mml:math id="mml-ieqn-18"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> is the input chunked speech signal data. The input time-series speech signal data are expressed as <inline-formula id="ieqn-19">
<mml:math id="mml-ieqn-19"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math>
</inline-formula>, and the hidden state of memory cells is denoted by <inline-formula id="ieqn-20">
<mml:math id="mml-ieqn-20"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>h</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>h</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</inline-formula> where <italic>N</italic> is the number of chunked speech data values.</p>
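<p>The output-gate and hidden-state updates of Eqs. (10) and (11) can be sketched in NumPy. This is a minimal illustration under standard LSTM assumptions, not the implementation used in this study; the input-gate, forget-gate, and candidate cell-state updates that precede Eq. (10) are included only so the step runs end to end.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b hold the four gate parameters keyed
    'i' (input), 'f' (forget), 'g' (candidate), 'o' (output);
    each W[k] has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i = sigmoid(W['i'] @ z + b['i'])       # input gate
    f = sigmoid(W['f'] @ z + b['f'])       # forget gate
    g = np.tanh(W['g'] @ z + b['g'])       # candidate cell state
    c_t = f * c_prev + i * g               # new cell state
    o = sigmoid(W['o'] @ z + b['o'])       # output gate, Eq. (10)
    h_t = o * np.tanh(c_t)                 # hidden state, Eq. (11)
    return h_t, c_t

# Toy dimensions: 4-dim chunk features, 3 hidden units.
rng = np.random.default_rng(0)
hid, inp = 3, 4
W = {k: rng.standard_normal((hid, hid + inp)) * 0.1 for k in 'ifgo'}
b = {k: np.zeros(hid) for k in 'ifgo'}
h, c = np.zeros(hid), np.zeros(hid)
for x in rng.standard_normal((5, inp)):    # five chunked frames
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (3,)
```

<p>Because the output gate is bounded in (0, 1) and tanh in (&#x2212;1, 1), each component of the hidden state stays strictly between &#x2212;1 and 1, as noted above.</p>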
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Bidirectional-LSTM (Bi-LSTM)</title>
<p>Bi-LSTM is a modified LSTM [<xref ref-type="bibr" rid="ref-24">24</xref>] comprising two independent hidden layers. It first calculates the forward hidden sequence, then the reverse hidden sequence, and combines the outputs of the two layers to produce its output. Compared with LSTM, Bi-LSTM improves the context available to the algorithm by effectively increasing the amount of information available to the network.</p>
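<p>The forward/reverse wiring described above can be sketched as follows. A plain tanh recurrence stands in for the LSTM cell purely to keep the example short; the bidirectional combination (run forward, run reversed, re-align, concatenate per step) is what is being illustrated, and any recurrent cell drops in the same way.</p>

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    # Stand-in recurrent cell (plain tanh RNN); an LSTM cell
    # would be substituted here in a real Bi-LSTM.
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

def bidirectional(X, W_f, b_f, W_b, b_b, hid):
    """Compute the forward hidden sequence, then the reverse hidden
    sequence, and concatenate the two per time step."""
    fwd, bwd = [], []
    h = np.zeros(hid)
    for x in X:                      # forward pass
        h = rnn_step(x, h, W_f, b_f)
        fwd.append(h)
    h = np.zeros(hid)
    for x in X[::-1]:                # reverse pass
        h = rnn_step(x, h, W_b, b_b)
        bwd.append(h)
    bwd = bwd[::-1]                  # re-align with forward time order
    return np.stack([np.concatenate([f, r]) for f, r in zip(fwd, bwd)])

rng = np.random.default_rng(1)
hid, inp, T = 3, 4, 6
W_f, W_b = (rng.standard_normal((hid, hid + inp)) * 0.1 for _ in range(2))
b_f = b_b = np.zeros(hid)
H = bidirectional(rng.standard_normal((T, inp)), W_f, b_f, W_b, b_b, hid)
print(H.shape)  # (6, 6): T time steps, forward + backward states
```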
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Gated Recurrent Unit (GRU)</title>
<p>GRU is a variant of LSTM [<xref ref-type="bibr" rid="ref-25">25</xref>] that behaves similarly but with fewer parameters, which are learned through its gating mechanism. Its simpler internal structure makes it easier to train, because an update to its hidden state requires fewer computations, and, like LSTM, it mitigates the vanishing gradient problem of standard RNNs. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> illustrates the structure of GRU.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Structure of gated recurrent unit (GRU)</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-3.png"/>
</fig>
<p>LSTM has three gates (output, input, and forget), whereas GRU has only two (update and reset). As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the update gate, <inline-formula id="ieqn-21">
<mml:math id="mml-ieqn-21"><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>, determines which information is carried forward to the next state, and the reset gate, <inline-formula id="ieqn-22">
<mml:math id="mml-ieqn-22"><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>, determines how previous state information is combined with the new input information. The GRU updates can be expressed as follows:</p>

<p><disp-formula id="eqn-12"><label>(12)</label>
<mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>z</mml:mi></mml:msub><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-13"><label>(13)</label>
<mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p>The candidate hidden state, <inline-formula id="ieqn-23">
<mml:math id="mml-ieqn-23"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>, and the current hidden state, <inline-formula id="ieqn-24">
<mml:math id="mml-ieqn-24"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula>, can be written as follows:</p>
<p><disp-formula id="eqn-14"><label>(14)</label>
<mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">a</mml:mi><mml:mi mathvariant="normal">n</mml:mi><mml:mi mathvariant="normal">h</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mo>.</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-15"><label>(15)</label>
<mml:math id="mml-eqn-15" display="block"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi></mml:msub></mml:math>
</disp-formula>where the definitions of <inline-formula id="ieqn-25">
<mml:math id="mml-ieqn-25"><mml:mi>&#x03C3;</mml:mi></mml:math>
</inline-formula>, <inline-formula id="ieqn-26">
<mml:math id="mml-ieqn-26"><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</inline-formula> and <italic>h</italic> are the same as those in LSTM. <inline-formula id="ieqn-27">
<mml:math id="mml-ieqn-27"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> and <inline-formula id="ieqn-28">
<mml:math id="mml-ieqn-28"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math>
</inline-formula> denote the output of the current and previous states, respectively. <inline-formula id="ieqn-29">
<mml:math id="mml-ieqn-29"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math>
</inline-formula> denotes the speech data signal of the chunked voice signal. The input time-series chunked speech data are expressed as <inline-formula id="ieqn-30">
<mml:math id="mml-ieqn-30"><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math>
</inline-formula>, and the hidden state of memory cells is denoted by <inline-formula id="ieqn-31">
<mml:math id="mml-ieqn-31"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>h</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>h</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>h</mml:mi><mml:mi>N</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</inline-formula> where <italic>N</italic> is the number of chunked speech data values.</p>
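<p>Eqs. (12)&#x2013;(15) can be sketched directly in NumPy. This is an illustrative implementation of the equations as written (bias terms omitted, as above), not the code used in this study.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step implementing Eqs. (12)-(15)."""
    hx = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    z = sigmoid(W_z @ hx)                      # update gate, Eq. (12)
    r = sigmoid(W_r @ hx)                      # reset gate, Eq. (13)
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # Eq. (14)
    return (1.0 - z) * h_prev + z * h_cand     # Eq. (15)

# Toy dimensions: 4-dim chunk features, 3 hidden units.
rng = np.random.default_rng(0)
hid, inp = 3, 4
W_z, W_r, W = (rng.standard_normal((hid, hid + inp)) * 0.1 for _ in range(3))
h = np.zeros(hid)
for x in rng.standard_normal((5, inp)):        # five chunked frames
    h = gru_step(x, h, W_z, W_r, W)
print(h.shape)  # (3,)
```

<p>Eq. (15) makes the new hidden state a convex combination of the previous state and the candidate, so each component remains bounded in (&#x2212;1, 1).</p>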
</sec>
<sec id="s3_3_4">
<label>3.3.4</label>
<title>Bidirectional Gated Recurrent Unit (Bi-GRU)</title>
<p>Bi-GRU is an improved model that combines a bidirectional RNN with GRU [<xref ref-type="bibr" rid="ref-24">24</xref>]. Its structure is similar to that of Bi-LSTM, except for the recurrent unit. Both bidirectional networks can use forward and reverse information simultaneously, and as GRU is simpler than LSTM, Bi-GRU is likewise simpler than Bi-LSTM.</p>
</sec>
<sec id="s3_3_5">
<label>3.3.5</label>
<title>Attention Mechanisms (AM)</title>
<p>In this subsection, we describe the self-AM used in the proposed CSER model. AM was first proposed in the image processing field, and it helps a model learn by focusing on specific feature information. Conventional AM uses the state of the last hidden layer of the LSTM, or the implicit state of the LSTM&#x2019;s output, to fit the hidden state of the current input. In the field of SER, however, self-AM, which adaptively weights the current input voice signal, is more appropriate.</p>
<p>Some parts of a sequential speech signal carry more emotional information than others, which may consist largely of interference and noise. To focus on the emotional parts of a sequence, we learn the internal structure of the sequence using self-AM, enhancing specific feature information within sentences. Self-AM is thus an improved form of AM: it reduces dependence on external information and is better at capturing the internal correlations of the data or features.</p>
<p>In the case of audio sequences that generally represent human emotions, adjacent frames exhibit similar acoustic characteristics. The query (<italic>Q</italic>), key (<italic>K</italic>), and value (<italic>V</italic>) of the <italic>i</italic>-th element for the voice data sequence <italic>X</italic> can be expressed as follows:</p>
<p><disp-formula id="eqn-16"><label>(16)</label>
<mml:math id="mml-eqn-16" display="block"><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mi>q</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-17"><label>(17)</label>
<mml:math id="mml-eqn-17" display="block"><mml:msub><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mi>v</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p><disp-formula id="eqn-18"><label>(18)</label>
<mml:math id="mml-eqn-18" display="block"><mml:msub><mml:mi>K</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mi>k</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /></mml:math>
</disp-formula></p>
<p>where <inline-formula id="ieqn-32">
<mml:math id="mml-ieqn-32"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>
</inline-formula> is the <inline-formula id="ieqn-33">
<mml:math id="mml-ieqn-33"><mml:mi>i</mml:mi></mml:math>
</inline-formula>-th element of <italic>X</italic>. <inline-formula id="ieqn-34">
<mml:math id="mml-ieqn-34"><mml:msub><mml:mi>w</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:math>
</inline-formula>, <inline-formula id="ieqn-35">
<mml:math id="mml-ieqn-35"><mml:msub><mml:mi>w</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math>
</inline-formula>, and <inline-formula id="ieqn-36">
<mml:math id="mml-ieqn-36"><mml:msub><mml:mi>w</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math>
</inline-formula> are the linear projections that map the <inline-formula id="ieqn-37">
<mml:math id="mml-ieqn-37"><mml:mi>i</mml:mi></mml:math>
</inline-formula>-th element to the query, value, and key, respectively. The dimensions of <italic>Q</italic>, <italic>V</italic>, and <italic>K</italic> are <inline-formula id="ieqn-38">
<mml:math id="mml-ieqn-38"><mml:mn>1</mml:mn><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mspace width="thickmathspace" /></mml:mrow><mml:mi>h</mml:mi><mml:mi>p</mml:mi></mml:math>
</inline-formula>, where <inline-formula id="ieqn-39">
<mml:math id="mml-ieqn-39"><mml:mi>h</mml:mi><mml:mi>p</mml:mi></mml:math>
</inline-formula> is a hyperparameter. Finally, the AM is computed using matrix multiplication on a set of sequences.<disp-formula id="eqn-19"><label>(19)</label>
<mml:math id="mml-eqn-19" display="block"><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:msup><mml:mi>K</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>V</mml:mi><mml:mo>,</mml:mo></mml:math>
</disp-formula>where <italic>Q</italic>, <italic>K</italic>, and <italic>V</italic> denote sets of queries, keys, and values, respectively, and <inline-formula id="ieqn-40">
<mml:math id="mml-ieqn-40"><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math>
</inline-formula> is a scaling factor.</p>
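<p>A minimal NumPy sketch of the self-attention computation in Eqs. (16)&#x2013;(19) follows. It uses the common row-vector convention, softmax(QK^T/sqrt(d_k))V, which is the transpose-equivalent of Eq. (19); the projection matrices and dimensions here are illustrative assumptions, not the study's actual parameters.</p>

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence X of shape
    (N, d). Rows are chunk-level feature vectors; w_q, w_k, w_v are
    the linear projections of Eqs. (16)-(18)."""
    Q, K, V = X @ w_q, X @ w_k, X @ w_v        # (N, hp) each
    d_k = K.shape[-1]                          # scaling factor
    A = softmax(Q @ K.T / np.sqrt(d_k))        # (N, N) attention weights
    return A @ V                               # weighted values, Eq. (19)

rng = np.random.default_rng(2)
N, d, hp = 5, 8, 4                             # 5 elements, hp hyperparameter
X = rng.standard_normal((N, d))
w_q, w_k, w_v = (rng.standard_normal((d, hp)) for _ in range(3))
Z = self_attention(X, w_q, w_k, w_v)
print(Z.shape)  # (5, 4)
```

<p>Each row of the attention matrix sums to 1, so every output element is a convex combination of the value vectors, weighted by how strongly that element attends to the others.</p>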
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Simulation Results</title>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset</title>
<p>The dataset used in this study is the IEMOCAP database [<xref ref-type="bibr" rid="ref-21">21</xref>]. The database comprises five sessions recorded by 10 actors in mixed-gender pairs, who improvised or performed scripted emotional scenarios; the recordings were then labeled by evaluators. <xref ref-type="table" rid="table-2">Tab. 2</xref> summarizes the IEMOCAP database.</p>
<table-wrap id="table-2"><label>Table 2</label>
<caption>
<title>IEMOCAP database</title></caption>
<table><colgroup><col align="left"/><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Label</th>
<th align="left">Emotion</th>
<th align="left">No. of data</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">ang</td>
<td align="left">anger</td>
<td align="left">1,103</td>
</tr>
<tr>
<td align="left">hap</td>
<td align="left">happiness</td>
<td align="left">595</td>
</tr>
<tr>
<td align="left">exc</td>
<td align="left">excitement</td>
<td align="left">1,041</td>
</tr>
<tr>
<td align="left">sad</td>
<td align="left">sadness</td>
<td align="left">1,084</td>
</tr>
<tr>
<td align="left">fru</td>
<td align="left">frustration</td>
<td align="left">1,849</td>
</tr>
<tr>
<td align="left">fea</td>
<td align="left">fear</td>
<td align="left">40</td>
</tr>
<tr>
<td align="left">sur</td>
<td align="left">surprise</td>
<td align="left">107</td>
</tr>
<tr>
<td align="left">neu</td>
<td align="left">neutral</td>
<td align="left">1,708</td>
</tr>
<tr>
<td align="left">dis</td>
<td align="left">disgust</td>
<td align="left">2</td>
</tr>
<tr>
<td align="left">xxx</td>
<td align="left">unknown</td>
<td align="left">2,507</td>
</tr>
<tr>
<td align="left">oth</td>
<td align="left">others</td>
<td align="left">3</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>, the database comprises 10,039 voice files labeled with 11 emotion categories, including anger, happiness, sadness, neutrality, excitement, fear, surprise, and disgust. Herein, the performance of the CSER model is analyzed using the voice files labeled anger, happiness, neutral, and sadness.</p>

</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Performance Analysis of Hard Voting</title>
<p>In this subsection, we apply the LSTM, Bi-LSTM, GRU, and Bi-GRU techniques to the proposed CSER model and compare the accuracy and simulation time of the ER results obtained through hard voting. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows the confusion matrices of hard voting applied to the CSER model.</p>
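<p>Hard voting over the per-chunk predictions can be sketched as a simple majority vote. The label strings follow Tab. 2; the per-chunk predictions themselves are hypothetical.</p>

```python
from collections import Counter

def hard_vote(chunk_labels):
    """Majority vote over the emotion predicted for each 3-s chunk.
    Ties are broken by first occurrence, since Counter preserves
    insertion order for equal counts."""
    return Counter(chunk_labels).most_common(1)[0][0]

# Hypothetical per-chunk predictions for one utterance.
print(hard_vote(["ang", "ang", "neu", "ang", "sad"]))  # ang
```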
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Confusion matrices of the proposed CSER model with hard voting: (a) LSTM, (b) Bi-LSTM, (c) GRU, and (d) Bi-GRU</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-4.png"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, the four emotions are recognized from the voice signals with high probability when LSTM, Bi-LSTM, GRU, and Bi-GRU are applied to the proposed CSER model, and the prediction accuracy for distinct emotions such as anger is particularly high. For all four RNN techniques, the diagonal entries of the confusion matrices have the highest probabilities for every emotion.</p>

<p><xref ref-type="fig" rid="fig-5">Figs. 5a</xref> and <xref ref-type="fig" rid="fig-5">5b</xref> show the accuracy and simulation time, respectively, for the four RNN techniques applied to the proposed CSER model. The final ER is evaluated using hard voting. Based on <xref ref-type="fig" rid="fig-5">Fig. 5a</xref>, Bi-LSTM has the highest accuracy (63.97&#x0025;), whereas LSTM has the lowest (60.05&#x0025;). <xref ref-type="table" rid="table-3">Tab. 3</xref> summarizes the accuracy and simulation time of LSTM, Bi-LSTM, GRU, and Bi-GRU in the proposed CSER model when using hard voting.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Performance comparison of LSTM, Bi-LSTM, GRU, and Bi-GRU of the proposed CSER model with hard voting: (a) accuracy and (b) simulation time</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-5.png"/>
</fig><fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Confusion matrices of soft voting applied to the CSER model: (a) LSTM, (b) Bi-LSTM, (c) GRU, and (d) Bi-GRU</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-6.png"/>
</fig><table-wrap id="table-3"><label>Table 3</label>
<caption>
<title>Comparison of accuracy and simulation time of the proposed CSER when using hard voting</title></caption>
<table><colgroup><col align="left"/><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"/>
<th align="left">Accuracy (&#x0025;)</th>
<th align="left">Simulation time (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">LSTM</td>
<td align="left">60.05</td>
<td align="left">60.23</td>
</tr>
<tr>
<td align="left">Bi-LSTM</td>
<td align="left">63.97</td>
<td align="left">152.87</td>
</tr>
<tr>
<td align="left">GRU</td>
<td align="left">62.50</td>
<td align="left">51.02</td>
</tr>
<tr>
<td align="left">Bi-GRU</td>
<td align="left">62.99</td>
<td align="left">97.64</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-3">Tab. 3</xref> shows the accuracy and simulation time when the final emotion is recognized using hard voting with the four RNN techniques applied to the proposed CSER model. Based on <xref ref-type="table" rid="table-3">Tab. 3</xref>, applying the bidirectional technique to LSTM and GRU increased the simulation time by approximately 2.5 and 1.9 times, respectively, while increasing the accuracy by approximately 3.92&#x0025; and 0.49&#x0025;, respectively. When evaluating emotions using hard voting, GRU is therefore optimal in terms of simulation time, since its structure is simpler than that of LSTM.</p>

<p>Furthermore, <xref ref-type="table" rid="table-3">Tab. 3</xref> shows that the GRU technique yields the shortest simulation time when recognizing speech emotions using hard voting in the proposed CSER model. <xref ref-type="table" rid="table-4">Tab. 4</xref> therefore summarizes the accuracy difference and time efficiency of the other three RNN techniques relative to GRU when using hard voting.</p>
<table-wrap id="table-4"><label>Table 4</label>
<caption>
<title>Comparison of accuracy and time efficiency compared to GRU of hard voting</title></caption>
<table><colgroup><col align="left"/><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"/>
<th align="left">Accuracy diff. (&#x0025;)</th>
<th align="left">Time efficiency (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">LSTM</td>
<td align="left">&#x002B;2.45</td>
<td align="left">15.29</td>
</tr>
<tr>
<td align="left">Bi-LSTM</td>
<td align="left">&#x2212;1.47</td>
<td align="left">66.63</td>
</tr>
<tr>
<td align="left">Bi-GRU</td>
<td align="left">&#x2212;0.49</td>
<td align="left">47.75</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-4">Tab. 4</xref> lists the accuracy difference and time efficiency of the three RNN techniques relative to GRU. When speech emotion is recognized with GRU, the accuracy is 2.45&#x0025; higher than with LSTM but 1.47&#x0025; and 0.49&#x0025; lower than with Bi-LSTM and Bi-GRU, respectively.</p>

<p>Nevertheless, as shown in <xref ref-type="table" rid="table-3">Tab. 3</xref>, GRU yields the fastest simulation time (51.02&#x2005;s) among the four RNN techniques when applied to the proposed CSER model, with time-efficiency gains of 15.29&#x0025;, 66.63&#x0025;, and 47.75&#x0025; over LSTM, Bi-LSTM, and Bi-GRU, respectively, as shown in <xref ref-type="table" rid="table-4">Tab. 4</xref>. We therefore confirmed that applying GRU to the proposed CSER model yields the highest time efficiency relative to accuracy among the four RNN techniques.</p>
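<p>The time-efficiency percentages in Tab. 4 follow from the Tab. 3 simulation times as the relative reduction achieved by GRU, i.e., (1 &#x2212; t_GRU/t_other) &#x00D7; 100; this formula is our reading of the tables rather than one stated explicitly in the text. The short script below reproduces the reported values.</p>

```python
# Simulation times (s) from Tab. 3, hard voting.
times = {"LSTM": 60.23, "Bi-LSTM": 152.87, "GRU": 51.02, "Bi-GRU": 97.64}
t_gru = times["GRU"]
for name, t in times.items():
    if name != "GRU":
        eff = (1.0 - t_gru / t) * 100.0   # relative time saved by GRU
        print(f"{name}: {eff:.2f}%")
# LSTM: 15.29%, Bi-LSTM: 66.63%, Bi-GRU: 47.75%, matching Tab. 4
```

<p>The same formula applied to the Tab. 5 times reproduces the soft-voting efficiencies reported in Tab. 6.</p>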
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Performance Analysis of Soft Voting</title>
<p>In this subsection, we apply the LSTM, Bi-LSTM, GRU, and Bi-GRU techniques to the proposed CSER model and compare and analyze the accuracy and simulation time of the ER results through soft voting. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> shows the confusion matrices of emotion recognition results obtained by applying the four RNN techniques to the proposed CSER and using soft voting.</p>
<p>As shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, soft voting recognizes emotions from the voice signals with high probability, similar to the hard voting results in Section 4.2.</p>
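<p>Soft voting can be sketched as averaging the per-chunk class-probability vectors and taking the arg-max of the mean. The class order follows the four emotions analyzed herein; the probability values are hypothetical.</p>

```python
import numpy as np

classes = ["ang", "hap", "neu", "sad"]
# Hypothetical softmax outputs for three 3-s chunks of one utterance.
chunk_probs = np.array([
    [0.50, 0.10, 0.30, 0.10],   # chunk 1
    [0.20, 0.15, 0.45, 0.20],   # chunk 2
    [0.45, 0.05, 0.30, 0.20],   # chunk 3
])
mean_probs = chunk_probs.mean(axis=0)          # average over chunks
print(classes[int(np.argmax(mean_probs))])     # ang
```

<p>Unlike hard voting, soft voting retains each chunk&#x2019;s confidence: a chunk that weakly favors one emotion contributes less to the final decision than a chunk that strongly favors it.</p>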
<p><xref ref-type="fig" rid="fig-7">Figs. 7a</xref> and <xref ref-type="fig" rid="fig-7">7b</xref> show the accuracy and simulation time, respectively, for the four RNN techniques applied to the proposed CSER model; the final ER is evaluated using soft voting. All four RNN techniques showed similar accuracy, while Bi-LSTM and Bi-GRU required simulation times approximately 1.8 times longer than LSTM and GRU, respectively.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Performance comparison of LSTM, Bi-LSTM, GRU, and Bi-GRU of the proposed CSER model with soft voting: (a) Accuracy and (b) Simulation time</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_33082-fig-7.png"/>
</fig>
<p>As shown in <xref ref-type="table" rid="table-5">Tab. 5</xref>, when the bidirectional technique was applied, the ER accuracy of LSTM increased by approximately 3.4&#x0025;, whereas that of GRU did not change. GRU is therefore the most efficient in terms of simulation time for the CSER model under both soft and hard voting. The relative accuracy and time efficiency compared to GRU when using soft voting are listed in <xref ref-type="table" rid="table-6">Tab. 6</xref>.</p>
<table-wrap id="table-5"><label>Table 5</label>
<caption>
<title>Comparison of the accuracy and simulation time of the proposed CSER when using soft voting</title></caption>
<table><colgroup><col align="left"/><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"/>
<th align="left">Accuracy (&#x0025;)</th>
<th align="left">Simulation time (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">LSTM</td>
<td align="left">61.52</td>
<td align="left">75.57</td>
</tr>
<tr>
<td align="left">Bi-LSTM</td>
<td align="left">64.95</td>
<td align="left">135.64</td>
</tr>
<tr>
<td align="left">GRU</td>
<td align="left">63.73</td>
<td align="left">53.25</td>
</tr>
<tr>
<td align="left">Bi-GRU</td>
<td align="left">63.73</td>
<td align="left">97.48</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-6"><label>Table 6</label>
<caption>
<title>Comparison of accuracy and time efficiency compared to GRU of soft voting</title></caption>
<table><colgroup><col align="left"/><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left"/>
<th align="left">Accuracy diff. (&#x0025;)</th>
<th align="left">Time efficiency (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">LSTM</td>
<td align="left">&#x002B;2.21</td>
<td align="left">29.54</td>
</tr>
<tr>
<td align="left">Bi-LSTM</td>
<td align="left">&#x2212;1.22</td>
<td align="left">60.74</td>
</tr>
<tr>
<td align="left">Bi-GRU</td>
<td align="left">0.00</td>
<td align="left">45.37</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="table-6">Tab. 6</xref> compares the accuracy and time efficiency of the other three RNN techniques with GRU when recognizing emotions using soft voting in the proposed CSER model. With GRU, the accuracy is 2.21&#x0025; higher than with LSTM, 1.22&#x0025; lower than with Bi-LSTM, and identical to that of Bi-GRU. However, when GRU is applied to the proposed CSER, the time efficiency of ER increases by 29.54&#x0025;, 60.74&#x0025;, and 45.37&#x0025; over LSTM, Bi-LSTM, and Bi-GRU, respectively. GRU therefore exhibits the best simulation-time performance under both hard and soft voting.</p>

</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this study, we proposed a CSER model in which a voice signal was divided into 3-s long chunks of voice signals to predict emotions using RNN techniques, recognize the emotions predicted in each chunk via hard and soft voting, and evaluate the performance. To evaluate the performance, the ER accuracy and simulation time of four RNN techniques (LSTM, Bi-LSTM, GRU, and Bi-GRU) were compared using hard and soft voting.</p>
<p>According to the simulation results, GRU provided the best trade-off between accuracy and simulation time among the LSTM, Bi-LSTM, GRU, and Bi-GRU techniques in the proposed CSER model. The time efficiency of GRU ranged from 15.29&#x0025; to 66.63&#x0025; for hard voting and from 29.54&#x0025; to 60.74&#x0025; for soft voting. Consequently, the GRU technique is the most efficient in terms of ER per unit simulation time under both hard and soft voting. In future work, we plan to develop an ensemble CSER model that combines LSTM, Bi-LSTM, GRU, and Bi-GRU when recognizing emotions in each chunk of the voice signal to further increase accuracy.</p>
</sec>
</body>
<back><fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> This result was supported by the &#x201C;Regional Innovation Strategy (RIS)&#x201D; through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-004).</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Xing</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Wei</surname></string-name></person-group>, &#x201C;<article-title>DA-Res2net: A novel densely connected residual attention network for image semantic segmentation</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>14</volume>, no. <issue>11</issue>, pp. <fpage>4426</fpage>&#x2013;<lpage>4442</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Hua</surname></string-name></person-group>, &#x201C;<article-title>Attention-based for multiscale fusion underwater image enhancement</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>16</volume>, no. <issue>2</issue>, pp. <fpage>544</fpage>&#x2013;<lpage>564</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>C. C.</given-names> <surname>Loy</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name></person-group>, &#x201C;<article-title>Image super-resolution using deep convolutional networks</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>38</volume>, no. <issue>2</issue>, pp. <fpage>295</fpage>&#x2013;<lpage>307</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Hussain</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zeng</surname></string-name>, <string-name><surname>Xinhong</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Tan</surname></string-name></person-group>, &#x201C;<article-title>A survey on deep convolutional neural networks for image steganography and steganalysis</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>14</volume>, no. <issue>3</issue>, pp. <fpage>1228</fpage>&#x2013;<lpage>1248</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>H. J.</given-names> <surname>Shim</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>High-capacity robust image steganography via adversarial network</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>14</volume>, no. <issue>1</issue>, pp. <fpage>366</fpage>&#x2013;<lpage>381</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Grishman</surname></string-name></person-group>, &#x201C;<article-title>Lexicalized dependency paths based supervised learning for relation extraction</article-title>,&#x201D; <source>Computer Systems Science and Engineering</source>, vol. <volume>43</volume>, no. <issue>3</issue>, pp. <fpage>861</fpage>&#x2013;<lpage>870</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Mu</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Zeng</surname></string-name></person-group>, &#x201C;<article-title>A review of deep learning research</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>13</volume>, no. <issue>4</issue>, pp. <fpage>1738</fpage>&#x2013;<lpage>1764</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Tashev</surname></string-name></person-group>, &#x201C;<article-title>High-level feature representation using recurrent neural network for speech emotion recognition</article-title>,&#x201D; in <conf-name>Proc. Interspeech 2015</conf-name>, <conf-loc>Dresden, Germany</conf-loc>, pp. <fpage>1537</fpage>&#x2013;<lpage>1540</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Luong</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Pham</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Manning</surname></string-name></person-group>, &#x201C;<article-title>Effective approaches to attention-based neural machine translation</article-title>,&#x201D; in <conf-name>Proc. 2015 EMNLP</conf-name>, <conf-loc>Lisbon, Portugal</conf-loc>, pp. <fpage>1412</fpage>&#x2013;<lpage>1421</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Trigeorgis</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Ringeval</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Brueckner</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Marchi</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Nicolaou</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network</article-title>,&#x201D; in <conf-name>Proc. 2016 IEEE ICASSP</conf-name>, <conf-loc>Shanghai, China</conf-loc>, pp. <fpage>5200</fpage>&#x2013;<lpage>5204</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. -W.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>S. S.</given-names> <surname>Narayanan</surname></string-name></person-group>, &#x201C;<article-title>Attention assisted discovery of sub-utterance structure in speech emotion recognition</article-title>,&#x201D; in <conf-name>Proc. INTERSPEECH 2016</conf-name>, <conf-loc>San Francisco, CA, USA</conf-loc>, pp. <fpage>8</fpage>&#x2013;<lpage>12</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Mirsamadi</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Barsoum</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>Automatic speech emotion recognition using recurrent neural networks with local attention</article-title>,&#x201D; in <conf-name>Proc. 2017 IEEE ICASSP</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, pp. <fpage>2227</fpage>&#x2013;<lpage>2231</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Tao</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Advanced LSTM: A study about better time dependency modeling in emotion recognition</article-title>,&#x201D; in <conf-name>Proc. 2018 IEEE ICASSP</conf-name>, <conf-loc>Calgary, AB, Canada</conf-loc>, pp. <fpage>2906</fpage>&#x2013;<lpage>2910</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Sarma</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Ghahremani</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Povey</surname></string-name>, <string-name><given-names>N. K.</given-names> <surname>Goel</surname></string-name>, <string-name><given-names>K. K.</given-names> <surname>Sarma</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Emotion identification from raw speech signals using DNNs</article-title>,&#x201D; in <conf-name>Proc. Interspeech 2018</conf-name>, <conf-loc>Hyderabad, India</conf-loc>, pp. <fpage>3097</fpage>&#x2013;<lpage>3101</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>3-D convolutional recurrent neural networks with attention model for speech emotion recognition</article-title>,&#x201D; <source>IEEE Signal Processing Letters</source>, vol. <volume>25</volume>, no. <issue>10</issue>, pp. <fpage>1440</fpage>&#x2013;<lpage>1444</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Exploring spatio-temporal representations by integrating attention based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition</article-title>,&#x201D; in <conf-name>Proc. Interspeech 2018</conf-name>, <conf-loc>Hyderabad, India</conf-loc>, pp. <fpage>272</fpage>&#x2013;<lpage>276</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Xie</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Zou</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Speech emotion classification using attention-based LSTM</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>27</volume>, no. <issue>11</issue>, pp. <fpage>1675</fpage>&#x2013;<lpage>1685</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Xie</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liang</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>Attention-based dense LSTM for speech emotion recognition</article-title>,&#x201D; <source>IEICE Transactions on Information and Systems</source>, vol. <volume>E102-D</volume>, no. <issue>7</issue>, pp. <fpage>1426</fpage>&#x2013;<lpage>1429</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhao</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Kawahara</surname></string-name></person-group>, &#x201C;<article-title>Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning</article-title>,&#x201D; in <conf-name>Proc. INTERSPEECH 2019</conf-name>, <conf-loc>Graz, Austria</conf-loc>, pp. <fpage>2803</fpage>&#x2013;<lpage>2807</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Jia</surname></string-name></person-group>, &#x201C;<article-title>An ensemble model for multi-level speech emotion recognition</article-title>,&#x201D; <source>Applied Sciences</source>, vol. <volume>10</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>20</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Busso</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Bulut</surname></string-name>, <string-name><given-names>C. C.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kazemzadeh</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Mower</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>IEMOCAP: Interactive emotional dyadic motion capture database</article-title>,&#x201D; <source>Language Resources and Evaluation</source>, vol. <volume>42</volume>, no. <issue>4</issue>, pp. <fpage>335</fpage>&#x2013;<lpage>359</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Eyben</surname></string-name>, <string-name><given-names>M.</given-names> <surname>W&#x00F6;llmer</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Schuller</surname></string-name></person-group>, &#x201C;<article-title>Opensmile: The Munich versatile and fast open-source audio feature extractor</article-title>,&#x201D; in <conf-name>Proc. 18th ACM Int. Conf. on Multimedia</conf-name>, <conf-loc>New York, NY, USA</conf-loc>, pp. <fpage>1459</fpage>&#x2013;<lpage>1462</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name></person-group>, &#x201C;<article-title>Long short-term memory</article-title>,&#x201D; <source>Neural Computation</source>, vol. <volume>9</volume>, no. <issue>8</issue>, pp. <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Schuster</surname></string-name> and <string-name><given-names>K. K.</given-names> <surname>Paliwal</surname></string-name></person-group>, &#x201C;<article-title>Bidirectional recurrent neural networks</article-title>,&#x201D; <source>IEEE Transactions on Signal Processing</source>, vol. <volume>45</volume>, no. <issue>11</issue>, pp. <fpage>2673</fpage>&#x2013;<lpage>2681</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>&#x00C7;.</given-names> <surname>G&#x00FC;l&#x00E7;ehre</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>,&#x201D; in <conf-name>Proc. NIPS 2014 Workshop on Deep Learning</conf-name>, <conf-loc>Montreal, QC, Canada</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2014</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>