<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">15070</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2021.015070</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features</article-title>
<alt-title alt-title-type="left-running-head">1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features</alt-title>
<alt-title alt-title-type="right-running-head">1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Mustaqeem</surname></name>
</contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western">
<surname>Kwon</surname>
<given-names>Soonil</given-names>
</name>
<email>skwon@sejong.edu</email></contrib>
<aff><institution>Interaction Technology Laboratory, Department of Software, Sejong University</institution>, <addr-line>Seoul, 05006</addr-line>, <country>Korea</country></aff>
</contrib-group>
<author-notes><corresp id="cor1">&#x002A;Corresponding Author: Soonil Kwon. Email: <email>skwon@sejong.edu</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-01-29">
<day>29</day>
<month>01</month>
<year>2021</year>
</pub-date>
<volume>67</volume>
<issue>3</issue>
<fpage>4039</fpage>
<lpage>4059</lpage>
<history>
<date date-type="received">
<day>05</day>
<month>11</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>01</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2021 Mustaqeem and Kwon</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Mustaqeem and Kwon</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_15070.pdf"></self-uri>
<abstract>
<p>Emotion recognition from speech data is an active and emerging area of research that plays an important role in numerous applications, such as robotics, virtual reality, behavior assessment, and emergency call centers. Recently, researchers have developed many techniques in this field in order to improve accuracy by utilizing several deep learning approaches, but the recognition rate is still not convincing. Our main aim is to develop a new technique that increases the recognition rate at a reasonable computational cost. In this paper, we propose a one-dimensional dilated convolutional neural network (1D-DCNN) for speech emotion recognition (SER) that utilizes hierarchical feature learning blocks (HFLBs) with a bi-directional gated recurrent unit (BiGRU). We designed a one-dimensional CNN network that uses a spectral analysis to enhance the speech signals and to extract their hidden patterns, which are fed into a stack of one-dimensional dilated blocks called HFLBs. Each HFLB contains one dilated convolution layer (DCL), one batch normalization (BN) layer, and one leaky_relu (Relu) layer in order to extract the emotional features using a hierarchical correlation strategy. Furthermore, the learned emotional features are fed into a BiGRU in order to adjust the global weights and to recognize the temporal cues. The final state of the deep BiGRU is passed through a softmax classifier in order to produce the probabilities of the emotions. The proposed model was evaluated over three benchmark datasets, the IEMOCAP, EMO-DB, and RAVDESS, and achieved 72.75%, 91.14%, and 78.01% accuracy, respectively.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Affective computing</kwd>
<kwd>one-dimensional dilated convolutional neural network</kwd>
<kwd>emotion recognition</kwd>
<kwd>gated recurrent unit</kwd>
<kwd>raw audio clips</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Speech signals are the most dominant source of human communication and an efficient medium of human-computer interaction (HCI) using 5G technology. Emotions express human behavior and are recognized from various bodily expressions, such as speech patterns, facial expressions, gestures, and brain signals [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. In the field of speech signal processing, speech emotion recognition (SER) is the most attractive area of research in this era. Speech signals play an important role in recognizing the emotional state and behavior of a person from his or her speech. Many researchers have introduced various techniques for efficient SER systems in order to recognize individual speech patterns and identify the emotional state of the speaker. Hence, sufficient feature selection and extraction is an extremely challenging task in this area [<xref ref-type="bibr" rid="ref-3">3</xref>]. Artificial intelligence (AI) plays a crucial role in the development of skills and technologies in the field of HCI, including robotics and on-board systems, in order to detect human activity and recognize emotions. Similarly, call centers recognize customers&#x2019; expressions, health care centers recognize the emotional state of patients, and virtual reality applications recognize actions and activities using sensors. In the field of SER, the emotions of the speaker depend on paralinguistic features, not on the lexical content or the speaker. Speech signals carry two major types of signs: paralinguistic signs, which contain hidden messages about the emotions in the speech signals, and linguistic signs, which refer to the meaning or context of the speech signals [<xref ref-type="bibr" rid="ref-4">4</xref>]. Recently, researchers have introduced techniques to improve the SER rate using high-level features, utilizing deep learning approaches in order to extract the hidden cues [<xref ref-type="bibr" rid="ref-5">5</xref>]. In the last decade, researchers have used many acoustic features, such as qualitative, spectral, and continuous features, to investigate the best features of the speech signals [<xref ref-type="bibr" rid="ref-6">6</xref>].</p>
<p>In the current technological era, researchers have utilized neural networks and deep learning tools in order to search for an efficient way to extract the deep features that convey the emotional state of a speaker in the speech data [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-8">8</xref>]. Some researchers introduced hybrid techniques that evaluate handcrafted features with CNN models in order to improve the recognition accuracy of the speech signals. The handcrafted features can ensure accuracy, but the process is difficult due to feature engineering, the amount of time required, and its reliance on manual selection, which particularly depends on expert knowledge [<xref ref-type="bibr" rid="ref-9">9</xref>]. Deep learning approaches, including 2D-CNN models, are specialized for visual data, such as images and videos in computer vision [<xref ref-type="bibr" rid="ref-10">10</xref>], but researchers adopted these models in speech processing and achieved better results than the classical models [<xref ref-type="bibr" rid="ref-7">7</xref>]. In addition, researchers achieved good performance in emotion recognition from speech signals by utilizing deep learning approaches, such as deep belief networks (DBNs), 2D-CNNs, 1D-CNNs, and the long short-term memory (LSTM) network [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;<xref ref-type="bibr" rid="ref-13">13</xref>]. The performance of the deep learning approaches is better than that of the traditional methods. Hence, Fiore et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] developed an SER for an on-board system to detect and analyze the emotional condition of the driver and to take the appropriate actions in order to ensure the passengers&#x2019; safety. Badshah et al. [<xref ref-type="bibr" rid="ref-15">15</xref>] introduced an SER system for smart health care centers in order to analyze customer emotions using a fine-tuned AlexNet model with rectangular kernels. Kwon et al. [<xref ref-type="bibr" rid="ref-5">5</xref>] proposed a novel deep stride CNN network for speech emotion recognition that used the IEMOCAP [<xref ref-type="bibr" rid="ref-16">16</xref>] and the RAVDESS [<xref ref-type="bibr" rid="ref-17">17</xref>] datasets in order to improve the prediction accuracy and decrease the overall model complexity [<xref ref-type="bibr" rid="ref-8">8</xref>]. Kang et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] developed an SER technique in order to analyze emotion type and intensity from arousal and violence features using content analysis in movies. Dias et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] developed an SER method that preserves a speaker&#x2019;s privacy using a privacy-preserving hashing technique based on paralinguistic features.</p>
<p>The SER has recently become an emerging area of research in digital audio signal processing. Many researchers have developed a variety of techniques that utilize deep learning approaches, such as 2D-CNN and 1D-CNN models, in order to increase the level of accuracy. Typically, researchers utilized pre-trained CNN weights in order to extract discriminative high-level features, which were then fed into traditional RNNs for sequence learning [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>]. The recognition performance slightly increased when these pre-trained models were utilized, but the computational complexity also increased with the use of huge pre-trained network weights. The current deep learning approaches, which include CNNs, DBNs, and CNN-LSTM architectures, have not shown substantial enhancements with respect to emotion recognition. In this study, we present a novel end-to-end one-dimensional CNN network with a BiGRU in order to identify the state of a speaker in terms of emotions. We assemble a DCNN-GRU network for the SER that utilizes one-dimensional dilated layers in stacked hierarchical feature learning blocks (HFLBs) in order to extract the emotional features from the speech signals. The GRU network is connected to this network in order to learn and extract the long-term dependencies from the time-varying speech signals. The designed one-dimensional dilated CNN model accepts raw data, such as raw audio clips, in order to remove noise and learn high-level features, which decreases the training time because fewer parameters are used during training. The proposed model is evaluated using three speech datasets, the IEMOCAP, EMO-DB, and RAVDESS, in order to ensure the effectiveness and significance of the model. The key contributions of our work are illustrated below.</p>
<list list-type="bullet">
<list-item><p>We investigated and studied the current literature on speech emotion recognition (SER), and we analyzed the performance of classical learning approaches versus deep learning approaches. Inspired by the performance and recent successes of 2D-CNN models, we designed a one-dimensional dilated convolutional neural network (DCNN) for SER that can learn both spatial and sequential features from raw audio files by leveraging the DCNN with a deep BiGRU. Our proposed system automatically models the temporal dependencies using the learned features.</p></list-item>
<list-item><p>Refining the input data always plays a crucial role in accurate prediction, which ensures an improvement in the final accuracy. The existing methods for the SER lack this prior step of refining the input data, which effectively helps boost the final accuracy. In this study, we propose a new preprocessing scheme that utilizes the FFT and a spectral analysis in order to enhance the speech signals and remove noise from them; this preprocessing step plays an important role in our SER system, which outperformed the state-of-the-art systems.</p></list-item>
<list-item><p>We intensely studied speech signals, linguistic information, and paralinguistic information for the SER, and we propose a method to extract the hierarchical emotional correlations. The existing SER literature lacks a focus on the hierarchical correlations, which make it easier to recognize the emotional state and boost the final accuracy. In this paper, we propose four stacked one-dimensional DCNN hierarchical feature learning blocks (HFLBs) in order to enrich the learned features and easily recognize the emotional signs. Each block consists of one dilated convolutional layer (DCL), one batch normalization (BN) layer, and one leaky_relu (Relu) layer.</p></list-item>
<list-item><p>Our system is suitable for real-time processing because it directly accepts raw audio files without reshaping them on high-computation devices; our experiments show that the system can learn rich emotional features from raw audio files. To the best of our knowledge, the proposed system is new, and this is the first contribution in this domain.</p></list-item>
<list-item><p>We tested the robustness and significance of our proposed system over three benchmark datasets, the IEMOCAP, EMO-DB, and RAVDESS, and it achieved 72.75%, 91.14%, and 78.01% accuracy, respectively. We also compared it with baseline SER methods, and our system outperformed the other systems.</p></list-item>
</list>
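<p>The hierarchical feature learning blocks above are built on one-dimensional dilated convolution, which widens the receptive field without adding parameters. The following NumPy sketch (an illustration with our own toy kernel and signal, not the authors' implementation) shows how a 3-tap kernel with dilation <italic>d</italic> covers 2<italic>d</italic> + 1 input samples:</p>

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D convolution with a dilation factor.

    A kernel of length k with dilation d spans (k - 1) * d + 1
    input samples, so stacking blocks with growing dilation sees
    long temporal context at a constant parameter cost.
    """
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output
    n_out = len(signal) - span + 1
    out = np.empty(n_out)
    for i in range(n_out):
        # sample the input every `dilation` steps under the kernel
        out[i] = np.dot(signal[i:i + span:dilation], kernel)
    return out

x = np.arange(10, dtype=float)             # toy "speech" signal
k = np.array([1.0, 1.0, 1.0])              # 3-tap summing kernel

y1 = dilated_conv1d(x, k, dilation=1)      # spans 3 samples per output
y2 = dilated_conv1d(x, k, dilation=2)      # spans 5 samples, same 3 weights
```

<p>With dilation 2, the first output sums x[0], x[2], and x[4]; the same three weights read a wider window of the signal.</p>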
<p>The rest of the paper is organized as follows. Section 2 highlights the literature on the SER using low-level and high-level descriptors. Section 3 presents the proposed system methods and materials. The experimental results, the comparisons with the baselines, and the discussion are presented in Section 4. Section 5 concludes the paper and offers a future direction for the proposed system.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Literature Review</title>
<p>Speech emotion recognition (SER) has been an active research area of digital signal processing throughout the last decade. Researchers frequently present innovative techniques in order to increase the performance of the SER and reduce the complexity of the overall system. Usually, an SER system has two core parts whose challenges need to be solved for an efficient emotion recognition system: (i) the selection of robust, discriminative, and salient features of the speech signals [<xref ref-type="bibr" rid="ref-21">21</xref>] and (ii) classification methods that classify them accurately. Hence, researchers currently use deep learning approaches as a dominant tool for robust feature extraction and selection [<xref ref-type="bibr" rid="ref-22">22</xref>], which is quickly growing in this field. The state-of-the-art work in [<xref ref-type="bibr" rid="ref-23">23</xref>] developed methods for the SER in order to improve the performance of existing systems, utilizing CNN architectures that extract features from speech spectrograms. Similarly, [<xref ref-type="bibr" rid="ref-24">24</xref>] used an end-to-end deep learning approach for feature extraction and classification [<xref ref-type="bibr" rid="ref-5">5</xref>] in order to recognize the emotional state of distinct speakers.</p>
<p>In this technological era, deep learning approaches have become popular in all fields, and specifically in the field of SER, where 2D-CNN models are utilized to extract the deep hidden cues from the spectrograms of the speech signals [<xref ref-type="bibr" rid="ref-24">24</xref>]. A spectrogram is a 2D plot of the speech signal frequency with respect to time, and it is a more suitable representation for the 2D-CNN models [<xref ref-type="bibr" rid="ref-25">25</xref>]. Similarly, some researchers used transfer learning techniques for the SER, utilizing spectrograms to fine-tune the pre-trained AlexNet [<xref ref-type="bibr" rid="ref-26">26</xref>] and VGG [<xref ref-type="bibr" rid="ref-27">27</xref>] models in order to identify the state of the speakers in terms of emotions [<xref ref-type="bibr" rid="ref-28">28</xref>]. Furthermore, researchers used 2D-CNNs to extract the spatial information from the spectrograms, while LSTMs or RNNs were utilized to extract the hidden sequential and temporal information from the speech signals. Currently, CNNs have increased research interest in the SER. In this regard, [<xref ref-type="bibr" rid="ref-29">29</xref>] developed a new end-to-end method for an SER that utilizes a deep neural network (DNN) with the LSTM, which directly accepts the raw audio data and extracts salient discriminative features rather than relying on handcrafted features. Most researchers used joint CNNs with LSTMs and RNNs for the SER [<xref ref-type="bibr" rid="ref-30">30</xref>] to capture the spatial and temporal cues from the speech data in order to recognize the emotional information. The authors of [<xref ref-type="bibr" rid="ref-31">31</xref>] developed a technique for the SER over variable-length speech, utilizing CNN-RNNs, where the CNN extracts the salient features from the spectrograms and the RNNs handle the length of the speech segment. Similarly, [<xref ref-type="bibr" rid="ref-32">32</xref>] used a hybrid approach for the SER, which utilized a pre-trained AlexNet CNN model for feature extraction and a traditional support vector machine (SVM) for classification. In [<xref ref-type="bibr" rid="ref-33">33</xref>], the authors suggested a deep learning model for spontaneous SER that utilized the RECOLA natural emotions database for the model&#x2019;s evaluation.</p>
<p>The SER can use many CNN methods that take various types of input, including spectrograms, log Mel spectrograms, speech signals, and Mel frequency cepstral coefficients (MFCCs) [<xref ref-type="bibr" rid="ref-34">34</xref>], in order to recognize the emotional state of the speakers [<xref ref-type="bibr" rid="ref-35">35</xref>]. In this field, some researchers combined the traditional approaches with recent advancements by utilizing pre-trained CNN systems in order to extract the salient information from the audio data. Furthermore, they used traditional machine learning classifiers to classify the emotions from the learned features [<xref ref-type="bibr" rid="ref-36">36</xref>], using huge network parameters that boosted the complexity of the overall system. In [<xref ref-type="bibr" rid="ref-8">8</xref>], the authors introduced a new deep learning model for the SER that used the RAVDESS [<xref ref-type="bibr" rid="ref-17">17</xref>] and the IEMOCAP [<xref ref-type="bibr" rid="ref-16">16</xref>] datasets, which uses fewer parameters to recognize the different emotions with high accuracy and less computational complexity. In this article, we propose a new strategy for the SER that uses a one-dimensional DCNN model with four stacked hierarchical blocks. Each block consists of one dilated convolutional layer with a rectified linear unit (relu) and one batch normalization (BN) layer with a proper dropout setting in order to reduce model overfitting. Furthermore, we used a BiGRU to adjust the global weights and to extract the most salient high-level sequential information from the learned features. The final learned sequential information was passed through the last fully connected (FC) layer with a softmax activation in order to produce the probabilities of the different emotions. A detailed description of the proposed technique is given in the upcoming sections.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed SER System</title>
<p>In this section, we demonstrate the proposed speech emotion recognition (SER) framework and its main components and provide a detailed description. In the field of SER, the selection of appropriate features is a challenging problem: researchers must clearly define and distinguish between emotional and non-emotional features. Feature extraction methods are classified into low-level (handcrafted) and high-level (CNN) techniques. The handcrafted feature extraction methods are designed through feature engineering strategies, which are explained in more detail in [<xref ref-type="bibr" rid="ref-37">37</xref>]. The high-level features are extracted using various deep learning approaches, such as DBNs, DNNs, and CNNs, which learn features from the input data by adjusting the weights automatically according to the input data. The prediction performance of the learned features is better than that of the handcrafted features [<xref ref-type="bibr" rid="ref-38">38</xref>]. In the field of SER, most researchers used 2D-CNN models for feature extraction and emotion classification [<xref ref-type="bibr" rid="ref-35">35</xref>], which requires more attention to data preparation. The speech signal lies in one dimension, while 2D-CNNs require two-dimensional input. As a result, converting the speech signals into spectrograms is a compulsory preprocessing step [<xref ref-type="bibr" rid="ref-39">39</xref>]. With the transformation of the 1D speech signals into 2D speech spectrograms, some useful information may be lost; the original speech signals retain richer emotional cues than spectrograms. For these reasons, we propose a novel 1D-DCNN-BiGRU system for the SER that directly accepts the raw speech signals without transforming them, which is explained in the upcoming section.</p>
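<p>To make the 1D-to-2D transformation discussed above concrete, the following NumPy sketch (our own illustrative framing and window choices, not the authors' preprocessing) computes a magnitude spectrogram from a raw 1D signal. Note that the phase of each frame is discarded, which is one concrete way the conversion can lose information:</p>

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Frame a 1D signal and keep only the FFT magnitude per frame.

    The result is the 2D time-frequency image that 2D-CNN-based
    SER systems consume; the per-frame phase is dropped here.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1))
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        spec[t] = np.abs(np.fft.rfft(frame))   # magnitude only
    return spec

fs = 16000                                     # assumed sampling rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                # one second of a 440 Hz tone
S = magnitude_spectrogram(x)                   # shape: (frames, freq bins)
```

<p>For a 440 Hz tone sampled at 16 kHz with 256-sample frames, the spectral peak lands near bin 440 &#x00D7; 256 / 16000 &#x2248; 7, so each row of the 2D image marks the tone's frequency while its phase is gone.</p>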
<sec id="s3_1">
<label>3.1</label>
<title>One Dimensional DCNN-GRU System</title>
<p>The proposed one-dimensional CNN-GRU architecture is constructed to handle the raw speech signals in their original form. The system consists of three main parts, beginning with a preprocessing step that enhances the speech signals using the FFT and a spectral analysis. First, our model utilizes the 1D-DCNN network, which consists of four one-dimensional convolution layers with a rectified linear unit (Relu) and two batch normalization (BN) layers, in order to learn the local features and prepare the initial tensor for the stacked network. We designed the network so that the first two convolution layers have 32 filters of size 7 with a stride of one. Subsequently, we used a BN layer to normalize or re-scale the input in order to improve the speed and performance. Similarly, the third and fourth convolution layers were placed consecutively after the first BN layer; these convolutions had 64 filters of size 5 with a stride of two. Dropout and the L1 kernel regularization technique were used in the network to reduce overfitting [<xref ref-type="bibr" rid="ref-40">40</xref>]. Second, the stacked network had four HFLBs, and each block consisted of one dilated convolution layer, one BN layer, and one leaky_relu layer. The stacked network extracted the salient features from the input tensor using a hierarchical correlation. Finally, a fully connected scheme that consisted of a BiGRU network and a fully connected (FC) layer with a softmax classifier was used to produce the probabilities of the classes. The learned spatiotemporal features were then directly passed through the FC layer, which can be stated as:</p>
<p><disp-formula id="eqn-1">
<label>(1)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-1.png"/>
<tex-math id="tex-eqn-1"><![CDATA[$$\begin{equation}
\mathrm{z}^{\mathrm{L}}=\mathrm{b}^{\mathrm{L}}+\mathrm{z}^{\mathrm{L}-1}\cdot \mathrm{w}^{\mathrm{L}}
 \label{eqn-1}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-1" display="block"><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>b</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msup></mml:math></alternatives></disp-formula></p>
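<p>Each HFLB described above (dilated convolution, batch normalization, leaky_relu) can be sketched as a plain NumPy forward pass. The kernel values, dilation schedule, and input length below are illustrative assumptions, not the trained network's parameters:</p>

```python
import numpy as np

def hflb_forward(x, kernel, dilation, eps=1e-5, alpha=0.01):
    """One hierarchical feature learning block (HFLB):
    dilated 1D convolution -> batch normalization -> leaky ReLU."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    conv = np.array([np.dot(x[i:i + span:dilation], kernel)
                     for i in range(len(x) - span + 1)])
    # inference-style normalization over this feature map
    norm = (conv - conv.mean()) / np.sqrt(conv.var() + eps)
    # leaky ReLU keeps a small slope for negative activations
    return np.where(norm > 0, norm, alpha * norm)

rng = np.random.default_rng(0)
h = rng.standard_normal(64)                 # toy 64-sample feature map
for d in (1, 2, 4, 8):                      # four stacked HFLBs, growing dilation
    h = hflb_forward(h, np.array([0.5, -0.25, 0.5]), dilation=d)
```

<p>Each valid-mode pass with a 3-tap kernel shortens the map by 2<italic>d</italic> samples, so the 64-sample input leaves a 34-sample map after the four blocks, while the effective receptive field of each surviving sample grows to 31 input samples.</p>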
<p>Essentially, softmax was used as the classifier in this model, which makes the prediction based on the maximum probability. We utilized the softmax classifier, generalized for multi-class classification so that the label y can take more than two values, which can be expressed as:</p>
<p><disp-formula id="eqn-2">
<label>(2)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-2.png"/>
<tex-math id="tex-eqn-2"><![CDATA[$$\begin{equation}\mathrm{x}_{\mathrm{i}}=\sum\limits_{\mathrm{j}}\mathrm{h}_{\mathrm{j}}\mathrm{w}_{\mathrm{ji}}
 \label{eqn-2} \end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-2" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mrow><mml:mo>&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow></mml:mrow></mml:math>
</alternatives></disp-formula></p>
<p><disp-formula id="eqn-3">
<label>(3)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-3.png"/>
<tex-math id="tex-eqn-3"><![CDATA[$$\begin{equation}\text{s}\text{o}\text{f}\text{t}\text{m}\text{ax}(\mathrm{z})_{\mathrm{i}}=\mathrm{p}_{\mathrm{i}}=\frac{\mathrm{e}^{{\mathrm{z}_{\mathrm{i}}}}}{\sum_{\mathrm{j}=1}^{\mathrm{n}}\mathrm{e}^{{\mathrm{z}_{\mathrm{i}}}}}
 \label{eqn-3}\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-3" display="block"><mml:mrow></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>s</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>o</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>f</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>t</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>m</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>ax</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>p</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>e</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>n</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:mstyle></mml:mstyle><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>e</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle 
mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow><mml:mrow></mml:mrow></mml:math>
</alternatives></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-2">Eqs. (2)</xref> and <xref ref-type="disp-formula" rid="eqn-3">(3)</xref>, <italic>z<sub>i</sub></italic> represents the input to the softmax, <italic>h<sub>j</sub></italic> denotes the activation of the penultimate layer, and <italic>w<sub>ji</sub></italic> denotes the weight connecting the penultimate layer to the softmax layer. Hence, the predicted emotion or label y would be:</p>
<p><disp-formula id="eqn-4">
<label>(4)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-4.png"/>
<tex-math id="tex-eqn-4"><![CDATA[$$\begin{equation}
\mathrm{y}=\text{a}\text{r}\text{g}\text{m}\text{a}\text{x}(\mathrm{p})_{\mathrm{i}}
 \label{eqn-4}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-4" display="block"><mml:mstyle mathvariant="normal"><mml:mi>y</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>r</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>m</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>x</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>p</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:math></alternatives></disp-formula></p>
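<p>As a concrete illustration, the softmax of Eq. (3) and the argmax prediction of Eq. (4) can be sketched in NumPy (a minimal sketch, not the authors' implementation; the example logits vector is hypothetical):</p>

```python
import numpy as np

def softmax(z):
    """Eq. (3): p_i = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def predict_emotion(z):
    """Eq. (4): predicted label y = argmax_i p_i."""
    return int(np.argmax(softmax(z)))

logits = np.array([1.2, 0.3, 2.5])  # hypothetical network outputs, one per emotion
y = predict_emotion(logits)          # index of the most probable emotion class
```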
<p>The overall structure of the suggested SER system is demonstrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, and a detailed description of the sub-components is explained in the subsequent sections.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>A detailed overview of the proposed one-dimensional architecture of the SER. In the 1D-CNN networks, we extracted the local features from the speech signals and then passed them through the stacked blocks. The blocks are connected to the last fully connected network, where we adjusted the global weights and extracted the temporal or sequential cues by utilizing the gated recurrent unit (GRU) and the fully connected (FC) layers with softmax to calculate the probability of each class or emotion</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-1.png"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Speech Signals Enhancement</title>
<p>Speech signals need to be refined using an efficient technique that enhances the training performance and ensures a reliable final prediction by the system. In our proposed system, we contributed a speech signal enhancement module for the SER, which is summarized in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. We designed the module to enhance the high pass and the low pass speech signals using efficient algorithms. For the low pass speech signals, we estimated the spectrum by utilizing the fast Fourier transformation (FFT) in order to find the low-frequency band of the voice. We performed spectral subtraction by utilizing the algorithm described in [<xref ref-type="bibr" rid="ref-41">41</xref>] in order to denoise and enhance the low pass speech signal, which is expressed as:</p>
<p><disp-formula id="eqn-5">
<label>(5)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-5.png"/>
<tex-math id="tex-eqn-5"><![CDATA[$$\begin{equation}
\mathfrak{r}\mathcal{d}\mathcal{t} \left[\mathcal{n}\,\,\right]=\text{IFFT}
 \left\{\sqrt{\mathcal{S}_{\mathrm{l}}} \left[\mathcal{m}\,\,\right]. \mathrm{e}^{\,\mathcal{j}{\theta_{\mathrm{l}[\mathrm{m}]}}}\right\}\label{eqn-5}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-5" display="block"><mml:mi>&#x1D4c7;</mml:mi><mml:mi>&#x1d49f;</mml:mi><mml:mi>&#x1d4af;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi mathvariant="script">n</mml:mi><mml:mspace width="0.3em"/><mml:mspace width="0.3em"/></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle><mml:mtext>I</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>F</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>F</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>T</mml:mtext></mml:mstyle><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msqrt><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x1D4c2;</mml:mi><mml:mspace width="0.3em"/><mml:mspace width="0.3em"/></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>e</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mspace width="0.3em"/><mml:mi>&#x1D4a5;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>A detailed overview of the proposed speech signals enhancement technique, which utilizes the fast Fourier transformation (FFT) and a spectral analysis (SA) of the low pass and the high pass signals in order to remove noise and clarify the speech</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-2.png"/>
</fig>
<p>The inverse fast Fourier transformation is denoted by IFFT, <inline-formula id="ieqn-2"><alternatives><inline-graphic xlink:href="ieqn-2.png"/><tex-math id="tex-ieqn-2"><![CDATA[$\Theta_{\mathrm{l}}$]]></tex-math><mml:math id="mml-ieqn-2"><mml:msub><mml:mrow><mml:mi>&#x0398;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:math></alternatives></inline-formula> [m] represents the phase of the FFT of the noisy low pass speech signal, and</p>
<p><disp-formula id="eqn-6">
<label>(6)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-6.png"/>
<tex-math id="tex-eqn-6"><![CDATA[$$\begin{equation}
\mathcal{S}_{\mathrm{l}}[\mathrm{m}]= \left\{\begin{array}{l@{\quad}l}
\mathrm{R}_{\mathrm{l}} \left[\mathrm{m}\right]-\alpha \mathrm{N}_{\mathrm{l} \left[\mathrm{m}\right],}& \text{if }\mathrm{R}_{\mathrm{l} \left[\mathrm{m}\right]}> \left(\alpha +\beta \right) {\mathrm{N}_{\mathrm{l}[\mathrm{m}]}} \\[6pt]
\beta \mathrm{N}_{\mathrm{l}[\mathrm{m}]}, & \text{o}\text{t}\text{h}\text{e}\text{r}\text{wise}.\end{array}\right.
 \label{eqn-6}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable equalrows="false" columnlines="none" equalcolumns="false"><mml:mtr><mml:mtd columnalign="left"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:msub><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>if&#x00A0;</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x03B1;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle 
mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="left"><mml:mi>&#x03B2;</mml:mi><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>o</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>t</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>h</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>e</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>r</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>wise</mml:mtext></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr> </mml:mtable></mml:mrow><mml:mo></mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<p>In the above equations, m and n indicate the frequency and time indices, respectively. The squared magnitude of the FFT is represented by <inline-formula id="ieqn-3"><alternatives><inline-graphic xlink:href="ieqn-3.png"/><tex-math id="tex-ieqn-3"><![CDATA[$\mathrm{R}_{\mathrm{l}[\mathrm{m}]}$]]></tex-math><mml:math id="mml-ieqn-3"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></alternatives></inline-formula>, and <inline-formula id="ieqn-4"><alternatives><inline-graphic xlink:href="ieqn-4.png"/><tex-math id="tex-ieqn-4"><![CDATA[$\mathrm{N}_{\mathrm{l}[\mathrm{m}]}$]]></tex-math><mml:math id="mml-ieqn-4"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></alternatives></inline-formula> denotes the noise spectral estimate of the low pass speech signal. <inline-formula id="ieqn-5"><alternatives><inline-graphic xlink:href="ieqn-5.png"/><tex-math id="tex-ieqn-5"><![CDATA[$\alpha$]]></tex-math><mml:math id="mml-ieqn-5"><mml:mi>&#x03B1;</mml:mi></mml:math></alternatives></inline-formula> indicates the positive subtraction factor, and <inline-formula id="ieqn-6"><alternatives><inline-graphic xlink:href="ieqn-6.png"/><tex-math id="tex-ieqn-6"><![CDATA[$\beta$]]></tex-math><mml:math id="mml-ieqn-6"><mml:mi>&#x03B2;</mml:mi></mml:math></alternatives></inline-formula> is the positive spectral floor parameter, as described in [<xref ref-type="bibr" rid="ref-41">41</xref>] for the low pass speech signals. 
Similarly, in the high pass speech signal enhancement, we replaced <inline-formula id="ieqn-7"><alternatives><inline-graphic xlink:href="ieqn-7.png"/><tex-math id="tex-ieqn-7"><![CDATA[$\mathrm{R}_{\mathrm{l}[\mathrm{m}]}$]]></tex-math><mml:math id="mml-ieqn-7"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msub></mml:math></alternatives></inline-formula> in <xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref> with:</p>
<p><disp-formula id="eqn-7">
<label>(7)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-7.png"/>
<tex-math id="tex-eqn-7"><![CDATA[$$\begin{equation}
\mathrm{R}_{\mathrm{h}}[\mathrm{m}]=\sum_{\mathrm{k}=1}^{\mathrm{k}}\mathcal{r}_{\mathrm{h}}^{\,\,\mathrm{k}} \left[\mathrm{m}\right]\mathrm{R}_{\mathrm{h}}^{\mathrm{k}}[\mathrm{m}]
 \label{eqn-7}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>&#x1d4c7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mspace width="0.3em"/><mml:mspace width="0.3em"/><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<p>In addition, we used the following equation for the enhancement of the high pass speech signals:</p>
<p><disp-formula id="eqn-8">
<label>(8)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-8.png"/>
<tex-math id="tex-eqn-8"><![CDATA[$$\begin{equation}
\mathrm{R}[\mathrm{m}]=\sum_{\mathrm{k}=1}^{\mathrm{k}}\mathcal{r}^{\,\,\mathrm{k}} \left[\mathrm{m}\right]\mathrm{R}^{\mathrm{k}}[\mathrm{m}]
 \label{eqn-8}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-8" display="block"><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:msup><mml:mrow><mml:mi>&#x1D4c7;</mml:mi></mml:mrow><mml:mrow><mml:mspace width="0.3em"/><mml:mspace width="0.3em"/><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<p>In these equations, <inline-formula id="ieqn-8"><alternatives><inline-graphic xlink:href="ieqn-8.png"/><tex-math id="tex-ieqn-8"><![CDATA[$\mathrm{R}_{\mathrm{h}}^{\mathrm{k}}[\mathrm{m}]$]]></tex-math><mml:math id="mml-ieqn-8"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> and <inline-formula id="ieqn-9"><alternatives><inline-graphic xlink:href="ieqn-9.png"/><tex-math id="tex-ieqn-9"><![CDATA[$\mathrm{R}^{\mathrm{k}}[\mathrm{m}]$]]></tex-math><mml:math id="mml-ieqn-9"><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> illustrate the squared magnitudes of the FFT of the high pass speech signals for the k-th discrete spherical sequences, while <inline-formula id="ieqn-10"><alternatives><inline-graphic xlink:href="ieqn-10.png"/><tex-math id="tex-ieqn-10"><![CDATA[$\mathcal{r}_{\mathrm{h}}^{\mathrm{k}}[\mathrm{m}]$]]></tex-math><mml:math id="mml-ieqn-10"><mml:msubsup><mml:mrow><mml:mi>&#x1D4c7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> and <inline-formula id="ieqn-11"><alternatives><inline-graphic xlink:href="ieqn-11.png"/><tex-math id="tex-ieqn-11"><![CDATA[$\mathcal{r}^{\mathrm{k}}[\mathrm{m}]$]]></tex-math><mml:math id="mml-ieqn-11"><mml:msup><mml:mrow><mml:mi>&#x1D4c7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> adaptively calculate the frequency-dependent weights. Similarly, the phase <inline-formula id="ieqn-12"><alternatives><inline-graphic xlink:href="ieqn-12.png"/><tex-math id="tex-ieqn-12"><![CDATA[$\Theta_{\mathrm{l}}$]]></tex-math><mml:math id="mml-ieqn-12"><mml:msub><mml:mrow><mml:mi>&#x0398;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:math></alternatives></inline-formula> [m] is replaced by <inline-formula id="ieqn-13"><alternatives><inline-graphic xlink:href="ieqn-13.png"/><tex-math id="tex-ieqn-13"><![CDATA[$\mathcal{r}_{\mathrm{h}}^{\mathrm{l}}[\mathrm{m}]$]]></tex-math><mml:math id="mml-ieqn-13"><mml:msubsup><mml:mrow><mml:mi>&#x1D4c7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>, and the noise spectral estimate <inline-formula id="ieqn-14"><alternatives><inline-graphic xlink:href="ieqn-14.png"/><tex-math id="tex-ieqn-14"><![CDATA[$\mathrm{N}_{\mathrm{l}} \left[\mathrm{m}\right]$]]></tex-math><mml:math id="mml-ieqn-14"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>N</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>l</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> is replaced by <inline-formula id="ieqn-15"><alternatives><inline-graphic xlink:href="ieqn-15.png"/><tex-math id="tex-ieqn-15"><![CDATA[$\mathrm{R}_{\mathrm{h}} \left[\mathrm{m}\right]$]]></tex-math><mml:math id="mml-ieqn-15"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>R</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></alternatives></inline-formula> in order to enhance the high pass speech signals. A detailed overview of the proposed speech signal enhancement module is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>.</p>
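<p>The low pass enhancement of Eqs. (5) and (6) can be sketched for a single frame in NumPy as follows (a minimal sketch only; the alpha and beta values and the noise-estimate input are illustrative assumptions, not the tuned parameters of [41]):</p>

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, alpha=4.0, beta=0.01):
    """One-frame spectral subtraction in the spirit of Eqs. (5)-(6):
    subtract a scaled noise power spectrum, floor the result, and
    resynthesise with the noisy phase. alpha is the over-subtraction
    factor and beta the spectral floor (illustrative values)."""
    spec = np.fft.fft(noisy)
    R = np.abs(spec) ** 2                    # squared-magnitude spectrum R_l[m]
    N = np.abs(np.fft.fft(noise_est)) ** 2   # noise power estimate N_l[m]
    # Eq. (6): subtract alpha*N where the signal dominates, else floor at beta*N
    S = np.where(R > (alpha + beta) * N, R - alpha * N, beta * N)
    theta = np.angle(spec)                   # noisy phase theta_l[m]
    # Eq. (5): inverse FFT of sqrt(S_l[m]) * e^{j theta_l[m]}
    return np.real(np.fft.ifft(np.sqrt(S) * np.exp(1j * theta)))
```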
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Hierarchical Features Learning Blocks (HFLBs)</title>
<p>The proposed stacked one-dimensional network consists of four HFLBs, which were specially designed to extract the emotional information and the hierarchical correlation and to boost the learned local features. All the blocks were stacked in a hierarchy, and each HFLB had one dilated convolution layer, one BN layer, and one leaky rectified linear unit. In the dilated convolution layer, we used a 1D filter of size 3 with a stride of one to extract the high-level salient cues from the audio signals with a leaky_relu activation function. The batch normalization (BN) layer was used in every HFLB after the dilated convolution layer to normalize and re-scale the features map in order to increase the recognition performance and speed up the training process. In the pooling scheme, we adopted an average pooling strategy with a filter of size 4 and the same stride setting in the 1D-CNN architecture. We selected the average value in order to down-sample the input tensor while removing redundancy and distortion. The dilated convolution layer played an important role in the HFLBs, extracting the most salient emotional cues from the learned features and producing a features map by computing the dot (.) product between the input values and the filters, which can be represented as:</p>
<p><disp-formula id="eqn-9">
<label>(9)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-9.png"/>
<tex-math id="tex-eqn-9"><![CDATA[$$\begin{equation}
\mathrm{z} \left(\mathrm{n}\right)=\mathrm{x} \left(\mathrm{n}\right) * \mathrm{w} \left(\mathrm{n}\right)=\sum_{\mathrm{m}=-\mathrm{L}}^{\mathrm{L}}\mathrm{x} \left(\mathrm{m}\right).\mathrm{w} \left(\mathrm{n}-\mathrm{m}\right)
 \label{eqn-9}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-9" display="block"><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>n</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>n</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>*</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>n</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:munderover><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:munderover></mml:mstyle></mml:mstyle><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>n</mml:mi></mml:mstyle><mml:mo>-</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>m</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<p>The proposed 1D convolution layer takes a signal x(n) as input and produces the output z(n) by utilizing the convolution filters w(n) of size L. In the suggested model, the filter w(n) is randomly initialized in the dilated convolution layer in our experiments.</p>
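<p>A minimal NumPy sketch of the convolution in Eq. (9), together with the zero-insertion view of a dilated filter (the <monospace>dilation</monospace> argument and the same-length output padding are illustrative assumptions, not the authors' exact layer):</p>

```python
import numpy as np

def conv1d(x, w, dilation=1):
    """Discrete 1D convolution of Eq. (9): z(n) = sum_m x(m) * w(n - m).
    A dilated filter is emulated by inserting (dilation - 1) zeros
    between the filter taps, widening the receptive field without
    adding parameters."""
    if dilation > 1:
        dilated = np.zeros((len(w) - 1) * dilation + 1)
        dilated[::dilation] = w  # place the original taps `dilation` apart
        w = dilated
    # 'same' keeps the output the same length as the input signal
    return np.convolve(x, w, mode="same")
```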
<p>The leaky rectified linear unit (leaky_relu) activation function <inline-formula id="ieqn-16"><alternatives><inline-graphic xlink:href="ieqn-16.png"/><tex-math id="tex-ieqn-16"><![CDATA[$\sigma$]]></tex-math><mml:math id="mml-ieqn-16"><mml:mi>&#x03C3;</mml:mi></mml:math></alternatives></inline-formula> is utilized in order to introduce non-linearity while attenuating the negative values, which can be represented as:</p>
<p><disp-formula id="eqn-10">
<label>(10)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-10.png"/>
<tex-math id="tex-eqn-10"><![CDATA[$$\begin{equation}
\sigma (\mathrm{x}_{\mathrm{i}})= \left\{\begin{array}{l@{\quad}l}
\mathrm{x}_{\mathrm{i}}, & \text{if }\mathrm{x}_{\mathrm{i}}\geq 0 \\
\sigma \left(\mathrm{e}^{\mathrm{x}_{\mathrm{i}}}-1\right), & \text{if }\mathrm{x}_{\mathrm{i}}< 0 \end{array}\right.
 \label{eqn-10}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-10" display="block"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable equalrows="false" columnlines="none" equalcolumns="false"><mml:mtr><mml:mtd columnalign="left"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>if&#x00A0;</mml:mtext></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width=".3em" /><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="left"><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>e</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mspace width="1em"/></mml:mtd><mml:mtd columnalign="left"><mml:mstyle><mml:mtext>i</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>f</mml:mtext></mml:mstyle><mml:mspace width=".3em" /><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr> 
</mml:mtable></mml:mrow><mml:mo></mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
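<p>Eq. (10) can be read directly as the following piecewise function (a minimal sketch; the small scale parameter <monospace>alpha</monospace> for the exponential negative branch is an illustrative assumption, standing in for the factor the printed equation writes as sigma):</p>

```python
import numpy as np

def activation(x, alpha=0.01):
    """Piecewise activation of Eq. (10): x_i for x_i >= 0, and an
    exponential (ELU-style) branch alpha * (exp(x_i) - 1) for x_i < 0,
    where alpha is a small scale for the negative side."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```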
<p>After the leaky_relu function, we used the BN layer to normalize the input features map of the previous layer for each batch. The transformation is applied in the BN layer by utilizing the mean and the variance of the convolved features, which is represented as:</p>
<p><disp-formula id="eqn-11">
<label>(11)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-11.png"/>
<tex-math id="tex-eqn-11"><![CDATA[$$\begin{equation}
\mathrm{z}_{\mathrm{i}}^{\mathrm{L}}=\sigma (\mathrm{BN}(\mathrm{b}_{\mathrm{i}}^{\mathrm{L}}+\sum_{\mathrm{j}}\mathrm{z}_{\mathrm{j}}^{\mathrm{L}-1}. \mathrm{w}_{\mathrm{ij}}^{\mathrm{L}}))
 \label{eqn-11}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>b</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:munder><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:munder><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>.</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></alternatives></disp-formula></p>
<p>In the above equation, <inline-formula id="ieqn-17"><alternatives><inline-graphic xlink:href="ieqn-17.png"/><tex-math id="tex-ieqn-17"><![CDATA[$\mathrm{z}_{\mathrm{i}}^{\mathrm{L}}$]]></tex-math><mml:math id="mml-ieqn-17"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> and <inline-formula id="ieqn-18"><alternatives><inline-graphic xlink:href="ieqn-18.png"/><tex-math id="tex-ieqn-18"><![CDATA[$\mathrm{z}_{\mathrm{j}}^{\mathrm{L}-1}$]]></tex-math><mml:math id="mml-ieqn-18"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> denote the output feature of the i-th input at the L-th layer and the j-th input feature at the (<italic>L</italic> &#x2212;1)th layer, respectively. The convolution filters between the i-th and the j-th features are represented by <inline-formula id="ieqn-19"><alternatives><inline-graphic xlink:href="ieqn-19.png"/><tex-math id="tex-ieqn-19"><![CDATA[$\mathrm{w}_{\mathrm{ij}}^{\mathrm{L}}$]]></tex-math><mml:math id="mml-ieqn-19"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>w</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula>. 
In the 1D-CNN architecture, we passed the normalized features through the average pooling layer in order to reduce the dimensionality of the features map, a down-sampling technique that is represented as:</p>
<p><disp-formula id="eqn-12">
<label>(12)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-12.png"/>
<tex-math id="tex-eqn-12"><![CDATA[$$\begin{equation}
\mathrm{z}_{\mathrm{k}}^{\mathrm{L}}=\text{avg }\mathrm{z}_{\mathrm{p}}^{\mathrm{L}}\parallel \forall \rho \in \Upomega _{\mathrm{k}}
 \label{eqn-12}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-12" display="block"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle><mml:mtext>avg&#x00A0;</mml:mtext></mml:mstyle><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>p</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mo>&#x2200;</mml:mo><mml:mi>&#x03C1;</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x03A9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:math></alternatives></disp-formula></p>
<p>In the above equation, <inline-formula id="ieqn-20"><alternatives><inline-graphic xlink:href="ieqn-20.png"/><tex-math id="tex-ieqn-20"><![CDATA[$\Upomega _{\mathrm{k}}$]]></tex-math><mml:math id="mml-ieqn-20"><mml:msub><mml:mrow><mml:mi>&#x03A9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow></mml:msub></mml:math></alternatives></inline-formula> denotes the pooling region with index k, and <inline-formula id="ieqn-21"><alternatives><inline-graphic xlink:href="ieqn-21.png"/><tex-math id="tex-ieqn-21"><![CDATA[$\mathrm{z}_{\mathrm{p}}^{\mathrm{L}}$]]></tex-math><mml:math id="mml-ieqn-21"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>p</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> represents the p-th input feature map of the L-th average pooling layer. The output of the pooling operation is <inline-formula id="ieqn-22"><alternatives><inline-graphic xlink:href="ieqn-22.png"/><tex-math id="tex-ieqn-22"><![CDATA[$\mathrm{z}_{\mathrm{k}}^{\mathrm{L}}$]]></tex-math><mml:math id="mml-ieqn-22"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula>. This pooling scheme, given in <xref ref-type="disp-formula" rid="eqn-12">Eq. (12)</xref>, is a core operation of the 1D-CNN.</p>
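<p>The layer computation of Eq. (11) and the average pooling of Eq. (12) can be sketched in plain NumPy. The sketch below is illustrative only, not the authors' implementation: the tensor shapes, the per-channel batch normalization, and the choice of ReLU for the activation are assumptions.</p>

```python
import numpy as np

def conv_bn_act(z_prev, w, b, dilation=1, eps=1e-5):
    """One dilated 1-D convolution followed by batch normalization and
    an activation, as in Eq. (11).  z_prev: (C_in, T) input feature maps,
    w: (C_out, C_in, K) filters, b: (C_out,) biases."""
    c_out, c_in, k = w.shape
    t = z_prev.shape[1]
    span = (k - 1) * dilation + 1           # receptive field of one filter
    t_out = t - span + 1
    z = np.zeros((c_out, t_out))
    for i in range(c_out):                  # i-th output feature map
        for j in range(c_in):               # sum over the j-th input maps
            for tau in range(t_out):
                window = z_prev[j, tau:tau + span:dilation]
                z[i, tau] += np.dot(window, w[i, j])
        z[i] += b[i]
    # batch normalization over the time axis, then ReLU as the activation
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z = (z - mean) / np.sqrt(var + eps)
    return np.maximum(z, 0.0)

def avg_pool1d(z, pool=2):
    """Average pooling of Eq. (12): each output z_k is the mean of the
    inputs z_p over the pooling region Omega_k."""
    c, t = z.shape
    t_out = t // pool
    return z[:, :t_out * pool].reshape(c, t_out, pool).mean(axis=2)
```

<p>A dilation of 2 with a kernel of size 3 covers a span of five samples, so a 10-sample input yields six output positions per channel before pooling.</p>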
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Bi-Directional Gated Recurrent Units (BiGRU)</title>
<p>Gated recurrent units (GRUs) are a simplified network for time-series data that recognizes sequential or temporal information [<xref ref-type="bibr" rid="ref-42">42</xref>]. The GRU is a simplified version of the long short-term memory (LSTM) network and is very popular for sequential learning. The GRU combines two gates, an update gate and a reset gate, whose internal mechanism differs from that of the LSTM: the GRU update gate merges the forget gate and the input gate of the LSTM, while the reset gate remains the same. The model has gradually become popular due to its simplicity relative to the standard LSTM network. The GRU modifies the information inside the units similarly to the LSTM network, but it does not use separate memory cells. The activation of the GRU <inline-formula id="ieqn-23"><alternatives><inline-graphic xlink:href="ieqn-23.png"/><tex-math id="tex-ieqn-23"><![CDATA[$\mathrm{h}_{\mathrm{t}}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-23"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> at time t is a linear interpolation between the candidate activation <inline-formula id="ieqn-24"><alternatives><inline-graphic xlink:href="ieqn-24.png"/><tex-math id="tex-ieqn-24"><![CDATA[$\hat{\mathrm{h}}_{\mathrm{t}}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-24"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> and the previous activation <inline-formula id="ieqn-25"><alternatives><inline-graphic xlink:href="ieqn-25.png"/><tex-math id="tex-ieqn-25"><![CDATA[$\mathrm{h}_{\mathrm{t}-1}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-25"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula>, which can be represented as:</p>
<p><disp-formula id="eqn-13">
<label>(13)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-13.png"/>
<tex-math id="tex-eqn-13"><![CDATA[$$\begin{equation}
\mathrm{h}_{\mathrm{t}}^{\mathrm{j}}= \left(1-\mathrm{z}_{\mathrm{t}}
^{\mathrm{j}}\right)\mathrm{h}_{\mathrm{t}-1}^{\mathrm{j}}+\mathrm{z}_
{\mathrm{t}}^{\mathrm{j}}\hat{\mathrm{h}}_{\mathrm{t}}^{\mathrm{j}}
 \label{eqn-13}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-13" display="block"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></disp-formula></p>
<p><inline-formula id="ieqn-26"><alternatives><inline-graphic xlink:href="ieqn-26.png"/><tex-math id="tex-ieqn-26"><![CDATA[$\mathrm{z}_{\mathrm{t}}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-26"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> represents the update gate, which decides how much the unit updates its activation and is computed as:</p>
<p><disp-formula id="eqn-14">
<label>(14)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-14.png"/>
<tex-math id="tex-eqn-14"><![CDATA[$$\begin{equation}
\mathrm{z}_{\mathrm{t}}^{\mathrm{j}}=\sigma \left(\mathrm{W}_{\mathrm{z}}\mathrm{x}_{\mathrm{t}}+\mathrm{U}_{\mathrm{z}}\mathrm{h}_{\mathrm{t}-1}\right)^{\mathrm{j}}
 \label{eqn-14}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-14" display="block"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>U</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>z</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msup></mml:math></alternatives></disp-formula></p>
<p>Similarly, the candidate activation <inline-formula id="ieqn-27"><alternatives><inline-graphic xlink:href="ieqn-27.png"/><tex-math id="tex-ieqn-27"><![CDATA[$\hat{\mathrm{h}}_{\mathrm{t}}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-27"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> is computed using the following equation:</p>
<p><disp-formula id="eqn-15">
<label>(15)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-15.png"/>
<tex-math id="tex-eqn-15"><![CDATA[$$\begin{equation}
\hat{\mathrm{h}}_{\mathrm{t}}^{\mathrm{j}}=\tanh \left(\mathrm{W}\mathrm{x}_{\mathrm{t}}+\mathrm{U}(\mathrm{r}_{\mathrm{t}}.\mathrm{*h}_{\mathrm{t}-1})\right)^{\mathrm{j}}
 \label{eqn-15}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-15" display="block"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo> tanh</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>W</mml:mi></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>U</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>.</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mo>*</mml:mo><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msup></mml:math></alternatives></disp-formula></p>
<p>In <xref ref-type="disp-formula" rid="eqn-15">Eq. (15)</xref>, <inline-formula id="ieqn-28"><alternatives><inline-graphic xlink:href="ieqn-28.png"/><tex-math id="tex-ieqn-28"><![CDATA[$\mathrm{r}_{\mathrm{t}}^{\mathrm{j}}$]]></tex-math><mml:math id="mml-ieqn-28"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:math></alternatives></inline-formula> represents the reset gate, and element-wise multiplication is denoted by (.*). The reset gate allows the unit to forget the previous cues when <inline-formula id="ieqn-29"><alternatives><inline-graphic xlink:href="ieqn-29.png"/><tex-math id="tex-ieqn-29"><![CDATA[$\mathrm{r}_{\mathrm{t}}^{\mathrm{j}}=0$]]></tex-math><mml:math id="mml-ieqn-29"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></alternatives></inline-formula>, i.e., when the gate is off. This mechanism lets the unit act as if it were reading the first symbol of an input sequence. The reset gate is calculated using the following equation:</p>
<p><disp-formula id="eqn-16">
<label>(16)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-16.png"/>
<tex-math id="tex-eqn-16"><![CDATA[$$\begin{equation}
\mathrm{r}_{\mathrm{t}}^{\mathrm{j}}=\sigma \left(\mathrm{W}_{\mathrm{r}}\mathrm{x}_{\mathrm{t}}+\mathrm{U}_{\mathrm{r}}\mathrm{h}_{\mathrm{t}-1}\right)^{\mathrm{j}}
 \label{eqn-16}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-16" display="block"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>U</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>r</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>j</mml:mi></mml:mstyle></mml:mrow></mml:msup></mml:math></alternatives></disp-formula></p>
<p>where z denotes the update gate, which controls how much of the previous state is retained, and r denotes the reset gate. The reset gate captures the short-term dependencies in the units, while the update gate captures the long-term dependencies. In this paper, we utilized the BiGRU network for the SER in order to capture the temporal information in the learned features. The structure of the BiGRU network is illustrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>.</p>
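<p>The gate equations above can be summarized in a short NumPy sketch of one GRU time step and a bidirectional pass. This is a minimal illustration, not the authors' code: the weight names, the zero initial states, and the use of the two final hidden states for the bidirectional output are assumptions.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step following Eqs. (13)-(16); p holds the weight
    matrices W_z, U_z, W_r, U_r, W_h, U_h."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)            # update gate, Eq. (14)
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)            # reset gate, Eq. (16)
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev))  # candidate, Eq. (15); r_t * h_prev is the (.*) product
    return (1.0 - z_t) * h_prev + z_t * h_hat                    # interpolation, Eq. (13)

def bigru(x_seq, p_fwd, p_bwd, hidden):
    """Bidirectional GRU: run the sequence forward and backward with
    separate parameters and concatenate the two final hidden states."""
    h_f = np.zeros(hidden)
    for x_t in x_seq:
        h_f = gru_step(x_t, h_f, p_fwd)
    h_b = np.zeros(hidden)
    for x_t in reversed(x_seq):
        h_b = gru_step(x_t, h_b, p_bwd)
    return np.concatenate([h_f, h_b])
```

<p>Because the reset product r_t * h_prev feeds only the candidate activation, a reset gate near zero makes the candidate depend on the current input alone, which is the forgetting behavior described above.</p>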
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The overall internal and external architecture of the gated recurrent unit (GRUs). The internal structure is illustrated in (a), and the external structure is illustrated in (b)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-3.png"/>
</fig>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Model Optimization and Computational Setup</title>
<p>In this section, we explain the detailed implementation of the suggested framework, which was implemented in a Python environment using the scikit-learn library and other machine-learning libraries. We tuned the model with different hyperparameters in order to make it suitable for speech emotion recognition (SER), evaluating different batch sizes, learning rates, optimizers, numbers of epochs, and regularization factors, such as L1 and L2 with different values. We divided the data into an 80:20 ratio: 80% of the data was used for model training, and the remaining 20% was used for testing. In these experiments, we predicted emotions directly from the raw audio data rather than conducting any pre-processing. We used a single GeForce GTX 1070 NVIDIA GPU with 8 GB of memory for model training and evaluation. We trained our model with an early stopping method in order to save the best model, and we set the learning rate to 0.0001 with one decay after 10 epochs. We set the batch size to 32 for all the datasets and used 32 hidden units in the GRUs. The Adam optimizer was used throughout training; it achieved the best precision, with only a 0.3351 training loss and a 0.5642 validation loss.</p>
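<p>The early stopping procedure described above can be sketched as simple bookkeeping over per-epoch validation losses. This is a hypothetical illustration of the stopping rule, not the authors' training code; the patience parameter and the list of losses standing in for a real validation loop are assumptions.</p>

```python
def train_with_early_stopping(val_losses, patience=10):
    """Minimal early-stopping bookkeeping: remember the epoch with the
    lowest validation loss (where the best checkpoint would be saved)
    and stop after `patience` epochs without improvement."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:          # improvement: save this checkpoint
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:    # no improvement for `patience` epochs
                break
    return best_epoch, best_loss
```

<p>In a real loop, the loss at each epoch would come from evaluating the model on the held-out 20% split, and the checkpoint from the returned best epoch would be restored for testing.</p>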
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental Evaluation and Results</title>
<p>The proposed one-dimensional DCNN-GRU architecture was experimentally evaluated using three benchmark speech emotion datasets: IEMOCAP [<xref ref-type="bibr" rid="ref-16">16</xref>], EMO-DB [<xref ref-type="bibr" rid="ref-43">43</xref>], and RAVDESS [<xref ref-type="bibr" rid="ref-17">17</xref>]. All of these datasets are acted: the actors read scripted utterances while expressing different emotions. We documented the detailed experimental evaluations and results in order to demonstrate the effectiveness and robustness of the system in the SER domain. A detailed explanation of the datasets, the model evaluations, and the performance is included in the subsequent sections.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets</title>
<p>The IEMOCAP dataset, the interactive emotional dyadic motion capture database [<xref ref-type="bibr" rid="ref-16">16</xref>], is a challenging and well-known emotional speech dataset. It consists of audio, video, facial motion capture, and text transcriptions recorded by 10 different actors. The dataset contains 12 hours of audio-visual data across five sessions, and each session features two actors, one male and one female. The dataset was annotated by three field experts who assigned the labels individually, and we selected the files that at least two experts agreed upon. For the comparative analysis, we selected the four emotion categories most frequently used in the literature, namely anger, happiness, sadness, and neutral, with 1103, 1084, 1636, and 1708 utterances, respectively.</p>
<p>The EMO-DB dataset, the Berlin emotion database [<xref ref-type="bibr" rid="ref-43">43</xref>], was recorded by ten experienced actors and includes 535 utterances with different emotions. Five male and five female actors read pre-determined sentences in order to express the different emotions. The utterances in EMO-DB are approximately three to five seconds long, sampled at 16 kHz. The EMO-DB dataset is very popular and is frequently used for SER in machine learning and deep learning approaches.</p>
<p>The RAVDESS dataset, the Ryerson audio-visual database of emotional speech and song [<xref ref-type="bibr" rid="ref-17">17</xref>], is an English-language dataset. It is a simulated dataset that is broadly used in recognition systems to identify the emotional state of the speaker from speech and song. The dataset was recorded by twenty-four experienced actors, twelve male and twelve female, using eight different emotions. The emotions anger, calm, sadness, happiness, neutral, disgust, fearful, and surprised contain 192, 192, 192, 192, 96, 192, 192, and 192 audio files, respectively.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental Evaluation</title>
<p>In this section, we evaluated our suggested system using three benchmark databases in order to test the model&#x2019;s robustness and effectiveness in emotion recognition. All the datasets consist of a different number of speakers, so we split the data into an 80:20 ratio in each fold: 80% of the data was utilized for model training, and the remaining 20% was utilized for model testing. We used a new strategy for model training in this framework. First, we removed unimportant cues, such as noise, from the raw audio files using the FFT and a spectral analysis. We then used a new 1D-CNN architecture to extract the hidden patterns from the speech segments and fed them into a stacked dilated CNN network in order to extract high-level discriminative emotional features with a hierarchical correlation. Similarly, we used the GRU network to extract the temporal cues from the learned emotional features, as explained in detail in Section 3.4. The suggested SER architecture uses various evaluation metrics, such as precision, recall, F1 score, weighted accuracy, un-weighted accuracy, and the confusion matrix for each dataset in order to check the model&#x2019;s prediction performance. The weighted accuracy is the number of correctly predicted labels divided by the total number of labels, while the un-weighted accuracy averages the per-class accuracies, i.e., the correctly predicted labels divided by the total labels in the corresponding class. The F1 score is the harmonic mean of precision and recall, which balances the two values. The confusion matrix shows, in each row and column, the correctly predicted labels and the confusion with the other emotions in the corresponding class.
We evaluated all the datasets in order to generate the model prediction performance, the confusion matrix, and the class-wise accuracy. In addition, we conducted an ablation study to select the best architecture for the SER. A detailed evaluation and the model configuration are given in <xref ref-type="table" rid="table-1">Tab. 1</xref> for all the suggested datasets.</p>
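<p>All of these metrics can be derived from a single confusion matrix. The following NumPy sketch uses the common SER convention of weighted accuracy as overall accuracy and un-weighted accuracy as the mean of the per-class accuracies; it is an illustrative helper, not the authors' evaluation code.</p>

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision/recall/F1 plus weighted and un-weighted
    accuracy from a confusion matrix cm[true, pred]."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                                      # correct predictions per class
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)    # over predicted labels (columns)
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)       # class-wise accuracy (rows)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    weighted_acc = tp.sum() / cm.sum()                    # overall accuracy
    unweighted_acc = recall.mean()                        # mean of class accuracies
    return precision, recall, f1, weighted_acc, unweighted_acc
```

<p>On a balanced dataset the two accuracies coincide; on an imbalanced one, the un-weighted accuracy penalizes models that neglect small classes.</p>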
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>An ablation study of our proposed model configuration for three standard speech emotion datasets using an audio clip or a waveform</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Input</th>
<th>Architecture</th>
<th>IEMOCAP (%)</th>
<th>EMO-DB (%)</th>
<th>RAVDESS (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Audio clip or raw waveform</td>
<td><inline-formula id="ieqn-30"><alternatives><inline-graphic xlink:href="ieqn-30.png"/><tex-math id="tex-ieqn-30"><![CDATA[$\mathrm{CNN}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-30"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>64.20</td>
<td>83.83</td>
<td>68.56</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-31"><alternatives><inline-graphic xlink:href="ieqn-31.png"/><tex-math id="tex-ieqn-31"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ CNN}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-31"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>65.18</td>
<td>83.90</td>
<td>69.21</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-32"><alternatives><inline-graphic xlink:href="ieqn-32.png"/><tex-math id="tex-ieqn-32"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ dilated\ CNN}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-32"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>66.33</td>
<td>85.39</td>
<td>69.67</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-33"><alternatives><inline-graphic xlink:href="ieqn-33.png"/><tex-math id="tex-ieqn-33"><![CDATA[$\mathrm{CNN}+ \mathrm{BiLSTM}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-33"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:mi>M</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>65.44</td>
<td>86.83</td>
<td>69.37</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-34"><alternatives><inline-graphic xlink:href="ieqn-34.png"/><tex-math id="tex-ieqn-34"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ CNN}+ \mathrm{BiLSTM}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-34"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:mi>M</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>66.01</td>
<td>87.21</td>
<td>71.29</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-35"><alternatives><inline-graphic xlink:href="ieqn-35.png"/><tex-math id="tex-ieqn-35"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ dilated\ CNN}+ \mathrm{BiLSTM}+\mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-35"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:mi>M</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>67.26</td>
<td>88.02</td>
<td>74.67</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-36"><alternatives><inline-graphic xlink:href="ieqn-36.png"/><tex-math id="tex-ieqn-36"><![CDATA[$\mathrm{CNN}+ \mathrm{BiGRU}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-36"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:mi>U</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>66.64</td>
<td>87.72</td>
<td>73.96</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-37"><alternatives><inline-graphic xlink:href="ieqn-37.png"/><tex-math id="tex-ieqn-37"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ CNN}+ \mathrm{BiGRU}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-37"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:mi>U</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>68.86</td>
<td>89.32</td>
<td>75.66</td>
</tr>
<tr>
<td/>
<td><inline-formula id="ieqn-38"><alternatives><inline-graphic xlink:href="ieqn-38.png"/><tex-math id="tex-ieqn-38"><![CDATA[$\mathrm{CNN}+ \mathrm{Stacked\ dilated\ CNN}+ \mathrm{BiGRU}+ \mathrm{Softmax}$]]></tex-math><mml:math id="mml-ieqn-38"><mml:mstyle mathvariant="normal"><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mspace width=".3em" /><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>B</mml:mi><mml:mi>i</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:mi>U</mml:mi></mml:mstyle><mml:mo>+</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula></td>
<td>72.75</td>
<td>91.14</td>
<td>78.01</td>
</tr>
</tbody>
</table></table-wrap>
<p>The ablation study compares the different architectures and their recognition results, which statistically indicate the best model for the suggested task. We evaluated different CNN architectures in the ablation study, selected the best one, and used it for further investigation. A detailed experimental evaluation on the IEMOCAP dataset is given below.</p>
<p><xref ref-type="table" rid="table-2">Tab. 2</xref> presents the prediction performance of the system, summarizing the classification of each class in terms of precision, recall, and F1 score on the IEMOCAP dataset. The overall prediction performance of the model is computed using the weighted and the un-weighted accuracy. Our model predicts the anger and sad emotions with high accuracy, while it predicts the happy emotion with comparatively low accuracy. The happy emotion carries less linguistic information than the other emotions, and the neutral emotion is the most similar to the others; due to these characteristics, the model confuses and occasionally misrecognizes these emotions. The confusion matrix relates the actual labels to the predicted labels of each emotion, and the class-wise prediction performance of the system for the IEMOCAP database is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
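The per-class scores of this kind can be derived directly from a confusion matrix. The following sketch computes precision, recall, and F1 from a hypothetical 4-class matrix; the off-diagonal counts are illustrative assumptions chosen to be consistent with the reported recalls, not the paper's actual matrix.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted)
# for anger, happy, neutral, sad. Values are illustrative only.
cm = np.array([
    [83, 12,  3,  2],
    [ 5, 59, 10, 26],
    [ 9,  9, 70, 12],
    [ 1, 10, 10, 79],
])

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)       # correct predictions per class
    recall = tp / cm.sum(axis=1)         # row sums: all actual samples of a class
    precision = tp / cm.sum(axis=0)      # column sums: all predictions of a class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = per_class_metrics(cm)
```

With this matrix the recalls come out to 0.83, 0.59, 0.70, and 0.79, matching the recall column of Tab. 2.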
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Prediction performance of the proposed model in terms of precision, recall, F1_score, the weighted score, and the un-weighted score of the IEMOCAP dataset</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Classes</th>
<th>Recall</th>
<th>Precision</th>
<th>F1_Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>Anger</td>
<td>0.83</td>
<td>0.85</td>
<td>0.84</td>
</tr>
<tr>
<td>Happy</td>
<td>0.59</td>
<td>0.54</td>
<td>0.57</td>
</tr>
<tr>
<td>Neutral</td>
<td>0.70</td>
<td>0.88</td>
<td>0.78</td>
</tr>
<tr>
<td>Sad</td>
<td>0.79</td>
<td>0.66</td>
<td>0.72</td>
</tr>
<tr>
<td>Weighted Acc</td>
<td>0.74</td>
<td>0.74</td>
<td>0.74</td>
</tr>
<tr>
<td>Un-weighted Acc</td>
<td>0.73</td>
<td>0.73</td>
<td>0.73</td>
</tr>
</tbody>
</table></table-wrap>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The class-wise accuracy of the proposed technique is illustrated in (a), and the confusion matrix between the actual labels and the predicted labels is shown in (b)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-4.png"/>
</fig>
<p>The graph in <xref ref-type="fig" rid="fig-4">Fig. 4a</xref> shows the class-wise recognition results, and <xref ref-type="fig" rid="fig-4">Fig. 4b</xref> shows the confusion matrix of the IEMOCAP dataset. The model recognizes the anger, sad, neutral, and happy emotions with 83%, 79%, 70%, and 59% accuracy, respectively. Our model is most often confused by the happy emotion: 12% of the anger class, 9% of the neutral class, and 10% of the sad class are recognized as happy. The model also mixes the sad and happy emotions with each other, recognizing 26% of happy as sad and 10% of sad as happy. Nevertheless, the overall performance of the model on the IEMOCAP dataset is better than the state-of-the-art SER models, with an overall average recall rate of 72.75%. Similarly, the classification performance of the proposed system for the EMO-DB dataset is illustrated in <xref ref-type="table" rid="table-3">Tab. 3</xref>.</p>
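The two summary scores used throughout the paper differ only in how classes are averaged: the un-weighted accuracy is the plain mean of per-class recalls, while the weighted accuracy weights each class recall by its sample count. A minimal sketch, taking the per-class recalls from Tab. 2 and using class supports that are assumptions for illustration:

```python
import numpy as np

# Per-class recalls from Tab. 2 (anger, happy, neutral, sad);
# the class supports below are assumed counts, for illustration only.
recalls  = np.array([0.83, 0.59, 0.70, 0.79])
supports = np.array([1103, 1636, 1708, 1084])

# Un-weighted accuracy (unweighted average recall): simple mean over classes.
ua = recalls.mean()

# Weighted accuracy: recalls weighted by class frequency.
wa = (recalls * supports).sum() / supports.sum()
```

Note that the simple mean of the four recalls reproduces the 72.75% average recall reported for IEMOCAP; the weighted score depends on the assumed supports.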
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Prediction performance of the proposed model in terms of precision, recall, F1_score, weighted, and un-weighted score of the EMO-DB dataset</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Classes</th>
<th>Recall</th>
<th>Precision</th>
<th>F1_Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>Anger</td>
<td>0.98</td>
<td>0.96</td>
<td>0.97</td>
</tr>
<tr>
<td>Boredom</td>
<td>0.93</td>
<td>0.99</td>
<td>0.96</td>
</tr>
<tr>
<td>Disgust</td>
<td>0.91</td>
<td>1.00</td>
<td>0.95</td>
</tr>
<tr>
<td>Fear</td>
<td>0.97</td>
<td>0.94</td>
<td>0.95</td>
</tr>
<tr>
<td>Happy</td>
<td>0.69</td>
<td>0.95</td>
<td>0.80</td>
</tr>
<tr>
<td>Neutral</td>
<td>0.94</td>
<td>0.81</td>
<td>0.87</td>
</tr>
<tr>
<td>Sad</td>
<td>0.96</td>
<td>0.61</td>
<td>0.75</td>
</tr>
<tr>
<td>Weighted Acc</td>
<td>0.90</td>
<td>0.92</td>
<td>0.92</td>
</tr>
<tr>
<td>Un-weighted Acc</td>
<td>0.89</td>
<td>0.91</td>
<td>0.90</td>
</tr>
</tbody>
</table></table-wrap>
<p>The classification summary shows that the overall prediction performance of the system on the EMO-DB <italic>corpus</italic> is good, although the model again shows lower performance for the happy emotion due to its lower linguistic information. The model achieves high performance for the other emotions, such as anger, sad, fear, disgust, neutral, and boredom, each with a recognition rate of more than 90%. As before, the model confuses the happy emotion with the other emotions. To investigate this further, we generated the class-wise accuracy and the confusion matrix of the EMO-DB dataset to check the confusion ratio of each class with the other emotions. The confusion matrix and the class-level accuracy are shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Class-wise accuracy of the proposed technique is illustrated in (a), and the confusion matrix between the actual labels and the predicted labels is shown in (b)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-5.png"/>
</fig>
<p>The class-wise accuracy is presented in <xref ref-type="fig" rid="fig-5">Fig. 5a</xref>, which shows the recognition ratio of each class; the x-axis indicates the classes, and the y-axis shows the recognition accuracy of the corresponding class. <xref ref-type="fig" rid="fig-5">Fig. 5b</xref> illustrates the confusion matrix between the actual and predicted labels of the EMO-DB dataset. The diagonal values of the confusion matrix represent the actual recall of the proposed model for each emotion. The recognition rates for anger, fear, sadness, boredom, and disgust each exceed 90%, which is much greater than the recognition rate for the happy emotion. The proposed model raises the precision of the happy emotion above the baseline model, although it remains lower than that of the other emotions. The average recall rate of the system is 91.14% for the EMO-DB dataset. The prediction performance of the suggested technique for the RAVDESS dataset is presented in <xref ref-type="table" rid="table-4">Tab. 4</xref>.</p>
<p><xref ref-type="table" rid="table-4">Tab. 4</xref> presents the model's prediction summary for the RAVDESS dataset, which was recently released for emotion recognition in natural speech and song. The model secured weighted and un-weighted accuracies of 80% and 78%, respectively, over all classes. The individual recognition ratio of each class is much better than that of the state-of-the-art SER methods. Our model improves the overall performance, including for the happy emotion; nevertheless, the recognition rate for happy remains relatively lower than the others due to its lower linguistic information, and the model mixes the happy emotion with the others during prediction. We further investigated the model performance by computing the class-wise accuracy and the confusion matrix, which are shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Prediction performance of the proposed model in terms of precision, recall, F1_score, weighted score, and un-weighted score of the RAVDESS dataset</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Classes</th>
<th>Recall</th>
<th>Precision</th>
<th>F1_Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>Anger</td>
<td>0.91</td>
<td>0.87</td>
<td>0.89</td>
</tr>
<tr>
<td>Calm</td>
<td>0.90</td>
<td>0.52</td>
<td>0.66</td>
</tr>
<tr>
<td>Disgust</td>
<td>0.80</td>
<td>0.98</td>
<td>0.88</td>
</tr>
<tr>
<td>Fearful</td>
<td>0.79</td>
<td>0.82</td>
<td>0.80</td>
</tr>
<tr>
<td>Happy</td>
<td>0.50</td>
<td>0.92</td>
<td>0.65</td>
</tr>
<tr>
<td>Neutral</td>
<td>0.67</td>
<td>0.63</td>
<td>0.65</td>
</tr>
<tr>
<td>Sad</td>
<td>0.80</td>
<td>0.94</td>
<td>0.86</td>
</tr>
<tr>
<td>Surprise</td>
<td>0.91</td>
<td>0.81</td>
<td>0.86</td>
</tr>
<tr>
<td>Weighted Acc</td>
<td>0.83</td>
<td>0.80</td>
<td>0.80</td>
</tr>
<tr>
<td>Un-weighted Acc</td>
<td>0.78</td>
<td>0.81</td>
<td>0.78</td>
</tr>
</tbody>
</table></table-wrap>
<fig id="fig-6">
<label>Figure 6</label> 
<caption>
<title>Class-wise accuracy of the proposed technique is illustrated in (a), and the confusion matrix between the actual and the predicted labels is shown in (b)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-6.png"/>
</fig>
<p>The class-wise accuracy, which represents the actual performance of each class in percentages, is shown in <xref ref-type="fig" rid="fig-6">Fig. 6a</xref>; the x-axis lists the emotions, and the y-axis shows the accuracy in percentages. <xref ref-type="fig" rid="fig-6">Fig. 6b</xref> shows the RAVDESS confusion matrix, which relates the actual emotions to the predicted emotions. The model secured 91%, 90%, 80%, 79%, 50%, 67%, 80%, and 91% recognition scores for anger, calm, disgust, fearful, happy, neutral, sad, and surprise, respectively. The system again recognizes the happy emotion with low accuracy, although its recognition rate for happiness is better than that of the baseline methods. Hence, the overall prediction performance of the proposed SER system is better than the state-of-the-art methods.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparative Analysis</title>
<p>We compared the proposed model with the baseline SER methods in order to show the improvement, robustness, and effectiveness of the system. We used the un-weighted accuracy metric, which is most commonly used in the literature to report the recognition rate of a system, and compared the un-weighted accuracy of the proposed system with that of the other baseline methods. The SER literature contains few 1D-CNN architectures, apart from a limited number of articles that still do not show good performance improvements. In contrast, we propose a new one-dimensional DCNN-GRU system for SER that uses HFLBs with dilated convolution layers to efficiently recognize emotional features. We evaluated the suggested system on three standard SER datasets and compared the results with the baseline techniques to show the model's efficiency. Essentially, our model utilizes new types of architectures, such as the dilated DCNN blocks, which extract the emotional features and reduce the processing time for model training. <xref ref-type="table" rid="table-5">Tabs. 5</xref> and <xref ref-type="table" rid="table-6">6</xref> compare the proposed system with the state-of-the-art systems in terms of accuracy and processing time.</p>
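The benefit of the dilated convolution layers mentioned above is that stacking them enlarges the receptive field without adding parameters. A minimal NumPy sketch of a dilated 1D convolution; the kernel, signal, and dilation rates are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid-mode 1D convolution with a dilated kernel (cross-correlation,
    as used in deep learning frameworks, i.e., no kernel flipping)."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field of one layer
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(16, dtype=float)             # toy 1D speech signal
w = np.array([1.0, 1.0, 1.0])              # 3-tap kernel

y1 = dilated_conv1d(x, w, dilation=1)      # receptive field 3
y2 = dilated_conv1d(x, w, dilation=2)      # receptive field 5
y4 = dilated_conv1d(x, w, dilation=4)      # receptive field 9
```

Doubling the dilation rate at each stacked layer thus grows the temporal context exponentially while each layer keeps the same kernel size.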
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Comparative analysis of the proposed SER system with other baseline methods over three benchmark speech datasets. Our model outperformed the baseline methods</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th colspan="3">IEMOCAP</th>
<th colspan="3">EMO-DB</th>
<th colspan="3">RAVDESS</th>
</tr>
<tr>
<th>Year</th>
<th>Reference</th>
<th>Accuracy (%)</th>
<th>Year</th>
<th>Reference</th>
<th>Accuracy (%)</th>
<th>Year</th>
<th>Reference</th>
<th>Accuracy (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>52.14</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>84.49</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>64.48</td>
</tr>
<tr>
<td>2017</td>
<td>[<xref ref-type="bibr" rid="ref-20">20</xref>]</td>
<td>64.78</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>88.99</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-52">52</xref>]</td>
<td>69.40</td>
</tr>
<tr>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-45">45</xref>]</td>
<td>57.10</td>
<td>2018</td>
<td>[<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>82.82</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-53">53</xref>]</td>
<td>75.79</td>
</tr>
<tr>
<td>2015</td>
<td>[<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>40.02</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-15">15</xref>]</td>
<td>80.79</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-54">54</xref>]</td>
<td>67.14</td>
</tr>
<tr>
<td>2014</td>
<td>[<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td>51.24</td>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>84.53</td>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>71.61</td>
</tr>
<tr>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>69.32</td>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>86.10</td>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>77.01</td>
</tr>
<tr>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-48">48</xref>]</td>
<td>66.50</td>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>85.57</td>
<td/>
</tr>
<tr>
<td>2018</td>
<td>[<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>63.98</td>
<td/>
</tr>
<tr>
<td>2019</td>
<td>[<xref ref-type="bibr" rid="ref-21">21</xref>]</td>
<td>61.60</td>
<td/>
</tr>
<tr>
<td>2018</td>
<td>[<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>64.74</td>
<td/>
</tr>
<tr>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>64.03</td>
<td/>
</tr>
<tr>
<td>2020</td>
<td>[<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>71.25</td>
<td/>
</tr>
<tr>
<td><bold>Proposed</bold></td>
<td/>
<td><bold>72.75</bold></td>
<td/>
<td/>
<td><bold>91.14</bold></td>
<td/>
<td/>
<td><bold>78.01</bold></td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>A comparison of the processing time of the proposed 1D-DCNN system with other SER models. Our system has less processing time due to its simple architecture</title>
</caption>
<table><colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Scheme</th>
<th>IEMOCAP-DB (s)</th>
<th>RAVDESS-DB (s)</th>
<th>EMO-DB (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ACRNN [<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>13487</td>
<td>&#x2013;</td>
<td>6811</td>
</tr>
<tr>
<td>ADRNN [<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
<td>13887</td>
<td>&#x2013;</td>
<td>7187</td>
</tr>
<tr>
<td>CL-SER [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>10452</td>
<td>6250</td>
<td>5396</td>
</tr>
<tr>
<td><bold>Prop 1D-CNN</bold></td>
<td><bold>8200</bold></td>
<td><bold>3970</bold></td>
<td><bold>3150</bold></td>
</tr>
</tbody>
</table></table-wrap>
<p>We compared our model with different one-dimensional and two-dimensional CNN models, which are shown in the tables above. In [<xref ref-type="bibr" rid="ref-44">44</xref>], the authors used a 1D-CNN architecture with local feature learning blocks to extract hand-crafted features from the speech signals, which are passed to a sequential network that extracts temporal cues for emotion recognition. The authors tried to improve the recognition rate, but such simple, local features are not suitable for efficient SER models; the 1D-CNN model in [<xref ref-type="bibr" rid="ref-44">44</xref>] achieved an accuracy of 52% on the IEMOCAP dataset. The other compared studies [<xref ref-type="bibr" rid="ref-45">45</xref>,<xref ref-type="bibr" rid="ref-46">46</xref>] used 2D [<xref ref-type="bibr" rid="ref-48">48</xref>,<xref ref-type="bibr" rid="ref-50">50</xref>] and 3D-CNN [<xref ref-type="bibr" rid="ref-47">47</xref>,<xref ref-type="bibr" rid="ref-49">49</xref>] models for SER [<xref ref-type="bibr" rid="ref-51">51</xref>], with some significant variations such as a bagged support vector machine [<xref ref-type="bibr" rid="ref-52">52</xref>,<xref ref-type="bibr" rid="ref-53">53</xref>] and a capsule network [<xref ref-type="bibr" rid="ref-54">54</xref>]. These models secured up to 70% accuracy on IEMOCAP, 85% on EMO-DB, and approximately 76% on the RAVDESS dataset. The prediction performance of the suggested system was 72.75%, 91.14%, and 78.01% for the IEMOCAP, EMO-DB, and RAVDESS databases, respectively.</p>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Discussion</title>
<p>We used a novel strategy for SER: the proposed framework contributes a new 1D-DCNN model with hierarchical feature learning blocks (HFLBs) to recognize emotional features. <xref ref-type="table" rid="table-5">Tab. 5</xref> presents a comparative study of the model with the other state-of-the-art systems under the same conditions, i.e., the same datasets and accuracy metric. As <xref ref-type="table" rid="table-5">Tab. 5</xref> shows, researchers have addressed the SER problem with different techniques, but they have mostly used 2D-CNN architectures, which are built for visual data recognition and classification in the field of computer vision [<xref ref-type="bibr" rid="ref-55">55</xref>]. With this strategy, some paralinguistic cues in the speech signals are lost, and better accuracy for emotion recognition is not achieved. To address this limitation, we proposed a 1D-DCNN model that accepts the speech data directly in order to extract features and recognize paralinguistic information, such as emotions. Our model predicts emotions with a higher accuracy rate than the prior emotion recognition models, as shown in <xref ref-type="table" rid="table-5">Tab. 5</xref>. Our system computes class probabilities for each segment, and the class with the maximum average probability is selected as the label. We used the equation given below to compute the class label.</p>
<p><disp-formula id="eqn-17">
<label>(17)</label>
<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-17.png"/>
<tex-math id="tex-eqn-17"><![CDATA[$$\begin{equation}
\mathrm{L}_{\mathrm{u}}={}_{\mathrm{k}=1,\ldots \mathrm{k}}^{\text{a}\text{r}\text{g}\text{m}\text{a}\text{x}}\frac{\sum_{\mathrm{t}=1}^{\mathrm{T}}\mathrm{p}(\mathrm{y}_{\mathrm{t}}|\mathrm{x})}{\mathrm{T}}
 \label{eqn-17}
\end{equation}$$]]></tex-math>
<mml:math id="mml-eqn-17" display="block"><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>L</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>u</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo lspace='0pt' rspace='0pt'>&#x2026;</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>k</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>r</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>m</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle><mml:mstyle><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:mstyle displaystyle='true'><mml:msubsup><mml:mrow><mml:mo>&#x2211;</mml:mo> </mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle><mml:mo lspace='0pt' rspace='0pt'>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow></mml:msubsup></mml:mstyle></mml:mstyle><mml:mstyle mathvariant="normal"><mml:mi>p</mml:mi></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>y</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>t</mml:mi></mml:mstyle></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>T</mml:mi></mml:mstyle></mml:mrow></mml:mfrac></mml:math></alternatives></disp-formula></p>
<p>In the equation, k represents the number of classes, excluding the silent class, and Lu is the predicted label of the corresponding utterance. We evaluated the proposed method on three different SER datasets to demonstrate the robustness, effectiveness, and generalization of the system. Furthermore, our model offers additional practical advantages: it can produce real-time output and process the data efficiently, it is capable of handling speech of arbitrary length without a reduction in performance, and it can deal with speech that contains more than one emotion class.</p>
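Eq. (17) can be sketched directly: average the per-segment class posteriors p(y_t|x) over the T segments of an utterance and take the argmax over the K classes. The segment probabilities below are illustrative values, not model outputs.

```python
import numpy as np

def predict_utterance_label(segment_probs):
    """Eq. (17): average the per-segment class posteriors p(y_t | x)
    over the T segments of an utterance, then take the argmax over
    the K classes to obtain the utterance label L_u."""
    segment_probs = np.asarray(segment_probs)   # shape (T, K)
    avg = segment_probs.mean(axis=0)            # (1/T) * sum_t p(y_t | x)
    return int(np.argmax(avg)), avg

# Three segments of one utterance over K = 4 classes (illustrative values).
probs = [
    [0.10, 0.60, 0.20, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.40, 0.25, 0.10],
]
label, avg = predict_utterance_label(probs)     # class 1 has the highest mean
```

Because the decision is made on the averaged posterior rather than on any single segment, utterances of arbitrary length are handled uniformly.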
</sec>
<sec id="s6">
<label>6</label>
<title>Conclusion and Future Direction</title>
<p>Speech emotion recognition (SER) still faces many challenges and limitations in the literature that need to be addressed with an efficient approach. We explored various deep learning approaches for SER tasks, conducted different experiments, and proposed a new SER architecture that utilizes a one-dimensional plain DCNN strategy with stacked HFLBs and a GRU network. Our model investigates the emotional cues in speech signals through local and global hierarchical correlations over the raw audio signals. We used the 1D-DCNN architecture with deep BiGRU networks to capture sequential and temporal dependencies, so the proposed model learns both local information and global contextual cues from the speech signals and recognizes the emotional state of the speaker. We tested the suggested system on three benchmark databases and achieved recognition accuracies of 72.75%, 91.14%, and 78.01% for the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively, which demonstrates the robustness and significance of the system.</p>
<p>This work has many future directions, including integration with an automatic speech recognition (ASR) system; our work can easily be integrated in this domain to exploit a mutual understanding of the paralinguistic and linguistic elements of speech and develop a superior model for speech processing. Similarly, the proposed architecture can be further explored for SER using deep belief networks, graph networks, and spiking networks, and it is also useful for speaker recognition and identification in order to achieve better results with satisfactory computational costs.</p>
</sec>
</body>
<back>
<fn-group><fn fn-type="other"><p><bold>Funding Statement:</bold> This work was supported by the National Research Foundation of Korea funded by the Korean Government through the Ministry of Science and ICT under Grant NRF-2020R1F1A1060659 and in part by the 2020 Faculty Research Fund of Sejong University.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn></fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. A.</given-names> <surname>Naqvi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Arsalan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>A. U.</given-names> <surname>Rehman</surname></string-name>, <string-name><given-names>W. K.</given-names> <surname>Loh</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Deep learning-based drivers emotion classification system in time series data for remote applications</article-title>,&#x201D; <source>Remote Sensing</source>, vol. <volume>12</volume>, no. <issue>3</issue>, pp. <fpage>587</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. Z.</given-names> <surname>Bong</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Wan</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Murugappan</surname></string-name>, <string-name><given-names>N. M.</given-names> <surname>Ibrahim</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Rajamanickam</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Implementation of wavelet packet transform and non linear analysis for emotion classification in stroke patient using brain signals</article-title>,&#x201D; <source>Biomedical Signal Processing and Control</source>, vol. <volume>36</volume>, no. <issue>12</issue>, pp. <fpage>102</fpage>&#x2013;<lpage>112</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>C. T.</given-names> <surname>Chou</surname></string-name></person-group>, &#x201C;<article-title>From real to complex: Enhancing radio-based activity recognition using complex-valued CSI</article-title>,&#x201D; <source>ACM Transactions on Sensor Networks</source>, vol. <volume>15</volume>, no. <issue>3</issue>, pp. <fpage>35</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Swain</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Routray</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Kabisatpathy</surname></string-name></person-group>, &#x201C;<article-title>Databases, features and classifiers for speech emotion recognition: A review</article-title>,&#x201D; <source>International Journal of Speech Technology</source>, vol. <volume>21</volume>, no. <issue>1</issue>, pp. <fpage>93</fpage>&#x2013;<lpage>120</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustaqeem</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kwon</surname></string-name></person-group>, &#x201C;<article-title>A CNN-assisted enhanced audio signal processing for speech emotion recognition</article-title>,&#x201D; <source>Sensors</source>, vol. <volume>20</volume>, pp. <fpage>183</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Demircan</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Kahramanli</surname></string-name></person-group>, &#x201C;<article-title>Application of fuzzy c-means clustering algorithm to spectral features for emotion classification from speech</article-title>,&#x201D; <source>Neural Computing and Applications</source>, vol. <volume>29</volume>, no. <issue>8</issue>, pp. <fpage>59</fpage>&#x2013;<lpage>66</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustaqeem</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sajjad</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kwon</surname></string-name></person-group>, &#x201C;<article-title>Clustering-based speech emotion recognition by incorporating learned features and deep bilstm</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>8</volume>, pp. <fpage>79861</fpage>&#x2013;<lpage>79875</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustaqeem</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kwon</surname></string-name></person-group>, &#x201C;<article-title>MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach</article-title>,&#x201D; <source>Expert Systems with Applications</source>, pp. <fpage>114177</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Yan</surname></string-name></person-group>, &#x201C;<article-title>Text-independent phoneme segmentation combining EGG and speech data</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>24</volume>, no. <issue>6</issue>, pp. <fpage>1029</fpage>&#x2013;<lpage>1037</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. U.</given-names> <surname>Khan</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Baik</surname></string-name></person-group>, &#x201C;<article-title>MPPIF-Net: Identification of Plasmodium falciparum parasite mitochondrial proteins using deep features with multilayer bi-directional LSTM</article-title>,&#x201D; <source>Processes</source>, vol. <volume>8</volume>, pp. <fpage>725</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Tripathi</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Singh</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Yenigalla</surname></string-name></person-group>, &#x201C;<article-title>Deep learning based emotion recognition system using speech features and transcriptions</article-title>,&#x201D; <comment>arXiv preprint arXiv:1906.05681</comment>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Karim</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Majumdar</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Darabi</surname></string-name></person-group>, &#x201C;<article-title>Insights into LSTM fully convolutional networks for time series classification</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>67718</fpage>&#x2013;<lpage>67725</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhiyan</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Jian</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition based on deep learning and kernel nonlinear PSVM</article-title>,&#x201D; in <conf-name>2019 Chinese Control and Decision Conf.</conf-name>, Nanchang, China, pp. <fpage>1426</fpage>&#x2013;<lpage>1430</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>U.</given-names> <surname>Fiore</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Florea</surname></string-name> and <string-name><given-names>G.</given-names> <surname>P&#x00E9;rez Lechuga</surname></string-name></person-group>, &#x201C;<article-title>An interdisciplinary review of smart vehicular traffic and its applications and challenges</article-title>,&#x201D; <source>Journal of Sensor and Actuator Networks</source>, vol. <volume>8</volume>, no. <issue>1</issue>, pp. <fpage>13</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. M.</given-names> <surname>Badshah</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Rahim</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Ullah</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ahmad</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Muhammad</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Deep features-based speech emotion recognition for smart affective services</article-title>,&#x201D; <source>Multimedia Tools and Applications</source>, vol. <volume>78</volume>, no. <issue>5</issue>, pp. <fpage>5571</fpage>&#x2013;<lpage>5589</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Busso</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Bulut</surname></string-name>, <string-name><given-names>C.-C.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kazemzadeh</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Mower</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>IEMOCAP: Interactive emotional dyadic motion capture database</article-title>,&#x201D; <source>Language Resources and Evaluation</source>, vol. <volume>42</volume>, no. <issue>4</issue>, pp. <fpage>335</fpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. R.</given-names> <surname>Livingstone</surname></string-name> and <string-name><given-names>F. A.</given-names> <surname>Russo</surname></string-name></person-group>, &#x201C;<article-title>The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English</article-title>,&#x201D; <source>PLoS One</source>, vol. <volume>13</volume>, no. <issue>5</issue>, pp. <fpage>e0196391</fpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Kang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>A visual-physiology multimodal system for detecting outlier behavior of participants in a reality TV show</article-title>,&#x201D; <source>International Journal of Distributed Sensor Networks</source>, vol. <volume>15</volume>, pp. <fpage>1550147719864886</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Dias</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Abad</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Trancoso</surname></string-name></person-group>, &#x201C;<article-title>Exploring hashing and cryptonet based approaches for privacy-preserving speech emotion recognition</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Acoustics, Speech and Signal Processing</conf-name>, Calgary, Alberta, Canada, pp. <fpage>2057</fpage>&#x2013;<lpage>2061</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H. M.</given-names> <surname>Fayek</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lech</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Cavedon</surname></string-name></person-group>, &#x201C;<article-title>Evaluating deep learning architectures for speech emotion recognition</article-title>,&#x201D; <source>Neural Networks</source>, vol. <volume>92</volume>, pp. <fpage>60</fpage>&#x2013;<lpage>68</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Zhou</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Memento: An emotion-driven lifelogging system with wearables</article-title>,&#x201D; <source>ACM Transactions on Sensor Networks</source>, vol. <volume>15</volume>, no. <issue>1</issue>, pp. <fpage>8</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. A.</given-names> <surname>Khalil</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Jones</surname></string-name>, <string-name><given-names>M. I.</given-names> <surname>Babar</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Jan</surname></string-name>, <string-name><given-names>M. H.</given-names> <surname>Zafar</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Speech emotion recognition using deep learning techniques: A review</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>117327</fpage>&#x2013;<lpage>117345</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Khamparia</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>N. G.</given-names> <surname>Nguyen</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Khanna</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Pandey</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Sound classification using convolutional neural network and tensor deep stacking network</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>7717</fpage>&#x2013;<lpage>7727</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Tashev</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition using deep neural network and extreme learning machine</article-title>,&#x201D; <source>Fifteenth Annual Conf. of the Int. Speech Communication Association</source>, vol. <volume>1</volume>, pp. <fpage>1</fpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Xia</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Heart ID: Human identification based on radar micro-Doppler signatures of the heart using deep learning</article-title>,&#x201D; <source>Remote Sensing</source>, vol. <volume>11</volume>, pp. <fpage>1220</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name></person-group>, &#x201C;<article-title>ImageNet classification with deep convolutional neural networks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>25</volume>, pp. <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Simonyan</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Zisserman</surname></string-name></person-group>, &#x201C;<article-title>Very deep convolutional networks for large-scale image recognition</article-title>,&#x201D; <comment>arXiv preprint arXiv:1409.1556</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E. N. N.</given-names> <surname>Ocquaye</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Mao</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Xue</surname></string-name></person-group>, &#x201C;<article-title>Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>93847</fpage>&#x2013;<lpage>93857</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T. N.</given-names> <surname>Sainath</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Senior</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Sak</surname></string-name></person-group>, &#x201C;<article-title>Convolutional, long short-term memory, fully connected deep neural networks</article-title>,&#x201D; <source>IEEE Int. Conf. on Acoustics, Speech and Signal Processing</source>, vol. <volume>1</volume>, pp. <fpage>4580</fpage>&#x2013;<lpage>4584</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustaqeem</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kwon</surname></string-name></person-group>, &#x201C;<article-title>CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network</article-title>,&#x201D; <source>Mathematics</source>, vol. <volume>8</volume>, pp. <fpage>2133</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Jia</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Meng</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Emotion recognition from variable-length speech segments using deep learning on spectrograms</article-title>,&#x201D; <source>Interspeech</source>, vol. <volume>1</volume>, pp. <fpage>3683</fpage>&#x2013;<lpage>3687</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zhu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Li</surname></string-name></person-group>, &#x201C;<article-title>Spiking echo state convolutional neural network for robust time series classification</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>4927</fpage>&#x2013;<lpage>4935</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z. T.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>W. H.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>J. W.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>J. P.</given-names> <surname>Xu</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition based on feature selection and extreme learning machine decision tree</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>273</volume>, no. <issue>10</issue>, pp. <fpage>271</fpage>&#x2013;<lpage>280</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Dave</surname></string-name></person-group>, &#x201C;<article-title>Feature extraction methods LPC, PLP and MFCC in speech recognition</article-title>,&#x201D; <source>International Journal for Advance Research in Engineering and Technology</source>, vol. <volume>1</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>4</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Mao</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zhan</surname></string-name></person-group>, &#x201C;<article-title>Learning salient features for speech emotion recognition using convolutional neural networks</article-title>,&#x201D; <source>IEEE Transactions on Multimedia</source>, vol. <volume>16</volume>, no. <issue>8</issue>, pp. <fpage>2203</fpage>&#x2013;<lpage>2213</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>K. K. R.</given-names> <surname>Choo</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>SVM or deep learning? A comparative study on remote sensing image classification</article-title>,&#x201D; <source>Soft Computing</source>, vol. <volume>21</volume>, no. <issue>23</issue>, pp. <fpage>7053</fpage>&#x2013;<lpage>7065</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Multi-cue fusion for emotion recognition in the wild</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>309</volume>, no. <issue>5</issue>, pp. <fpage>27</fpage>&#x2013;<lpage>35</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zou</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Investigation on joint representation learning for robust feature extraction in speech emotion recognition</article-title>,&#x201D; <source>Interspeech</source>, vol. <volume>1</volume>, pp. <fpage>152</fpage>&#x2013;<lpage>156</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zeng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Mao</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Peng</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Yi</surname></string-name></person-group>, &#x201C;<article-title>Spectrogram based multi-task audio classification</article-title>,&#x201D; <source>Multimedia Tools and Applications</source>, vol. <volume>78</volume>, no. <issue>3</issue>, pp. <fpage>3705</fpage>&#x2013;<lpage>3722</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name></person-group>, &#x201C;<article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>,&#x201D; <source>Journal of Machine Learning Research</source>, vol. <volume>15</volume>, pp. <fpage>1929</fpage>&#x2013;<lpage>1958</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Upadhyay</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Karmakar</surname></string-name></person-group>, &#x201C;<article-title>Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study</article-title>,&#x201D; <source>Procedia Computer Science</source>, vol. <volume>54</volume>, no. <issue>2</issue>, pp. <fpage>574</fpage>&#x2013;<lpage>584</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gulcehre</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>,&#x201D; <comment>arXiv preprint arXiv:1412.3555</comment>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Burkhardt</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Paeschke</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Rolfes</surname></string-name>, <string-name><given-names>W. F.</given-names> <surname>Sendlmeier</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Weiss</surname></string-name></person-group>, &#x201C;<article-title>A database of German emotional speech</article-title>,&#x201D; <source>Ninth European Conf. on Speech Communication and Technology</source>, vol. <volume>1</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Mao</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition using deep 1D &#x0026; 2D CNN LSTM networks</article-title>,&#x201D; <source>Biomedical Signal Processing and Control</source>, vol. <volume>47</volume>, no. <issue>4</issue>, pp. <fpage>312</fpage>&#x2013;<lpage>323</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Dang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Guan</surname></string-name></person-group>, &#x201C;<article-title>Exploration of complementary features for speech emotion recognition based on kernel extreme learning machine</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>75798</fpage>&#x2013;<lpage>75809</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yu</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Zou</surname></string-name></person-group>, &#x201C;<article-title>An experimental study of speech emotion recognition based on deep convolutional neural networks</article-title>,&#x201D; <source>Int. Conf. on Affective Computing and Intelligent Interaction</source>, vol. <volume>1</volume>, pp. <fpage>827</fpage>&#x2013;<lpage>831</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Yuan</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Wei</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition from 3D log-mel spectrograms with deep learning network</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>125868</fpage>&#x2013;<lpage>125881</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Bao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Cummins</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>97515</fpage>&#x2013;<lpage>97525</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>3-D convolutional recurrent neural networks with attention model for speech emotion recognition</article-title>,&#x201D; <source>IEEE Signal Processing Letters</source>, vol. <volume>25</volume>, no. <issue>10</issue>, pp. <fpage>1440</fpage>&#x2013;<lpage>1444</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Issa</surname></string-name>, <string-name><given-names>M. F.</given-names> <surname>Demirci</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Yazici</surname></string-name></person-group>, &#x201C;<article-title>Speech emotion recognition with deep convolutional neural networks</article-title>,&#x201D; <source>Biomedical Signal Processing and Control</source>, vol. <volume>59</volume>, pp. <fpage>101894</fpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Jiang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Fu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Tao</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Lei</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name></person-group>, &#x201C;<article-title>Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>90368</fpage>&#x2013;<lpage>90377</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Loweimi</surname></string-name>, <string-name><given-names>R. K.</given-names> <surname>Moore</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Hain</surname></string-name></person-group>, &#x201C;<article-title>Learning temporal clusters using capsule routing for speech emotion recognition</article-title>,&#x201D; <source>Proc. Interspeech</source>, vol. <volume>1</volume>, pp. <fpage>1701</fpage>&#x2013;<lpage>1705</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Bhavan</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Chauhan</surname></string-name> and <string-name><given-names>R. R.</given-names> <surname>Shah</surname></string-name></person-group>, &#x201C;<article-title>Bagged support vector machines for emotion recognition from speech</article-title>,&#x201D; <source>Knowledge-Based Systems</source>, vol. <volume>184</volume>, no. <issue>3</issue>, pp. <fpage>104886</fpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="confproc"><person-group person-group-type="author"><string-name><given-names>A. A. A.</given-names> <surname>Zamil</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Hasan</surname></string-name>, <string-name><given-names>S. M. J.</given-names> <surname>Baki</surname></string-name>, <string-name><given-names>J. M.</given-names> <surname>Adam</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Zaman</surname></string-name></person-group>, &#x201C;<article-title>Emotion detection from speech signals using voting mechanism on classified frames</article-title>,&#x201D; <source>International Conf. on Robotics, Electrical and Signal Processing Techniques</source>, pp. <fpage>281</fpage>&#x2013;<lpage>285</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Khan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Ullah</surname></string-name>, <string-name><given-names>I. U.</given-names> <surname>Haq</surname></string-name>, <string-name><given-names>V. G.</given-names> <surname>Menon</surname></string-name> and <string-name><given-names>S. W.</given-names> <surname>Baik</surname></string-name></person-group>, &#x201C;<article-title>SD-Net: Understanding overcrowded scenes in real-time via an efficient dilated convolutional neural network</article-title>,&#x201D; <source>Journal of Real-Time Image Processing</source>, vol. <volume>1</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>15</lpage>, <year>2020</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>