<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">27379</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.027379</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Speech Enhancement via Mask-Mapping Based Residual Dense Network</article-title>
<alt-title alt-title-type="left-running-head">Speech Enhancement via Mask-Mapping Based Residual Dense Network</alt-title>
<alt-title alt-title-type="right-running-head">Speech Enhancement via Mask-Mapping Based Residual Dense Network</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhou</surname><given-names>Lin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>Linzhou@seu.edu.cn</email>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Xijin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Wu</surname><given-names>Chaoyan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Zhong</surname><given-names>Qiuyue</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Cheng</surname><given-names>Xu</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Tang</surname><given-names>Yibin</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>School of Information Science and Engineering, Southeast University</institution>, <addr-line>Nanjing, 210096</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Center for Machine Vision and Signal Analysis, University of Oulu</institution>, <addr-line>Oulu, FI-90014</addr-line>, <country>Finland</country></aff>
<aff id="aff-3"><label>3</label><institution>College of IOT Engineering, Hohai University</institution>, <addr-line>Changzhou, 213022</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Lin Zhou. Email: <email>Linzhou@seu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-16"><day>16</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>74</volume>
<issue>1</issue>
<fpage>1259</fpage>
<lpage>1277</lpage>
<history>
<date date-type="received"><day>16</day><month>1</month><year>2022</year></date>
<date date-type="accepted"><day>06</day><month>4</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Zhou et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhou et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_27379.pdf"></self-uri>
<abstract>
<p>Masking-based and spectrum mapping-based methods are the two main deep neural network (DNN) approaches to speech enhancement. However, mapping-based methods reuse only the phase of the noisy speech, which limits the upper bound of enhancement performance, while masking-based methods must estimate the mask accurately, which remains the key difficulty. Combining the advantages of these two approaches, this paper proposes MM-RDN (masking-mapping residual dense network), a speech enhancement algorithm based on masking-mapping (MM) and a residual dense network (RDN). From the logarithmic power spectrogram (LPS) of consecutive frames, MM estimates the ideal ratio mask (IRM) matrix of those frames. The RDN makes full use of the feature maps of all layers and, through global residual learning that combines shallow and deep features, obtains global dense features from the LPS, thereby improving the accuracy of the estimated IRM matrix. Simulations show that the proposed method achieves attractive speech enhancement performance in various acoustic environments. Specifically, in untrained acoustic conditions with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and unmatched noise category, MM-RDN still outperforms the existing convolutional recurrent network (CRN) method in perceptual evaluation of speech quality (PESQ) and other evaluation indexes, indicating that the proposed algorithm generalizes better to untrained conditions.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Mask-mapping-based method</kwd>
<kwd>residual dense block</kwd>
<kwd>speech enhancement</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>Speech enhancement is a fundamental task in speech signal processing and is widely used in various scenarios, e.g., mobile phones, intelligent vehicles [<xref ref-type="bibr" rid="ref-1">1</xref>] and medical devices [<xref ref-type="bibr" rid="ref-2">2</xref>,<xref ref-type="bibr" rid="ref-3">3</xref>]. It serves as a front-end procedure for automatic speech recognition (ASR), speaker identification, hearing aids and cochlear implants. At present, speech enhancement based on deep learning (DL) is treated as a supervised learning problem and, according to the training target, is divided into two categories: spectrum mapping [<xref ref-type="bibr" rid="ref-4">4</xref>] and masking [<xref ref-type="bibr" rid="ref-5">5</xref>].</p>
<p>The masking-based method separates clean speech from background interference by estimating a masking value that describes the time-frequency relationship of clean speech to noise. In a causal system, the mask of the current time-frequency (TF) unit is generally estimated from the features of the current and previous frames. The ideal binary mask (IBM) is the most commonly used mask and was first adopted in DL-based speech separation [<xref ref-type="bibr" rid="ref-4">4</xref>], where a pre-trained DNN estimates the IBM on each sub-band. The DNN with support vector machines (DNN-SVM) system demonstrates good generalization [<xref ref-type="bibr" rid="ref-6">6</xref>]. Besides the IBM, the IRM [<xref ref-type="bibr" rid="ref-7">7</xref>], complex IRM [<xref ref-type="bibr" rid="ref-8">8</xref>], phase-sensitive mask (PSM) [<xref ref-type="bibr" rid="ref-9">9</xref>] and spectral magnitude mask (SMM) [<xref ref-type="bibr" rid="ref-10">10</xref>] have also been designed as training targets. In terms of speech quality, ratio masking performs better than binary masking. In 2014, Wang et al. used a DNN to estimate the IBM and IRM, showing that DNN-based mask estimation can significantly improve speech enhancement. Overall, the IRM and the SMM are the preferred targets, and ratio-masking-based DNNs outperform unsupervised speech enhancement&#x00A0;[<xref ref-type="bibr" rid="ref-11">11</xref>].</p>
<p>The mapping-based method estimates the magnitude spectrogram or temporal representation of clean speech directly from noisy speech, which naturally avoids the mask selection of the masking-based method. Related research indicates that mapping is superior to masking at low SNRs [<xref ref-type="bibr" rid="ref-12">12</xref>]. A deep autoencoder (DAE) was the first algorithm proposed to map the Mel-power spectrum of degraded speech to that of clean speech [<xref ref-type="bibr" rid="ref-5">5</xref>]. Later research used the log spectral magnitude and the log Mel-spectrum in DL-based speech separation [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>], and DNNs were also exploited for LPS mapping [<xref ref-type="bibr" rid="ref-15">15</xref>]. Compared with a DNN, a convolutional neural network (CNN) obtains more accurate local features, which better recovers the high frequencies of the speech signal and improves the quality and intelligibility of the enhanced speech [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>]. Generative adversarial networks (GANs) learn the nonlinear transformation from noisy to clean speech through adversarial training and generalize to untrained conditions [<xref ref-type="bibr" rid="ref-18">18</xref>]. However, DNN-, GAN- and CNN-based speech enhancement rarely considers the temporal characteristics of speech, which limits enhancement performance. With self-feedback neurons, a recurrent neural network (RNN) can process sequential signals and achieves better performance on speech enhancement [<xref ref-type="bibr" rid="ref-19">19</xref>]. Optimizing an RNN via back-propagation through time (BPTT) suffers from vanishing and exploding gradients [<xref ref-type="bibr" rid="ref-20">20</xref>]; the long short-term memory recurrent neural network (LSTM-RNN) was proposed to solve this problem [<xref ref-type="bibr" rid="ref-12">12</xref>] and improves both speech quality and intelligibility.</p>
<p>A recent study [<xref ref-type="bibr" rid="ref-12">12</xref>] indicates that masking is advantageous at higher SNRs while mapping is more advantageous at lower SNRs. We therefore combine these two types of speech enhancement, denoted the MM-based method. The MM method maps the LPS [<xref ref-type="bibr" rid="ref-21">21</xref>] to the IRM matrix of consecutive frames, not just the IRM of the current frame. The RDN [<xref ref-type="bibr" rid="ref-22">22</xref>] makes full use of the features of all layers through local dense connections. Also, using global residual learning, the RDN combines shallow and deep features to obtain global dense features from the LPS, thereby improving the accuracy of mask estimation. The proposed MM-RDN speech enhancement outperforms the mapping-based CRN [<xref ref-type="bibr" rid="ref-23">23</xref>], which had reached state-of-the-art (SOTA) performance in speech enhancement.</p>
<p>The remainder of this paper is organized as follows. Section 2 describes the architecture and implementation of the proposed method in detail. Simulation results and analysis are presented in Section 3. Finally, conclusions are drawn in Section 4.</p>
</sec>
<sec id="s2"><label>2</label><title>Method Description</title>
<p>The proposed MM-RDN based speech enhancement system is illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. In training, the LPS of consecutive frames is extracted and treated as the input feature of the RDN. The IRMs of the corresponding frames compose the two-dimensional IRM matrix used as the training target, and the RDN is trained to establish the relationship between the LPS and the IRM matrix. In testing, the RDN outputs the estimated ratio mask (ERM) matrix, which is used together with the original noisy speech to reconstruct the clean speech.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>The block diagram of proposed algorithm</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-1.png"/></fig>
<p>The noisy speech signal is formulated as:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <italic>x</italic>(<italic>n</italic>), s(<italic>n</italic>) and <italic>v</italic>(<italic>n</italic>) denote noisy speech, clean speech and additive noise respectively. <italic>n</italic> represents the time index.</p>
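<p>As an illustrative sketch (not from the paper), the additive model of Eq. (1) takes a few lines of NumPy; the sampling rate, tone frequency and noise level below are arbitrary stand-ins:</p>

```python
import numpy as np

# Eq. (1): x(n) = s(n) + v(n), with stand-in signals.
fs = 16000
n = np.arange(fs)                               # one second of samples
s = 0.5 * np.sin(2 * np.pi * 440 * n / fs)      # "clean speech" stand-in: a 440 Hz tone
rng = np.random.default_rng(0)
v = 0.05 * rng.standard_normal(n.shape)         # additive noise v(n)
x = s + v                                       # noisy observation x(n)
```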
<p>After framing and windowing, the short-time Fourier transform (STFT) of signal can be written as:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>j</mml:mi><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mi>m</mml:mi><mml:mi>f</mml:mi></mml:mrow><mml:mi>M</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mspace width="1pt" /></mml:mrow><mml:mrow><mml:mspace width="1pt" /></mml:mrow><mml:mrow><mml:mspace width="1pt" /></mml:mrow><mml:mrow><mml:mspace width="1pt" /></mml:mrow><mml:mspace width="1em" /><mml:mi>f</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>M</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></disp-formula>where <italic>X</italic>(<italic>k, f</italic>) is the spectrum of <italic>k</italic>th frame temporal signal <italic>x</italic>(<italic>k</italic>, <italic>m</italic>). <italic>f</italic> is frequency bin index, and <italic>M</italic> is the length of STFT.</p>
<p>Logarithmic power spectrum is defined as:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>10</mml:mn><mml:mspace width="thickmathspace" /><mml:mrow><mml:msub><mml:mi>log</mml:mi><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:msup><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></disp-formula></p>
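<p>Eqs. (2) and (3) can be sketched in NumPy as follows; the Hann window and 50% frame overlap are illustrative assumptions, since the paper does not specify them:</p>

```python
import numpy as np

def stft(x, M=512, hop=256):
    """Frame, window and M-point DFT of each frame (Eq. (2)).
    Returns a (K, M) array of complex spectra X(k, f)."""
    K = 1 + (len(x) - M) // hop
    win = np.hanning(M)                       # window choice is an assumption
    frames = np.stack([x[k * hop:k * hop + M] * win for k in range(K)])
    return np.fft.fft(frames, n=M, axis=1)

def lps(X, eps=1e-12):
    """Logarithmic power spectrum (Eq. (3)); eps avoids log(0) in silent bins."""
    return 10.0 * np.log10(np.abs(X) ** 2 + eps)
```

<p>For a real-valued signal the spectrum is conjugate-symmetric, which is why only the first <italic>M</italic>/2 bins need to be kept in Eq. (4).</p>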
<p>According to the symmetry of the STFT, the first <italic>M</italic>/2 frequency bins of the logarithmic power spectra of <italic>M</italic>/2 consecutive frames are spliced together to obtain a two-dimensional LPS matrix <italic>C</italic>(<italic>l</italic>), which is defined as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable columnalign="left left left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mrow><mml:mo>,</mml:mo></mml:mrow><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd><mml:mtd><mml:mo>&#x22F1;</mml:mo></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
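<p>A sketch of how the (<italic>M</italic>/2) &#x00D7; (<italic>M</italic>/2) blocks <italic>C</italic>(<italic>l</italic>) of Eq. (4) could be cut from an LPS array of shape frames &#x00D7; bins; the row ordering (highest frequency bin on the top row, matching the matrix as printed) is a layout convention:</p>

```python
import numpy as np

def lps_blocks(Xs, M):
    """Cut the LPS Xs (frames x bins) into square blocks C(l) as in Eq. (4):
    block l holds frames M/2*l .. M/2*l + M/2 - 1 as columns and the first
    M/2 frequency bins as rows, highest bin on the top row."""
    half = M // 2
    L = Xs.shape[0] // half
    blocks = []
    for l in range(L):
        block = Xs[l * half:(l + 1) * half, :half].T   # bins x frames
        blocks.append(block[::-1, :])                  # put bin M/2-1 on top
    return blocks
```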
<p>The training target is the IRM matrix, in which each IRM entry is calculated as:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>+</mml:mo><mml:mi>V</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mi>&#x03B2;</mml:mi></mml:msup></mml:mrow></mml:math></disp-formula>where <italic>S</italic>(<italic>k</italic>, <italic>f</italic>) represents the spectrum of the clean speech <italic>s</italic>(<italic>n</italic>) after preprocessing and STFT, and <italic>V</italic>(<italic>k</italic>, <italic>f</italic>) is the spectrum of the noise. The adjustable parameter &#x03B2; is 0.5.</p>
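<p>Eq. (5) in NumPy; the small epsilon guarding against division by zero in silent TF units is an implementation detail, not part of the paper's formula:</p>

```python
import numpy as np

def irm(S, V, beta=0.5, eps=1e-12):
    """Ideal ratio mask of Eq. (5) with beta = 0.5:
    IRM(k, f) = (|S(k, f)|^2 / (|S(k, f)|^2 + |V(k, f)|^2)) ** beta."""
    Ps = np.abs(S) ** 2       # speech power
    Pv = np.abs(V) ** 2       # noise power
    return (Ps / (Ps + Pv + eps)) ** beta
```

<p>With &#x03B2; = 0.5 the mask is the square root of the speech-to-mixture power ratio, so it lies in [0, 1]: close to 1 where speech dominates a TF unit and close to 0 where noise dominates.</p>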
<p>The IRM matrix <italic>R</italic>(<italic>l</italic>) corresponding to <italic>C</italic>(<italic>l</italic>) is computed as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mi>R</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>l</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>=</mml:mo></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable columnalign="left left left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mrow><mml:mo>,</mml:mo></mml:mrow><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo>,</mml:mo></mml:mrow><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" 
scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd><mml:mtd><mml:mo>&#x22F1;</mml:mo></mml:mtd><mml:mtd><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mspace width="1em" /><mml:mo>&#x22EE;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle 
displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd><mml:mtd><mml:mo>&#x22EF;</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x22C5;</mml:mo><mml:mi>l</mml:mi><mml:mo>+</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mi>M</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<sec id="s2_1"><label>2.1</label><title>Masking Mapping</title>
<p>The masking-based method uses multi-frame features to predict the mask of a certain frame [<xref ref-type="bibr" rid="ref-15">15</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>], as shown in <xref ref-type="fig" rid="fig-2">Fig. 2a</xref>. These methods fall into two categories, causal and non-causal, which use different frames to estimate the mask (as indicated by the blue dashed box). Generally speaking, causal speech enhancement is closer to actual application scenarios. The mapping-based method realizes spectrum-to-spectrum mapping [<xref ref-type="bibr" rid="ref-23">23</xref>,<xref ref-type="bibr" rid="ref-25">25</xref>], as shown in <xref ref-type="fig" rid="fig-2">Fig. 2b</xref>: the noisy spectrum is mapped directly to its corresponding clean spectrum.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>The training target of different method</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-2.png"/></fig>
<p>However, both methods have their own shortcomings. First, the masking method ignores the spectral correlation between consecutive frames and cannot make full use of two-dimensional convolution kernels. Second, although spectrum mapping utilizes the two-dimensional information of the spectrogram, the comparison of <xref ref-type="fig" rid="fig-2">Figs. 2a</xref> and <xref ref-type="fig" rid="fig-2">2b</xref> shows that the mask provides richer information than the spectrogram.</p>
<p>Based on the above analysis, MM is proposed to estimate the mask matrix of multiple frames from the LPS, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2c</xref>. MM differs from spectrum mapping in that the training target is no longer the spectrum of clean speech but the IRM matrix, and it differs from the masking method in that MM estimates the IRM matrix rather than the IRM of a single frame.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>Mask-mapping Based on RDN</title>
<p>The structure of the proposed MM-RDN is shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The network contains down-sampling, a dense feature extraction module stacked from residual dense blocks (RDBs), and up-sampling. In <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, <italic>k</italic> denotes the kernel size of the convolution (Conv) and deconvolution (Deconv) layers, <italic>o</italic> the number of convolution kernels, and <italic>s</italic> the convolution stride. The down-sampling stage extracts local and structural features and reduces the size of the feature maps through strided convolution, which significantly reduces the computation cost and parameter load while increasing the receptive field. Each Conv layer is followed by batch normalization (BN), Dropout [<xref ref-type="bibr" rid="ref-26">26</xref>] and ReLU. Distinguishable features are then extracted by a stack of six RDBs. The up-sampling stage restores the feature maps through convolutional layers with a stride of 1/2. Skip connections between down-sampling and up-sampling combine local and global features and avoid vanishing gradients.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>The structure of MM-RDN</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-3.png"/></fig>
<p>The following sub-section describes the residual block and dense block of RDB in detail.</p>
</sec>
<sec id="s2_3"><label>2.3</label><title>Residual Block and Dense Block</title>
<p>The structure of the residual block is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>. The skip connection alleviates gradient vanishing and network degradation, and copes well with the problems caused by network deepening.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>The structure of residual block</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-4.png"/></fig>
<p>DenseNet [<xref ref-type="bibr" rid="ref-27">27</xref>] is composed of several dense blocks (DB), as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. In a DB, there is a skip connection between any two layers, and the input of each layer is the concatenation of the outputs of all preceding layers, which means the features learned in a given layer serve as input for all subsequent layers. DenseNet not only alleviates gradient vanishing but also enables the reuse of features extracted in the hidden layers.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>The structure of DB</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-5.png"/></fig>
<p>RDB [<xref ref-type="bibr" rid="ref-22">22</xref>] combines the residual block and the dense block, as shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. An RDB not only receives the state of the previous RDB but also makes full use of the features of all its layers through local dense connections. The contiguous memory (CM) mechanism, formed by dense connections, local feature fusion (LFF) and local residual learning (LRL), ensures that the output of the previous RDB is passed to each layer of the current RDB.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>The structure of RDB</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-6.png"/></fig>
<p>For an RDB, let <italic>F<sub>in</sub></italic> denote the input; the output of the <italic>c</italic>th convolutional layer of the RDB, <italic>F<sub>c</sub></italic>, can then be expressed as:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>&#x03C3;</italic> denotes the ReLU [<xref ref-type="bibr" rid="ref-28">28</xref>] activation function, <italic>W<sub>c</sub></italic> is the weight of the <italic>c</italic>th Conv layer, and [&#x22C5;] denotes the concatenation of the inputs.</p>
<p>The input of the last convolutional layer is the concatenation of the local features of all convolutional layers, yielding the local feature fusion <italic>F<sub>LF</sub></italic>, which is formulated as
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>F</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mi>C</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>where <italic>H<sub>LFF</sub></italic> denotes the 1&#x2009;&#x00D7;&#x2009;1 convolutional layer and <italic>C</italic> is the number of convolutional layers.</p>
<p>The RDB output is obtained by adding the input to the fused local features, which realizes local residual learning:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>L</mml:mi><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula></p>
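The structure of Eqs. (7)&#x2013;(9) can be sketched as follows, with flat vectors and fixed random matrices standing in for feature maps and learned convolutions. This is an illustrative sketch of the dense connections, LFF and LRL, not the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def rdb(f_in, C=3, channels=8):
    """Residual dense block sketch (vectors stand in for feature maps).

    Eq. (7): F_c  = relu(W_c [F_in, F_1, ..., F_{c-1}])  -- dense connections
    Eq. (8): F_LF = H_LFF([F_in, F_1, ..., F_C])         -- 1x1 local fusion
    Eq. (9): F_out = F_in + F_LF                         -- local residual
    """
    feats = [f_in]
    for _ in range(C):
        concat = np.concatenate(feats)                 # [F_in, F_1, ..., F_{c-1}]
        W_c = rng.standard_normal((channels, concat.size)) * 0.1
        feats.append(relu(W_c @ concat))               # Eq. (7)
    concat = np.concatenate(feats)
    H_lff = rng.standard_normal((channels, concat.size)) * 0.1
    f_lf = H_lff @ concat                              # Eq. (8)
    return f_in + f_lf                                 # Eq. (9)

f_in = rng.standard_normal(8)
f_out = rdb(f_in)
assert f_out.shape == f_in.shape  # output keeps the input size, as Eq. (9) requires
```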
</sec>
</sec>
<sec id="s3"><label>3</label><title>Simulation Setup and Result Analysis</title>
<sec id="s3_1"><label>3.1</label><title>Simulation Setup</title>
<p>To evaluate the proposed algorithm, clean speech signals are taken from the CHAINS corpus [<xref ref-type="bibr" rid="ref-29">29</xref>]. The dataset consists of recordings of 36 speakers. Four fable sentences read by 9 males and 9 females are used for training, while 33 sentences from the TIMIT corpus read by 3 males and 3 females are used for testing. The training speakers differ from the testing speakers. Four types of noise (babble, factory, pink, white) from the NOISEX-92 database [<xref ref-type="bibr" rid="ref-30">30</xref>] are added to the above utterances at four SNRs, i.e., &#x2212;5, 0, 5 and 10 dB. In addition, three untrained noise types (buccaneer2, leopard, f16) at SNRs of &#x2212;5, 0, 5 and 10 dB are used to test the generalization of the algorithms. Moreover, untrained SNRs of &#x2212;7.5, &#x2212;2.5, 2.5, 7.5 and 12.5 dB with untrained noises are also added to the testing dataset. The sampling rate is 16&#x2005;kHz.</p>
<p>To obtain the spectrum, the frame length is 256 samples with an overlap of 192 samples. After Hamming windowing, a 256-point STFT is performed on each frame. As described above, the dimension of the LPS input is 128&#x2009;&#x00D7;&#x2009;128, representing the log-power spectra of 128 consecutive frames. The proposed MM-RDN uses 2 down-sampling blocks and 2 up-sampling blocks, and the RDN has 6 RDBs with 3 skip connections, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. The Dropout probability is set to 0.5 to improve generalization. The Adam optimizer [<xref ref-type="bibr" rid="ref-31">31</xref>] trains the network with a learning rate of 0.0002 under the mean square error (MSE) criterion, with the momentum decay hyper-parameters set to 0.9 and 0.999, respectively. The model was trained for 10 epochs.</p>
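Under the stated framing parameters, the LPS computation can be sketched as below (a minimal numpy sketch; function and variable names are illustrative):

```python
import numpy as np

def log_power_spectrum(signal, frame_len=256, overlap=192, n_fft=256):
    """Frame with a Hamming window and compute the log-power spectrum (LPS)."""
    hop = frame_len - overlap                       # 64-sample hop
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=1)     # 129 bins for n_fft = 256
    return np.log(np.abs(spec) ** 2 + 1e-12)

x = np.random.randn(16000)                          # 1 s of audio at 16 kHz
lps = log_power_spectrum(x)                         # (frames, 129)
# Keeping 128 of the 129 bins and grouping 128 consecutive frames yields
# the 128 x 128 input blocks described above.
```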
<p>In the simulation, we first discuss the effect of frame length on the performance of the proposed algorithm. The frame length is set to 128, 64 and 32, respectively, and the corresponding models are denoted as MM-RDN128, MM-RDN64 and MM-RDN32.</p>
<p>To evaluate the quality of the enhanced speech, the source-to-distortion ratio (SDR) [<xref ref-type="bibr" rid="ref-32">32</xref>], PESQ [<xref ref-type="bibr" rid="ref-33">33</xref>] (from &#x2212;0.5 to 4.5), the mean opinion score (MOS) prediction of the intrusiveness of background noise (CBAK) (from 1 to 5), extended short-time objective intelligibility (ESTOI) [<xref ref-type="bibr" rid="ref-34">34</xref>] (from 0 to 1) and the MOS prediction of the overall effect (COVL) (from 1 to 5) [<xref ref-type="bibr" rid="ref-35">35</xref>] are selected. SDR estimates the overall distortion of the signal. PESQ and ESTOI evaluate speech perceptual quality and intelligibility, respectively. CBAK and COVL are composite indicators related to subjective evaluation. In Section 3.3, the CRN [<xref ref-type="bibr" rid="ref-21">21</xref>] method, which has achieved state-of-the-art performance in speech enhancement, is compared with the proposed MM-RDN using the most appropriate window length in both matched and unmatched environments.</p>
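For intuition, a simplified SDR can be sketched as follows. Note that the SDR cited in the text [32] comes from the BSS_EVAL framework, which additionally decomposes the estimation error into interference and artifact components; this sketch keeps only the energy-ratio core:

```python
import numpy as np

def sdr_simple(reference, estimate):
    """Simplified SDR: reference energy over residual-error energy, in dB.
    (The full BSS_EVAL SDR further decomposes the error terms.)"""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

t = np.arange(16000) / 16000.0
s = np.sin(2 * np.pi * 440 * t)                # clean 440 Hz tone
noisy = s + 0.1 * np.random.randn(16000)       # noisy observation
# An estimate closer to the clean signal scores a higher SDR.
assert sdr_simple(s, noisy) > sdr_simple(s, np.zeros_like(s))
```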
</sec>
<sec id="s3_2"><label>3.2</label><title>Effect of Frame Length on Performance of MM-RDN</title>
<p>In this section, the performance of MM-RDN with different frame lengths is compared in the matched noisy environment, and the results are shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>. For the unmatched environment, the testing dataset differs from the training dataset in noise type or SNR. <xref ref-type="table" rid="table-2">Tab. 2</xref> gives the comparison results for MM-RDN with different frame lengths, in which only the noise types differ between testing and training. <xref ref-type="table" rid="table-3">Tab. 3</xref> presents the results on trained noise types and untrained SNRs. Specifically, the SNRs in training are &#x2212;5, 0, 5 and 10 dB, while the SNRs in testing are &#x2212;7.5, &#x2212;2.5, 2.5, 7.5 and 12.5 dB; here, the noise types of the testing dataset are the same as those of the training dataset. <xref ref-type="table" rid="table-4">Tab. 4</xref> displays the results of MM-RDN with different frame lengths on untrained noise and untrained SNRs.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Metrics of noisy and enhanced speech in matched environments for different frame lengths</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="3">Noisy</th>
<th align="center" colspan="3">MM-RDN32</th>
<th align="center" colspan="3">MM-RDN64</th>
<th align="center" colspan="3">MM-RDN128</th>
</tr>
<tr>
<th align="left">SNR(dB)</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;5</td>
<td align="left">&#x2212;5.098</td>
<td align="left">1.035</td>
<td align="left">0.266</td>
<td align="left">2.061</td>
<td align="left">1.082</td>
<td align="left">0.378</td>
<td align="left">2.510</td>
<td align="left">1.104</td>
<td align="left">0.399</td>
<td align="left"><bold>3.210</bold></td>
<td align="left"><bold>1.121</bold></td>
<td align="left"><bold>0.410</bold></td>
</tr>
<tr>
<td align="left">0</td>
<td align="left">&#x2212;0.226</td>
<td align="left">1.043</td>
<td align="left">0.414</td>
<td align="left">6.617</td>
<td align="left">1.210</td>
<td align="left">0.558</td>
<td align="left">6.848</td>
<td align="left">1.252</td>
<td align="left">0.580</td>
<td align="left"><bold>7.239</bold></td>
<td align="left"><bold>1.290</bold></td>
<td align="left"><bold>0.589</bold></td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">4.732</td>
<td align="left">1.086</td>
<td align="left">0.573</td>
<td align="left">10.637</td>
<td align="left">1.460</td>
<td align="left">0.716</td>
<td align="left">10.761</td>
<td align="left">1.529</td>
<td align="left">0.733</td>
<td align="left"><bold>10.985</bold></td>
<td align="left"><bold>1.592</bold></td>
<td align="left"><bold>0.739</bold></td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">9.719</td>
<td align="left">1.202</td>
<td align="left">0.722</td>
<td align="left">14.392</td>
<td align="left">1.863</td>
<td align="left">0.830</td>
<td align="left">14.562</td>
<td align="left">1.954</td>
<td align="left">0.841</td>
<td align="left"><bold>14.671</bold></td>
<td align="left"><bold>2.050</bold></td>
<td align="left"><bold>0.844</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Metrics of noisy and enhanced speech on unseen noise type for different window lengths</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="3">Noisy</th>
<th align="center" colspan="3">MM-RDN32</th>
<th align="center" colspan="3">MM-RDN64</th>
<th align="center" colspan="3">MM-RDN128</th>
</tr>
<tr>
<th align="left">SNR(dB)</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;5</td>
<td align="left">&#x2212;4.178</td>
<td align="left">1.043</td>
<td align="left">0.334</td>
<td align="left">&#x2212;1.375</td>
<td align="left">1.066</td>
<td align="left">0.377</td>
<td align="left">&#x2212;1.220</td>
<td align="left">1.068</td>
<td align="left">0.379</td>
<td align="left"><bold>&#x2212;0.478</bold></td>
<td align="left"><bold>1.076</bold></td>
<td align="left"><bold>0.387</bold></td>
</tr>
<tr>
<td align="left">0</td>
<td align="left">0.722</td>
<td align="left">1.075</td>
<td align="left">0.467</td>
<td align="left">4.169</td>
<td align="left">1.135</td>
<td align="left">0.518</td>
<td align="left">4.273</td>
<td align="left">1.137</td>
<td align="left">0.523</td>
<td align="left"><bold>5.137</bold></td>
<td align="left"><bold>1.161</bold></td>
<td align="left"><bold>0.533</bold></td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">5.691</td>
<td align="left">1.157</td>
<td align="left">0.608</td>
<td align="left">9.281</td>
<td align="left">1.311</td>
<td align="left">0.660</td>
<td align="left">9.372</td>
<td align="left">1.321</td>
<td align="left">0.670</td>
<td align="left"><bold>9.972</bold></td>
<td align="left"><bold>1.381</bold></td>
<td align="left"><bold>0.679</bold></td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">10.681</td>
<td align="left">1.335</td>
<td align="left">0.740</td>
<td align="left">13.780</td>
<td align="left">1.657</td>
<td align="left">0.786</td>
<td align="left">13.936</td>
<td align="left">1.699</td>
<td align="left">0.796</td>
<td align="left"><bold>14.205</bold></td>
<td align="left"><bold>1.791</bold></td>
<td align="left"><bold>0.801</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Metrics of noisy and enhanced speech on unseen SNR for different window lengths</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="3">Noisy</th>
<th align="center" colspan="3">MM-RDN32</th>
<th align="center" colspan="3">MM-RDN64</th>
<th align="center" colspan="3">MM-RDN128</th>
</tr>
<tr>
<th align="left">SNR(dB)</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;7.5</td>
<td align="left">&#x2212;7.457</td>
<td align="left">1.048</td>
<td align="left">0.203</td>
<td align="left">&#x2212;0.524</td>
<td align="left">1.054</td>
<td align="left">0.294</td>
<td align="left">0.053</td>
<td align="left">1.066</td>
<td align="left">0.309</td>
<td align="left"><bold>0.971</bold></td>
<td align="left"><bold>1.077</bold></td>
<td align="left"><bold>0.322</bold></td>
</tr>
<tr>
<td align="left">&#x2212;2.5</td>
<td align="left">&#x2212;2.680</td>
<td align="left">1.039</td>
<td align="left">0.337</td>
<td align="left">4.433</td>
<td align="left">1.133</td>
<td align="left">0.469</td>
<td align="left">4.758</td>
<td align="left">1.164</td>
<td align="left">0.490</td>
<td align="left"><bold>5.281</bold></td>
<td align="left"><bold>1.191</bold></td>
<td align="left"><bold>0.502</bold></td>
</tr>
<tr>
<td align="left">2.5</td>
<td align="left">2.247</td>
<td align="left">1.059</td>
<td align="left">0.493</td>
<td align="left">8.676</td>
<td align="left">1.317</td>
<td align="left">0.642</td>
<td align="left">8.828</td>
<td align="left">1.372</td>
<td align="left">0.662</td>
<td align="left"><bold>9.128</bold></td>
<td align="left"><bold>1.422</bold></td>
<td align="left"><bold>0.670</bold></td>
</tr>
<tr>
<td align="left">7.5</td>
<td align="left">7.224</td>
<td align="left">1.131</td>
<td align="left">0.650</td>
<td align="left">12.540</td>
<td align="left">1.643</td>
<td align="left">0.779</td>
<td align="left">12.661</td>
<td align="left">1.722</td>
<td align="left">0.793</td>
<td align="left"><bold>12.833</bold></td>
<td align="left"><bold>1.806</bold></td>
<td align="left"><bold>0.798</bold></td>
</tr>
<tr>
<td align="left">12.5</td>
<td align="left">12.217</td>
<td align="left">1.308</td>
<td align="left">0.785</td>
<td align="left">16.180</td>
<td align="left">2.118</td>
<td align="left">0.870</td>
<td align="left">16.448</td>
<td align="left">2.218</td>
<td align="left">0.879</td>
<td align="left"><bold>16.505</bold></td>
<td align="left"><bold>2.317</bold></td>
<td align="left"><bold>0.880</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Metrics of noisy and enhanced speech on unseen noise type and unseen SNR</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="3">Noisy</th>
<th align="center" colspan="3">MM-RDN32</th>
<th align="center" colspan="3">MM-RDN64</th>
<th align="center" colspan="3">MM-RDN128</th>
</tr>
<tr>
<th align="left">SNR(dB)</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
<th align="left">SDR</th>
<th align="left">PESQ</th>
<th align="left">ESTOI</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;7.5</td>
<td align="left">&#x2212;6.568</td>
<td align="left">1.038</td>
<td align="left">0.274</td>
<td align="left">&#x2212;4.268</td>
<td align="left">1.052</td>
<td align="left">0.312</td>
<td align="left">&#x2212;4.123</td>
<td align="left">1.054</td>
<td align="left">0.313</td>
<td align="left"><bold>&#x2212;3.535</bold></td>
<td align="left"><bold>1.060</bold></td>
<td align="left"><bold>0.318</bold></td>
</tr>
<tr>
<td align="left">&#x2212;2.5</td>
<td align="left">&#x2212;1.742</td>
<td align="left">1.055</td>
<td align="left">0.399</td>
<td align="left">1.448</td>
<td align="left">1.091</td>
<td align="left">0.447</td>
<td align="left">1.565</td>
<td align="left">1.093</td>
<td align="left">0.449</td>
<td align="left"><bold>2.418</bold></td>
<td align="left"><bold>1.107</bold></td>
<td align="left"><bold>0.458</bold></td>
</tr>
<tr>
<td align="left">2.5</td>
<td align="left">3.202</td>
<td align="left">1.107</td>
<td align="left">0.573</td>
<td align="left">6.786</td>
<td align="left">1.205</td>
<td align="left">0.589</td>
<td align="left">6.888</td>
<td align="left">1.210</td>
<td align="left">0.597</td>
<td align="left"><bold>7.643</bold></td>
<td align="left"><bold>1.249</bold></td>
<td align="left"><bold>0.607</bold></td>
</tr>
<tr>
<td align="left">7.5</td>
<td align="left">8.184</td>
<td align="left">1.231</td>
<td align="left">0.676</td>
<td align="left">11.613</td>
<td align="left">1.459</td>
<td align="left">0.727</td>
<td align="left">11.719</td>
<td align="left">1.482</td>
<td align="left">0.737</td>
<td align="left"><bold>12.142</bold></td>
<td align="left"><bold>1.561</bold></td>
<td align="left"><bold>0.745</bold></td>
</tr>
<tr>
<td align="left">12.5</td>
<td align="left">13.179</td>
<td align="left">1.472</td>
<td align="left">0.797</td>
<td align="left">15.778</td>
<td align="left">1.901</td>
<td align="left">0.837</td>
<td align="left">16.039</td>
<td align="left">1.962</td>
<td align="left">0.845</td>
<td align="left"><bold>16.150</bold></td>
<td align="left"><bold>2.065</bold></td>
<td align="left"><bold>0.848</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To display the performance of the algorithms on each metric more clearly, we consolidate the data of <xref ref-type="table" rid="table-1 table-2 table-3 table-4">Tabs. 1&#x2013;4</xref> and draw line charts. For matched noise types, the performance data of the three models are combined from <xref ref-type="table" rid="table-1">Tabs. 1</xref> and <xref ref-type="table" rid="table-3">3</xref>, and the comparative incremental results for each metric are shown in <xref ref-type="fig" rid="fig-7 fig-8 fig-9">Figs. 7&#x2013;9</xref>. For unseen noise types, the data are combined from <xref ref-type="table" rid="table-2">Tabs. 2</xref> and <xref ref-type="table" rid="table-4">4</xref>, and the comparative incremental results are shown in <xref ref-type="fig" rid="fig-10 fig-11 fig-12">Figs. 10&#x2013;12</xref>.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Comparison of SDR for MM-RDN on matched noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-7.png"/></fig>
<fig id="fig-8"><label>Figure 8</label><caption><title>Comparison of PESQ for MM-RDN on matched noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-8.png"/></fig>
<fig id="fig-9"><label>Figure 9</label><caption><title>Comparison of ESTOI for MM-RDN on matched noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-9.png"/></fig>
<fig id="fig-10"><label>Figure 10</label><caption><title>Comparison of SDR for MM-RDN on unseen noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-10.png"/></fig>
<fig id="fig-11"><label>Figure 11</label><caption><title>Comparison of PESQ for MM-RDN on unseen noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-11.png"/></fig>
<fig id="fig-12"><label>Figure 12</label><caption><title>Comparison of ESTOI for MM-RDN on unseen noise with different window lengths</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-12.png"/></fig>
<p>It can be seen that the evaluation metrics of MM-RDN at all SNRs are better than those of noisy speech, which means MM-RDN can effectively improve quality and intelligibility. In addition, MM-RDN128 obtains the best results, which indicates that a longer frame length yields better performance. Since the proposed algorithm uses a convolutional network to extract high-level features of the LPS, the longer the frame length, the better the convolution operation can capture the long-term and short-term correlations of speech features. However, the frame length cannot be increased indefinitely: owing to the short-time stationarity of speech, an overly long frame destroys the correlation between the LPS of adjacent frames, and the convolution operation can no longer extract accurate high-level features. From <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, the incremental trend of SDR is consistent. When the frame length is 128 or 64, the SDR increment relative to the original noisy speech gradually decreases as the SNR increases, which shows that increasing the frame length improves the SDR more effectively in low-SNR environments. However, when the frame length is 32, the corresponding speech duration is too short to provide enough information for network learning, which limits the performance of MM-RDN. Likewise, <xref ref-type="fig" rid="fig-8">Fig. 8</xref> shows that MM-RDN can effectively improve the PESQ of enhanced speech. As the SNR increases, the PESQ increment also increases, indicating that MM-RDN improves speech quality better at high SNR. Moreover, the longer the frame length, the more obvious the PESQ improvement, which indicates that the frame length affects the algorithm&#x2019;s gain in speech quality. For the ESTOI in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, a long frame length improves speech intelligibility. In general, under matched noise types, MM-RDN still shows a certain generalization to the noise SNR.</p>
<p>As shown in <xref ref-type="fig" rid="fig-10 fig-11 fig-12">Figs. 10&#x2013;12</xref>, the proposed algorithm can still effectively improve the perceptual quality of speech in unseen noise environments. Comparing <xref ref-type="fig" rid="fig-8 fig-9 fig-11">Figs. 8, 9, 11</xref> and <xref ref-type="fig" rid="fig-12">12</xref> shows that the trends for PESQ and ESTOI are consistent with the matched-noise case. Compared with the results of <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, when the noise is unseen (<xref ref-type="fig" rid="fig-10">Fig. 10</xref>), the SDR increment of the enhanced speech does not decrease monotonically but first rises and then falls, and the overall increments are lower than in the matched-noise case. These results indicate that it is more difficult to improve the SDR in a low-SNR unseen acoustic environment, i.e., the algorithm has limited generalization to unseen environments. At the same time, increasing the frame length from 64 to 128 improves the SDR more obviously than increasing it from 32 to 64, which means a longer frame length can compensate for the performance loss in unseen acoustic environments. MM-RDN128, with the longest frame length, maintains the best performance, which indicates that the performance is related to the frame length. MM-RDN effectively improves the SDR, PESQ and ESTOI of noisy speech, and the longer the frame length, the more obvious the improvements; moreover, the performance gap between different frame lengths also increases. The results indicate that MM-RDN has a certain generalization to noise type.</p>
<p>In general, the following conclusions can be drawn from the above simulations: (1) MM-RDN can effectively improve speech quality and intelligibility in different environments. (2) Increasing the frame length has a positive effect on the performance of the proposed algorithm. (3) Increasing the frame length improves the SDR and ESTOI more significantly at low SNR, and the PESQ at high SNR.</p>
<p>Therefore, 128 is selected as the frame length of MM-RDN in the algorithm comparison in the following section, where MM-RDN128 is simply denoted as MM-RDN.</p>
</sec>
<sec id="s3_3"><label>3.3</label><title>Simulation of MM-RDN and Other Model Results and Analysis</title>
<p>First, the performance of CRN and MM-RDN is compared in the matched noisy environment, and the results are shown in <xref ref-type="table" rid="table-5">Tab. 5</xref>. Here, noisy speech means unprocessed speech. For the unmatched environment, the testing dataset differs from the training dataset in noise type or SNR. <xref ref-type="table" rid="table-6">Tab. 6</xref> gives the comparison results for MM-RDN and CRN, in which only the noise types differ between testing and training. <xref ref-type="table" rid="table-7">Tab. 7</xref> presents the results on trained noise types and untrained SNRs. Specifically, the SNRs in training are &#x2212;5, 0, 5 and 10&#x2005;dB, while the SNRs in testing are &#x2212;7.5, &#x2212;2.5, 2.5, 7.5 and 12.5&#x2005;dB, with the same noise types. <xref ref-type="table" rid="table-8">Tab. 8</xref> displays the results on untrained noise and untrained SNRs.</p>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Metrics of noisy and enhanced speech in matched environments for different algorithm</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="4">Noisy</th>
<th align="center" colspan="4">CRN</th>
<th align="center" colspan="4">MM-RDN</th>
</tr>
<tr>
<th align="left">SNR(dB)</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;5</td>
<td align="left">1.035</td>
<td align="left">1.185</td>
<td align="left">0.266</td>
<td align="left">1.089</td>
<td align="left">1.086</td>
<td align="left">1.518</td>
<td align="left">0.367</td>
<td align="left">1.264</td>
<td align="left"><bold>1.121</bold></td>
<td align="left"><bold>1.579</bold></td>
<td align="left"><bold>0.410</bold></td>
<td align="left"><bold>1.365</bold></td>
</tr>
<tr>
<td align="left">0</td>
<td align="left">1.043</td>
<td align="left">1.461</td>
<td align="left">0.414</td>
<td align="left">1.187</td>
<td align="left">1.212</td>
<td align="left">1.926</td>
<td align="left">0.524</td>
<td align="left">1.596</td>
<td align="left"><bold>1.290</bold></td>
<td align="left"><bold>2.007</bold></td>
<td align="left"><bold>0.589</bold></td>
<td align="left"><bold>1.718</bold></td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">1.086</td>
<td align="left">1.825</td>
<td align="left">0.573</td>
<td align="left">1.383</td>
<td align="left">1.457</td>
<td align="left">2.371</td>
<td align="left">0.689</td>
<td align="left">2.016</td>
<td align="left"><bold>1.592</bold></td>
<td align="left"><bold>2.471</bold></td>
<td align="left"><bold>0.739</bold></td>
<td align="left"><bold>2.152</bold></td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">1.202</td>
<td align="left">2.243</td>
<td align="left">0.722</td>
<td align="left">1.707</td>
<td align="left">1.850</td>
<td align="left">2.862</td>
<td align="left">0.811</td>
<td align="left">2.513</td>
<td align="left"><bold>2.050</bold></td>
<td align="left"><bold>2.983</bold></td>
<td align="left"><bold>0.844</bold></td>
<td align="left"><bold>2.683</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-6"><label>Table 6</label><caption><title>Metrics of noisy and enhanced speech on unseen noise type</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="4">Noisy</th>
<th align="center" colspan="4">CRN</th>
<th align="center" colspan="4">MM-RDN</th>
</tr>
<tr>
<th align="left">SNR (dB)</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;5</td>
<td align="left">1.037</td>
<td align="left">1.197</td>
<td align="left">0.275</td>
<td align="left">1.048</td>
<td align="left">1.062</td>
<td align="left">1.329</td>
<td align="left">0.323</td>
<td align="left">1.043</td>
<td align="left"><bold>1.073</bold></td>
<td align="left"><bold>1.317</bold></td>
<td align="left"><bold>0.339</bold></td>
<td align="left"><bold>1.020</bold></td>
</tr>
<tr>
<td align="left">0</td>
<td align="left">1.051</td>
<td align="left">1.481</td>
<td align="left">0.416</td>
<td align="left">1.167</td>
<td align="left">1.129</td>
<td align="left">1.751</td>
<td align="left">0.473</td>
<td align="left">1.291</td>
<td align="left"><bold>1.143</bold></td>
<td align="left"><bold>1.757</bold></td>
<td align="left"><bold>0.494</bold></td>
<td align="left"><bold>1.278</bold></td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">1.091</td>
<td align="left">1.844</td>
<td align="left">0.566</td>
<td align="left">1.388</td>
<td align="left">1.284</td>
<td align="left">2.196</td>
<td align="left">0.625</td>
<td align="left">1.699</td>
<td align="left"><bold>1.332</bold></td>
<td align="left"><bold>2.234</bold></td>
<td align="left"><bold>0.648</bold></td>
<td align="left"><bold>1.736</bold></td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">1.203</td>
<td align="left">2.259</td>
<td align="left">0.709</td>
<td align="left">1.711</td>
<td align="left">1.591</td>
<td align="left">2.673</td>
<td align="left">0.759</td>
<td align="left">2.192</td>
<td align="left"><bold>1.701</bold></td>
<td align="left"><bold>2.745</bold></td>
<td align="left"><bold>0.780</bold></td>
<td align="left"><bold>2.281</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-7"><label>Table 7</label><caption><title>Metrics of noisy and enhanced speech on unseen SNR</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="4">Noisy</th>
<th align="center" colspan="4">CRN</th>
<th align="center" colspan="4">MM-RDN</th>
</tr>
<tr>
<th align="left">SNR (dB)</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;7.5</td>
<td align="left">1.048</td>
<td align="left">1.108</td>
<td align="left">0.203</td>
<td align="left">1.067</td>
<td align="left">1.058</td>
<td align="left">1.347</td>
<td align="left">0.266</td>
<td align="left">1.154</td>
<td align="left"><bold>1.077</bold></td>
<td align="left"><bold>1.388</bold></td>
<td align="left"><bold>0.322</bold></td>
<td align="left"><bold>1.232</bold></td>
</tr>
<tr>
<td align="left">&#x2212;2.5</td>
<td align="left">1.039</td>
<td align="left">1.308</td>
<td align="left">0.337</td>
<td align="left">1.130</td>
<td align="left">1.136</td>
<td align="left">1.714</td>
<td align="left">0.434</td>
<td align="left">1.415</td>
<td align="left"><bold>1.191</bold></td>
<td align="left"><bold>1.787</bold></td>
<td align="left"><bold>0.502</bold></td>
<td align="left"><bold>1.529</bold></td>
</tr>
<tr>
<td align="left">2.5</td>
<td align="left">1.059</td>
<td align="left">1.636</td>
<td align="left">0.493</td>
<td align="left">1.270</td>
<td align="left">1.371</td>
<td align="left">2.143</td>
<td align="left">0.611</td>
<td align="left">1.796</td>
<td align="left"><bold>1.422</bold></td>
<td align="left"><bold>2.234</bold></td>
<td align="left"><bold>0.670</bold></td>
<td align="left"><bold>1.924</bold></td>
</tr>
<tr>
<td align="left">7.5</td>
<td align="left">1.131</td>
<td align="left">2.026</td>
<td align="left">0.650</td>
<td align="left">1.531</td>
<td align="left">1.634</td>
<td align="left">2.610</td>
<td align="left">0.755</td>
<td align="left">2.254</td>
<td align="left"><bold>1.806</bold></td>
<td align="left"><bold>2.722</bold></td>
<td align="left"><bold>0.798</bold></td>
<td align="left"><bold>2.409</bold></td>
</tr>
<tr>
<td align="left">12.5</td>
<td align="left">1.308</td>
<td align="left">2.476</td>
<td align="left">0.785</td>
<td align="left">1.914</td>
<td align="left">2.095</td>
<td align="left">3.119</td>
<td align="left">0.855</td>
<td align="left">2.786</td>
<td align="left"><bold>2.317</bold></td>
<td align="left"><bold>3.250</bold></td>
<td align="left"><bold>0.880</bold></td>
<td align="left"><bold>2.968</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-8"><label>Table 8</label><caption><title>Metrics of noisy and enhanced speech on unseen noise type and unseen SNR</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Model</th>
<th align="center" colspan="4">Noisy</th>
<th align="center" colspan="4">CRN</th>
<th align="center" colspan="4">MM-RDN</th>
</tr>
<tr>
<th align="left">SNR (dB)</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
<th align="left">PESQ</th>
<th align="left">CBAK</th>
<th align="left">ESTOI</th>
<th align="left">COVL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">&#x2212;7.5</td>
<td align="left">1.035</td>
<td align="left">1.107</td>
<td align="left">0.214</td>
<td align="left">1.020</td>
<td align="left">1.051</td>
<td align="left">1.167</td>
<td align="left">0.254</td>
<td align="left">1.011</td>
<td align="left"><bold>1.061</bold></td>
<td align="left"><bold>1.151</bold></td>
<td align="left"><bold>0.266</bold></td>
<td align="left"><bold>1.002</bold></td>
</tr>
<tr>
<td align="left">&#x2212;2.5</td>
<td align="left">1.042</td>
<td align="left">1.322</td>
<td align="left">0.343</td>
<td align="left">1.096</td>
<td align="left">1.085</td>
<td align="left">1.534</td>
<td align="left">0.396</td>
<td align="left">1.137</td>
<td align="left"><bold>1.097</bold></td>
<td align="left"><bold>1.529</bold></td>
<td align="left"><bold>0.415</bold></td>
<td align="left"><bold>1.111</bold></td>
</tr>
<tr>
<td align="left">2.5</td>
<td align="left">1.065</td>
<td align="left">1.656</td>
<td align="left">0.490</td>
<td align="left">1.263</td>
<td align="left">1.189</td>
<td align="left">1.969</td>
<td align="left">0.549</td>
<td align="left">1.481</td>
<td align="left"><bold>1.218</bold></td>
<td align="left"><bold>1.993</bold></td>
<td align="left"><bold>0.573</bold></td>
<td align="left"><bold>1.492</bold></td>
</tr>
<tr>
<td align="left">7.5</td>
<td align="left">1.135</td>
<td align="left">2.044</td>
<td align="left">0.639</td>
<td align="left">1.536</td>
<td align="left">1.414</td>
<td align="left">2.429</td>
<td align="left">0.695</td>
<td align="left">1.933</td>
<td align="left"><bold>1.493</bold></td>
<td align="left"><bold>2.484</bold></td>
<td align="left"><bold>0.719</bold></td>
<td align="left"><bold>1.998</bold></td>
</tr>
<tr>
<td align="left">12.5</td>
<td align="left">1.302</td>
<td align="left">2.490</td>
<td align="left">0.772</td>
<td align="left">1.914</td>
<td align="left">1.811</td>
<td align="left">2.929</td>
<td align="left">0.815</td>
<td align="left">2.470</td>
<td align="left"><bold>1.961</bold></td>
<td align="left"><bold>3.020</bold></td>
<td align="left"><bold>0.833</bold></td>
<td align="left"><bold>2.590</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The data of <xref ref-type="table" rid="table-5 table-6 table-7 table-8">Tabs. 5&#x2013;8</xref> are consolidated and plotted as line charts. For the matched noise types, the results of <xref ref-type="table" rid="table-5">Tabs. 5</xref> and <xref ref-type="table" rid="table-7">7</xref> are combined, and the incremental improvement of each metric is shown in <xref ref-type="fig" rid="fig-13 fig-14 fig-15 fig-16">Figs. 13&#x2013;16</xref>. For the unseen noise types, the results of <xref ref-type="table" rid="table-6">Tabs. 6</xref> and <xref ref-type="table" rid="table-8">8</xref> are combined, and the incremental improvement of each metric is shown in <xref ref-type="fig" rid="fig-17 fig-18 fig-19 fig-20">Figs. 17&#x2013;20</xref>.</p>
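As an illustration of how the incremental results behind these line charts can be obtained (a minimal sketch, not the authors' code), the following computes the per-SNR PESQ gains of CRN and MM-RDN over unprocessed speech, using the matched-noise PESQ values copied from Tab. 5:

```python
# Sketch: per-SNR improvement of each algorithm over unprocessed (noisy) speech,
# using the matched-noise PESQ column of Tab. 5. Values are copied from the table;
# results are rounded to the table's three decimal places.

snrs = [-5, 0, 5, 10]  # test SNRs in dB
pesq = {
    "Noisy":  [1.035, 1.043, 1.086, 1.202],
    "CRN":    [1.086, 1.212, 1.457, 1.850],
    "MM-RDN": [1.121, 1.290, 1.592, 2.050],
}

def increments(enhanced, noisy):
    """Gain of an enhanced score over the corresponding noisy score."""
    return [round(e - n, 3) for e, n in zip(enhanced, noisy)]

crn_gain = increments(pesq["CRN"], pesq["Noisy"])      # CRN improvement per SNR
mmrdn_gain = increments(pesq["MM-RDN"], pesq["Noisy"])  # MM-RDN improvement per SNR
```

The same subtraction applied to the CBAK, ESTOI and COVL columns of Tabs. 5&#8211;8 yields the curves in Figs. 13&#8211;20.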
<fig id="fig-13"><label>Figure 13</label><caption><title>Comparison on PESQ between MM-RDN and CRN on matched noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-13.png"/></fig>
<fig id="fig-14"><label>Figure 14</label><caption><title>Comparison on CBAK between MM-RDN and CRN on matched noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-14.png"/></fig>
<fig id="fig-15"><label>Figure 15</label><caption><title>Comparison on ESTOI between MM-RDN and CRN on matched noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-15.png"/></fig>
<fig id="fig-16"><label>Figure 16</label><caption><title>Comparison on COVL between MM-RDN and CRN on matched noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-16.png"/></fig>
<fig id="fig-17"><label>Figure 17</label><caption><title>Comparison on PESQ between MM-RDN and CRN on unseen noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-17.png"/></fig>
<fig id="fig-18"><label>Figure 18</label><caption><title>Comparison on CBAK between MM-RDN and CRN on unseen noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-18.png"/></fig>
<fig id="fig-19"><label>Figure 19</label><caption><title>Comparison on ESTOI between MM-RDN and CRN on unseen noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-19.png"/></fig>
<fig id="fig-20"><label>Figure 20</label><caption><title>Comparison on COVL between MM-RDN and CRN on unseen noise</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-20.png"/></fig>
<p>It can be seen that MM-RDN achieves the best scores on all metrics at every SNR, which means it not only effectively reduces the noise but also improves the perceptual quality and intelligibility of the enhanced speech. Additionally, for MM-RDN, <xref ref-type="fig" rid="fig-13 fig-14">Figs. 13&#x2013;14</xref> show that the perceptual quality improves more markedly at high SNRs, and <xref ref-type="fig" rid="fig-15">Figs. 15</xref> and <xref ref-type="fig" rid="fig-16">16</xref> show that the intelligibility improves more markedly at low SNRs. As <xref ref-type="fig" rid="fig-13 fig-14 fig-15 fig-16">Figs. 13&#x2013;16</xref> indicate, the performance of MM-RDN is stable: with matched noise types, the metrics maintain a similar trend across SNRs, which demonstrates the robustness and generalization of the proposed algorithm with respect to SNR.</p>
<p>As shown in <xref ref-type="fig" rid="fig-17 fig-18 fig-19 fig-20">Figs. 17&#x2013;20</xref>, the performance metrics of MM-RDN are superior to those of CRN in both quality and intelligibility. At low SNRs, MM-RDN is slightly inferior to CRN in COVL, but it is significantly superior when the SNR is greater than zero. Moreover, the PESQ and COVL of MM-RDN increase faster than its CBAK and ESTOI, which means the perceptual quality of speech is significantly improved. The results show that MM-RDN generalizes to unseen noise types in terms of the overall effect.</p>
<p>Thus, the proposed method achieves the best results, which demonstrates its robustness and generalization to SNR and noise type. In addition to the performance improvement, it should be noted that the number of parameters in MM-RDN is about 27.99&#x0025; of that of CRN. It is well known that a smaller network helps avoid overfitting and preserve the generalization of a model. Compared with CRN, MM-RDN not only performs better but also reduces the computational cost and parameter load.</p>
<p>In terms of signal quality and intelligibility, MM-RDN is more robust and generalizable to noise type and SNR than CRN. This suggests that the proposed algorithm outperforms the masking-based and spectrum-mapping-based methods. <xref ref-type="fig" rid="fig-21">Figs. 21</xref> and <xref ref-type="fig" rid="fig-22">22</xref> show the waveform and amplitude spectrum of the noisy speech and of the speech enhanced by MM-RDN.</p>
<fig id="fig-21"><label>Figure 21</label><caption><title>Waveform of (a) noisy speech and (b) enhanced speech by MM-RDN</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-21.png"/></fig>
<fig id="fig-22"><label>Figure 22</label><caption><title>Amplitude spectrum of (a) noisy speech and (b) enhanced speech by MM-RDN</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_27379-fig-22.png"/></fig>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Conclusions</title>
<p>We introduce an MM-based speech enhancement method using an RDN and the LPS of speech. The proposed MM-RDN reduces the network parameter load and avoids over-fitting through densely connected layers, LFF and LRL. At the same time, the method makes full use of the two-dimensional inter-frame information of the LPS and the prior information of the IRM, thereby effectively improving the perceptual quality and intelligibility of the enhanced speech, while also generalizing to unseen noise. Although the algorithm exploits two-dimensional inter-frame information, it does not fully capture the temporal characteristics of speech. In the future, the network may be further improved by means of temporal convolution.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work is supported by the National Key Research and Development Program of China under Grant 2020YFC2004003 and Grant 2020YFC2004002, and the National Nature Science Foundation of China (NSFC) under Grant No. 61571106.</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zou</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Infrared and visible image fusion based on nsst and rdn</article-title>,&#x201D; <source>Intelligent Automation &#x0026; Soft Computing</source>, vol. <volume>28</volume>, no. <issue>1</issue>, pp. <fpage>213</fpage>&#x2013;<lpage>225</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. O.</given-names> <surname>El-Habbak</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Abdelalim</surname></string-name>, <string-name><given-names>H. N.</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>M. H.</given-names> <surname>Abd-Elaty</surname></string-name>, <string-name><given-names>A. M.</given-names> <surname>Hammouda</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Enhancing Parkinson&#x2019;s disease diagnosis accuracy through speech signal algorithm modeling</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>70</volume>, no. <issue>2</issue>, pp. <fpage>2953</fpage>&#x2013;<lpage>2969</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Jyoshna</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zia</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Koteswararao</surname></string-name></person-group>, &#x201C;<article-title>An efficient reference free adaptive learning process for speech enhancement applications</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>70</volume>, no. <issue>2</issue>, pp. <fpage>3067</fpage>&#x2013;<lpage>3080</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. X.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Boosting classification-based speech separation using temporal dynamics</article-title>,&#x201D; in <conf-name>13th Annual Conf. of the Int. Speech Communication Association 2012</conf-name>, <conf-loc>Portland, OR, USA,</conf-loc> pp. <fpage>1526</fpage>&#x2013;<lpage>1529</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X. G.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tsao</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Matsuda</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Hori</surname></string-name></person-group>, &#x201C;<article-title>Speech enhancement based on deep denoising autoencoder</article-title>,&#x201D; in <conf-name>14th Annual Conf. of the Int. Speech Communication Association 2013</conf-name>, <conf-loc>Lyon, France</conf-loc>, pp. <fpage>436</fpage>&#x2013;<lpage>440</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. X.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Towards scaling up classification-based speech separation</article-title>,&#x201D; <source>IEEE Transactions on Audio, Speech and Language Processing</source>, vol. <volume>21</volume>, no. <issue>7</issue>, pp. <fpage>1381</fpage>&#x2013;<lpage>1390</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>On ideal binary mask as the computational goal of auditory scene analysis</article-title>,&#x201D; in <conf-name>Workshop on Speech Separation by Humans and Machines</conf-name>, <conf-loc>Montreal, Canada</conf-loc>, pp. <fpage>181</fpage>&#x2013;<lpage>197</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. S.</given-names> <surname>Williamson</surname></string-name>, <string-name><given-names>Y. X.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Complex ratio masking for monaural speech separation</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>24</volume>, no. <issue>3</issue>, pp. <fpage>483</fpage>&#x2013;<lpage>492</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Erdogan</surname></string-name>, <string-name><given-names>J. R.</given-names> <surname>Hershey</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Watanabe</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Le Roux</surname></string-name></person-group>, &#x201C;<article-title>Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Acoustics, Speech and Signal Processing</conf-name>, <conf-loc>South Brisbane, QLD, Australia</conf-loc>, pp. <fpage>708</fpage>&#x2013;<lpage>712</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Narayanan</surname></string-name> and <string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>On training targets for supervised speech separation</article-title>,&#x201D; <source>IEEE-ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>22</volume>, no. <issue>12</issue>, pp. <fpage>1849</fpage>&#x2013;<lpage>1858</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name> and <string-name><given-names>J. T.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Supervised speech separation based on deep learning: An overview</article-title>,&#x201D; <source>IEEE-ACM Transactions on Audio, Speech, and Language Processing</source>, vol. <volume>26</volume>, no. <issue>10</issue>, pp. <fpage>1702</fpage>&#x2013;<lpage>1726</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Du</surname></string-name>, <string-name><given-names>L. R.</given-names> <surname>Dai</surname></string-name> and <string-name><given-names>C. H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>Multiple-target deep learning for LSTM-RNN based speech enhancement</article-title>,&#x201D; in <conf-name>Conf. on Hands-Free Communications and Microphone Arrays</conf-name>, <conf-loc>San Francisco, CA, USA</conf-loc>, pp. <fpage>136</fpage>&#x2013;<lpage>140</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P. S.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hasegawa-Johnson</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Smaragdis</surname></string-name></person-group>, &#x201C;<article-title>Joint optimization of masks and deep recurrent neural networks for monaural source separation</article-title>,&#x201D; <source>IEEE-ACM Transactions on Audio Speech and Language Processing</source>, vol. <volume>23</volume>, no. <issue>12</issue>, pp. <fpage>2136</fpage>&#x2013;<lpage>2147</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Du</surname></string-name>, <string-name><given-names>L. R.</given-names> <surname>Dai</surname></string-name> and <string-name><given-names>C. H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>An experimental study on speech enhancement based on deep neural networks</article-title>,&#x201D; <source>IEEE Signal Processing Letters</source>, vol. <volume>21</volume>, no. <issue>1</issue>, pp. <fpage>65</fpage>&#x2013;<lpage>68</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Du</surname></string-name>, <string-name><given-names>L. R.</given-names> <surname>Dai</surname></string-name> and <string-name><given-names>C. H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>A regression approach to speech enhancement based on deep neural networks</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio Speech and Language Processing</source>, vol. <volume>23</volume>, no. <issue>1</issue>, pp. <fpage>7</fpage>&#x2013;<lpage>19</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Kounovsky</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Malek</surname></string-name></person-group>, &#x201C;<article-title>Single channel speech enhancement using convolutional neural network</article-title>,&#x201D; in <conf-name>IEEE Int. Workshop of Electronics Control Measurement Signals and Their Application to Mechatronics</conf-name>, <conf-loc>Donostia-San, Spain</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mustaqeem</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Kwon</surname></string-name></person-group>, &#x201C;<article-title>1D-Cnn: Speech emotion recognition system using a stacked network with dilated cnn features</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>67</volume>, no. <issue>3</issue>, pp. <fpage>4039</fpage>&#x2013;<lpage>4059</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>Q. Y.</given-names> <surname>Zhong</surname></string-name>, <string-name><given-names>T. Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>S. Y.</given-names> <surname>Lu</surname></string-name> and <string-name><given-names>H. M.</given-names> <surname>Hu</surname></string-name></person-group>, &#x201C;<article-title>Speech enhancement via residual dense generative adversarial network</article-title>,&#x201D; <source>Computer Systems Science and Engineering</source>, vol. <volume>38</volume>, no. <issue>3</issue>, pp. <fpage>279</fpage>&#x2013;<lpage>289</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L. C.</given-names> <surname>Yann</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Geoffrey</surname></string-name></person-group>, &#x201C;<article-title>Deep learning</article-title>,&#x201D; <source>Nature</source>, vol. <volume>521</volume>, no. <issue>7553</issue>, pp. <fpage>436</fpage>&#x2013;<lpage>444</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Pascanu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>On the difficulty of training recurrent neural networks</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <conf-loc>Atlanta, GA, USA</conf-loc>, pp. <fpage>2347</fpage>&#x2013;<lpage>2355</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Weninger</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Erdogan</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Watanabe</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Vincent</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Le Roux</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR</article-title>,&#x201D; in <conf-name>Int. Conf. on Latent Variable Analysis and Signal Separation</conf-name>, <conf-loc>Tech Univ Liberec, Liberec, Czech Republic</conf-loc>, pp. <fpage>91</fpage>&#x2013;<lpage>99</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. L.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y. P.</given-names> <surname>Tian</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Kong</surname></string-name>, <string-name><given-names>B. N.</given-names> <surname>Zhong</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Fu</surname></string-name></person-group>, &#x201C;<article-title>Residual dense network for image super-resolution</article-title>,&#x201D; in <conf-name>IEEE/CVF Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, pp. <fpage>2472</fpage>&#x2013;<lpage>2481</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Tan</surname></string-name> and <string-name><given-names>D. L.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>A convolutional recurrent neural network for real-time speech enhancement</article-title>,&#x201D; in <conf-name>19th Annual Conf. of the Int. Speech Communication Association 2018</conf-name>, <conf-loc>Hyderabad, India</conf-loc>, pp. <fpage>3229</fpage>&#x2013;<lpage>3233</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. R.</given-names> <surname>Park</surname></string-name> and <string-name><given-names>J. W.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>A fully convolutional neural network for speech enhancement</article-title>,&#x201D; in <conf-name>18th Annual Conf. of the Int. Speech Communication Association 2017</conf-name>, <conf-loc>Stockholm, Sweden</conf-loc>, pp. <fpage>1993</fpage>&#x2013;<lpage>1997</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zarar</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Tashev</surname></string-name> and <string-name><given-names>C. H.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>Convolutional-recurrent neural networks for speech enhancement</article-title>,&#x201D; in <conf-name>IEEE Int. Conf. on Acoustics, Speech and Signal Processing</conf-name>, <conf-loc>Calgary, Canada</conf-loc>, pp. <fpage>2401</fpage>&#x2013;<lpage>2405</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="web"><person-group person-group-type="author"><string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name> and <string-name><given-names>R. R.</given-names> <surname>Salakhutdinov</surname></string-name></person-group>, <source>Improving Neural Networks by Preventing co-Adaptation of Feature Detectors</source>, <year>2012</year>. [Online]. Available: <uri xlink:href="https://arxiv.org/abs/1207.0580">https://arxiv.org/abs/1207.0580</uri>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Van Der Maaten</surname></string-name> and <string-name><given-names>K. Q.</given-names> <surname>Weinberger</surname></string-name></person-group>, &#x201C;<article-title>Densely connected convolutional networks</article-title>,&#x201D; in <conf-name>IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>2261</fpage>&#x2013;<lpage>2269</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Ronneberger</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Fischer</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Brox</surname></string-name></person-group>, &#x201C;<article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>,&#x201D; in <conf-name>Int. Conf. on Medical Image Computing and Computer-Assisted Intervention</conf-name>, <conf-loc>Munich, Germany</conf-loc>, pp. <fpage>234</fpage>&#x2013;<lpage>241</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Cummins</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Grimaldi</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Leonard</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Simko</surname></string-name></person-group>, &#x201C;<article-title>The CHAINS corpus: Characterizing individual speakers</article-title>,&#x201D; in <conf-name>SPECOM-2006</conf-name>, <conf-loc>St Petersburg, Russia</conf-loc>, pp. <fpage>431</fpage>&#x2013;<lpage>435</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Varga</surname></string-name> and <string-name><given-names>H. J. M.</given-names> <surname>Steeneken</surname></string-name></person-group>, &#x201C;<article-title>Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems</article-title>,&#x201D; <source>Speech Communication</source>, vol. <volume>12</volume>, no. <issue>3</issue>, pp. <fpage>247</fpage>&#x2013;<lpage>251</lpage>, <year>1993</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D. P.</given-names> <surname>Kingma</surname></string-name> and <string-name><given-names>J. L.</given-names> <surname>Ba</surname></string-name></person-group>, &#x201C;<article-title>Adam: A method for stochastic optimization</article-title>,&#x201D; in <conf-name>Int. Conf. on Learning Representations</conf-name>, <conf-loc>San Diego, CA, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Vincent</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Gribonval</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Fevotte</surname></string-name></person-group>, &#x201C;<article-title>Performance measurement in blind audio source separation</article-title>,&#x201D; <source>IEEE Transactions on Audio Speech and Language Processing</source>, vol. <volume>14</volume>, no. <issue>4</issue>, pp. <fpage>1462</fpage>&#x2013;<lpage>1469</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>ITU-T P.862.2</collab></person-group>, &#x201C;Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs,&#x201D; <italic>Telecommunication Standardization Sector of ITU</italic>, <year>2007</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Jensen</surname></string-name> and <string-name><given-names>C. H.</given-names> <surname>Taal</surname></string-name></person-group>, &#x201C;<article-title>An algorithm for predicting the intelligibility of speech masked by modulated noise maskers</article-title>,&#x201D; <source>IEEE/ACM Transactions on Audio Speech and Language Processing</source>, vol. <volume>24</volume>, no. <issue>11</issue>, pp. <fpage>2009</fpage>&#x2013;<lpage>2022</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Hu</surname></string-name> and <string-name><given-names>P. C.</given-names> <surname>Loizou</surname></string-name></person-group>, &#x201C;<article-title>Evaluation of objective quality measures for speech enhancement</article-title>,&#x201D; <source>IEEE Transactions on Audio Speech and Language Processing</source>, vol. <volume>16</volume>, no. <issue>1</issue>, pp. <fpage>229</fpage>&#x2013;<lpage>238</lpage>, <year>2008</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>