<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">26259</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2022.026259</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>TP-MobNet: A Two-pass Mobile Network for Low-complexity Classification of Acoustic Scene</article-title>
<alt-title alt-title-type="left-running-head">TP-MobNet: A Two-pass Mobile Network for Low-complexity Classification of Acoustic Scene</alt-title>
<alt-title alt-title-type="right-running-head">TP-MobNet: A Two-pass Mobile Network for Low-complexity Classification of Acoustic Scene</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Seo</surname><given-names>Soonshin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Oh</surname><given-names>Junseok</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Cho</surname><given-names>Eunsoo</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Park</surname><given-names>Hosung</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Kim</surname><given-names>Gyujin</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-6" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Kim</surname><given-names>Ji-Hwan</given-names></name><xref ref-type="aff" rid="aff-2">2</xref><email>kimjihwan@sogang.ac.kr</email>
</contrib>
<aff id="aff-1"><label>1</label><institution>NAVER Corporation</institution>, <addr-line>Seongnam, 13561</addr-line>, <country>Korea</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Computer Science and Engineering, Sogang University</institution>, <addr-line>Seoul, 04107</addr-line>, <country>Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Ji-Hwan Kim. Email: <email>kimjihwan@sogang.ac.kr</email></corresp>
</author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-06-14"><day>14</day>
<month>06</month>
<year>2022</year></pub-date>
<volume>73</volume>
<issue>2</issue>
<fpage>3291</fpage>
<lpage>3303</lpage>
<history>
<date date-type="received"><day>20</day><month>12</month><year>2021</year></date>
<date date-type="accepted"><day>22</day><month>2</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2022 Seo et al.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Seo et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_26259.pdf"></self-uri>
<abstract>
<p>Acoustic scene classification (ASC) is the task of recognizing and classifying the environment in which an acoustic signal was recorded. Various deep-learning approaches to ASC have been developed, and convolutional neural networks (CNNs) have proven the most reliable and most commonly used, in part because they are well suited to building lightweight models. When ASC systems are deployed in the real world, model complexity and device robustness are essential considerations. In this paper, we propose TP-MobNet, a two-pass mobile network for low-complexity acoustic scene classification. TP-MobNet is based on MobileNetV2, with inverted residuals and linear bottlenecks, and applies coordinate attention and two-pass fusion after its mobile blocks. Coordinate attention learns long-range dependencies and precise position information in feature maps, while two-pass fusion improves generalization by capturing more diverse feature resolutions at the ends of the network. In addition, weight quantization is applied to the trained model to reduce its size. The TAU Urban Acoustic Scenes 2020 Mobile development set was used for all of the experiments. The proposed model, with a model size of 219.6 kB, achieves an accuracy of 73.94&#x0025;.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Acoustic scene classification</kwd>
<kwd>low-complexity</kwd>
<kwd>device robustness</kwd>
<kwd>two-pass mobile network</kwd>
<kwd>coordinate attention</kwd>
<kwd>weight quantization</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>The goal of acoustic scene classification (ASC) is to recognize and classify the environment in which an audio signal was recorded [<xref ref-type="bibr" rid="ref-1">1</xref>], thereby enabling a wide range of applications including surveillance [<xref ref-type="bibr" rid="ref-2">2</xref>], intelligent wearable devices, and robot sensing services. As in other fields, ASC problems have been approached using deep learning, including deep neural networks [<xref ref-type="bibr" rid="ref-3">3</xref>,<xref ref-type="bibr" rid="ref-4">4</xref>], convolutional neural networks (CNNs) [<xref ref-type="bibr" rid="ref-5">5</xref>], recurrent neural networks [<xref ref-type="bibr" rid="ref-6">6</xref>], and convolutional recurrent neural networks [<xref ref-type="bibr" rid="ref-7">7</xref>]. Of these, CNNs are widely used for ASC because they perform reliably when trained on spectrogram images of audio data [<xref ref-type="bibr" rid="ref-8">8</xref>]. Piczak [<xref ref-type="bibr" rid="ref-5">5</xref>] introduced CNNs to ASC and environmental sound classification and evaluated their potential; since then, various CNN models have been proposed for ASC [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>], and there have been several studies of low-complexity CNN models [<xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>Model complexity and device robustness are important issues when deploying ASC systems in the real world, where they often run on low-performance hardware such as mobile devices. In this paper, we propose TP-MobNet, a two-pass mobile network for low-complexity acoustic scene classification. TP-MobNet builds on MobileNetV2 [<xref ref-type="bibr" rid="ref-12">12</xref>] and MobNet [<xref ref-type="bibr" rid="ref-14">14</xref>], using inverted residuals and linear bottlenecks as in those two models. In addition, it applies coordinate attention and the proposed two-pass fusion, which lets the network train on multiple feature resolutions. Weight quantization is also used to reduce the model size. The TAU Urban Acoustic Scenes 2020 Mobile development set [<xref ref-type="bibr" rid="ref-15">15</xref>] was used for all of the experiments. The proposed TP-MobNet achieved 72.59&#x0025; accuracy as a single model and 73.94&#x0025; as an ensemble, with model sizes of 126.5 and 219.6 kB, respectively.</p>
<p>This paper is organized as follows. Section 2 presents related work involving various ASC methods. Section 3 analyzes in detail the proposed method of TP-MobNet using coordinate attention. Finally, Sections 4 and 5 present the experimental results and conclusions, respectively.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<p>The use of CNNs for ASC has been widely explored over the years. In this section, we describe various modifications to CNNs, and additional components, that have been proposed to improve ASC performance. Section 2.1 covers CNN-based ASC models and Section 2.2 describes mobile-network-based ASC models. Section 2.3 then reviews previous studies that combine attention and two-pass methods with CNN and mobile-network models.</p>
<sec id="s2_1"><label>2.1</label><title>ASC Models Based on CNNs</title>
<p>ASC has appeared as a task in every edition of the detection and classification of acoustic scenes and events (DCASE) challenge series [<xref ref-type="bibr" rid="ref-16">16</xref>]. The baseline system for DCASE 2021 Task 1a uses a deep CNN proposed by Valenti et al. [<xref ref-type="bibr" rid="ref-17">17</xref>], with a training strategy that makes full use of the limited data available; batch normalization and adjusted layer widths were applied to the original system. Since Valenti et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] recommended employing a CNN to classify short sequences of audio data, several entries in the DCASE challenge have been based on adaptations of this CNN-based network.</p>
<p>Dorfer et al. [<xref ref-type="bibr" rid="ref-18">18</xref>] proposed a system based on a CNN trained on spectrograms of the entire input segment, consisting of two convolutional layers and one fully connected layer. They also optimized their models through hyperparameter tuning.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>ASC Models Based on Mobile Networks</title>
<p>MobNet was proposed by Hu et al. [<xref ref-type="bibr" rid="ref-14">14</xref>] together with a set of CNN-based models and a mix of data augmentation strategies, in which the CNN-based systems were specialized for time- and frequency-domain operations and analysis. MobNet is a mobile network heavily based on MobileNetV2. MobileNetV2 is built on inverted residuals, in which the bottleneck layers at the input and output of the residual block are the reverse of a typical residual model [<xref ref-type="bibr" rid="ref-12">12</xref>]. The inverted residual block expands the input dimensions rather than reducing them, preserving information that might otherwise be lost in a typical residual model. To filter features in the intermediate expansion layer and reduce the model&#x0027;s complexity, MobileNetV2 employs lightweight depth-wise convolutions [<xref ref-type="bibr" rid="ref-12">12</xref>], and it removes nonlinearities from the narrow layers to retain representational power. As a result, MobileNetV2 maintains high accuracy while reducing complexity.</p>
</sec>
<sec id="s2_3"><label>2.3</label><title>ASC Models Based on Attention and Two-pass Method</title>
<p>Cao et al. [<xref ref-type="bibr" rid="ref-19">19</xref>] proposed using attention to preserve details of the feature map. Their model was based on a ResNet with inverted residual blocks, to which they added spatial and channel attention modules to compensate for crucial features that could otherwise be lost. The channel attention applies global average and max pooling to the feature map and feeds the pooled features into a two-layer neural network; after the activation functions, a weight coefficient is generated and multiplied with the feature map. The spatial attention, on the other hand, performs max and average pooling over the channel dimension, and the resulting maps are concatenated. The concatenated result is then passed through a convolution with a sigmoid activation to produce a weight coefficient, which is multiplied by the feature map.</p>
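<p>As an illustration of this type of attention, the following is a generic Keras sketch of channel and spatial attention in the spirit of the description above; it is not the exact model of Cao et al. [<xref ref-type="bibr" rid="ref-19">19</xref>], and the reduction ratio and kernel size are assumptions.</p>
<preformat preformat-type="code">
# Generic channel and spatial attention sketch (hedged; hyperparameters assumed).
import tensorflow as tf
from tensorflow.keras import layers


def channel_attention(x, reduction=8):
    c = x.shape[-1]
    avg = layers.GlobalAveragePooling2D()(x)           # global average pooling
    mx = layers.GlobalMaxPooling2D()(x)                # global max pooling
    mlp1 = layers.Dense(c // reduction, activation="relu")
    mlp2 = layers.Dense(c)
    w = tf.sigmoid(mlp2(mlp1(avg)) + mlp2(mlp1(mx)))   # shared two-layer network
    return x * layers.Reshape((1, 1, c))(w)            # weight each channel


def spatial_attention(x, kernel_size=7):
    avg = tf.reduce_mean(x, axis=-1, keepdims=True)    # average over channels
    mx = tf.reduce_max(x, axis=-1, keepdims=True)      # max over channels
    w = layers.Conv2D(1, kernel_size, padding="same",
                      activation="sigmoid")(layers.Concatenate()([avg, mx]))
    return x * w                                       # weight each position
</preformat>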
<p>McDonnell et al. [<xref ref-type="bibr" rid="ref-20">20</xref>] proposed late fusion, which involves combining two CNN paths before the network output. The input spectrogram is split into two parts: high frequencies and low frequencies, which are then averaged over overlapping views rather than global views, and the two paths are then merged in the final layer.</p>
</sec>
</sec>
<sec id="s3"><label>3</label><title>Two-pass Mobile Networks Using Coordinate Attention</title>
<p>We designed TP-MobNet based on MobNet and MobileNetV2; the baseline architecture and the two-pass fusion are described in Sections 3.1 and 3.2. Coordinate attention is also added, and its mechanism is described in Section 3.3. Finally, Section 3.4 describes weight quantization.</p>
<sec id="s3_1"><label>3.1</label><title>Baseline</title>
<p>The proposed baseline model, shown in <xref ref-type="table" rid="table-1">Tab. 1</xref>, is mostly made up of mobile blocks. A first two-dimensional convolution and three mobile blocks are applied to the input features; the mobile blocks are built around the channel dimension and have 32, 48, and 64 channels, respectively. The features are then passed through batch normalization and a ReLU activation, followed by one convolution and dropout, and coordinate attention is applied. Finally, the attended features are fed into a last convolution, after which pooling and a softmax are applied.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Structure of the proposed baseline</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(128, 423, 3)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 32, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(64, 212, 32)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x0026; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Mobile block 1</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 32, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(32, 106, 32)</td>
</tr>
<tr>
<td align="left">Mobile block 2</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 48, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(16, 53, 48)</td>
</tr>
<tr>
<td align="left">Mobile block 3</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 64, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(8, 27, 64)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 64, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 27, 64)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 64, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 27, 64)</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Attention</td>
<td align="left"/>
<td align="left">(8, 27, 64)</td>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 10, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 27, 10)</td>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Pooling</td>
<td align="left"/>
<td align="left">(1, 10)</td>
</tr>
<tr>
<td align="left">Softmax</td>
<td align="left"/>
<td align="left">(1, 10)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Following MobileNetV2, the mobile blocks use linear bottlenecks and inverted residuals. As shown in <xref ref-type="table" rid="table-2">Tab. 2</xref>, a mobile block consists of three bottleneck layers: one bottleneck and two residual bottlenecks. The output features of each bottleneck are passed linearly to the next bottleneck without an activation.</p>
<table-wrap id="table-2"><label>Table 2</label><caption><title>Structure of mobile block</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(H, W, C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">Bottleneck</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, C<sub>output</sub>, 3&#x2009;&#x00D7;&#x2009;3,</td>
<td align="left">(H/2, W/2, C<sub>output</sub>)</td>
</tr>
<tr>
<td align="left">Residual bottleneck</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C<sub>output</sub>, 3&#x2009;&#x00D7;&#x2009;3,</td>
<td align="left">(H/2, W/2, C<sub>output</sub>)</td>
</tr>
<tr>
<td align="left">Residual bottleneck</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C<sub>output</sub>, 3&#x2009;&#x00D7;&#x2009;3,</td>
<td align="left">(H/2, W/2, C<sub>output</sub>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="table-3">Tabs. 3</xref> and <xref ref-type="table" rid="table-4">4</xref>, the feature dimensions are halved in the bottleneck, whose depth-wise convolution uses stride 2, whereas the residual bottleneck, which uses a skip connection, retains the feature dimensions. In addition, all bottlenecks expand the channel dimension at the first convolution and recover it at the last convolution.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Structure of bottleneck</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(H, W, C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 2C<sub>input</sub>, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(H, W, 2C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Depthwise convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 2C<sub>input</sub>, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(H/2, W/2, 2C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C<sub>output</sub>, 1&#x2009;&#x00D7;&#x2009;1,</td>
<td align="left">(H/2, W/2, C<sub>output</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left"/>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-4"><label>Table 4</label><caption><title>Structure of residual bottleneck</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(H, W, C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 2C<sub>input</sub>, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(H, W, 2C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Depthwise convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 2C<sub>input</sub>, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(H, W, 2C<sub>input</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C<sub>output</sub>, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(H, W, C<sub>output</sub>)</td>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left">residual</td>
</tr>
<tr>
<td align="left">Add</td>
<td align="left">residual &#x002B; input</td>
<td align="left">(H, W, C<sub>output</sub>)</td>
</tr>
</tbody>
</table>
</table-wrap>
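<p>For concreteness, the following is a minimal Keras sketch of the bottleneck, residual bottleneck, and mobile block structures in Tabs. 2&#x2013;4, assuming a channel expansion factor of 2 as listed in the tables; the padding choices and the absence of biases are assumptions.</p>
<preformat preformat-type="code">
# Hedged sketch of the structures in Tabs. 2-4 (TensorFlow/Keras assumed).
import tensorflow as tf
from tensorflow.keras import layers


def bottleneck(x, c_out):
    """Stride-2 bottleneck: expand channels, depthwise conv, linear projection (Tab. 3)."""
    c_in = x.shape[-1]
    y = layers.Conv2D(2 * c_in, 1, padding="same", use_bias=False)(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.DepthwiseConv2D(3, strides=2, padding="same", use_bias=False)(y)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(c_out, 1, padding="same", use_bias=False)(y)
    return layers.BatchNormalization()(y)      # linear bottleneck: no activation


def residual_bottleneck(x, c_out):
    """Stride-1 bottleneck with a skip connection (Tab. 4); c_out must equal the input channels."""
    c_in = x.shape[-1]
    y = layers.Conv2D(2 * c_in, 1, padding="same", use_bias=False)(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.DepthwiseConv2D(3, strides=1, padding="same", use_bias=False)(y)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(c_out, 1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([y, x])                # residual + input


def mobile_block(x, c_out):
    """One bottleneck followed by two residual bottlenecks (Tab. 2)."""
    x = bottleneck(x, c_out)
    x = residual_bottleneck(x, c_out)
    return residual_bottleneck(x, c_out)
</preformat>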
</sec>
<sec id="s3_2"><label>3.2</label><title>Two-pass Fusion</title>
<p>The proposed TP-MobNet uses two fusion approaches, as shown in <xref ref-type="table" rid="table-5">Tab. 5</xref> and <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. At the first convolution, the output features of two convolutions with different strides are fused (early fusion), and the output features of the last convolution are divided in half. Coordinate attention is applied to one half of the divided features but not to the other. After pooling and softmax are applied to both outputs, they are interpolated (late fusion).</p>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Structure of the proposed TP-MobNet</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(128, 423, 3)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">Stride &#x003D; (2, 1), 32, 3&#x2009;&#x00D7;&#x2009;3<break/>
Stride &#x003D; (2, 2), 32, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(64, 423, 32)<break/>(64, 212, 32)</td>
</tr>
<tr>
<td align="left">Early fusion</td>
<td align="left"/>
<td align="left">(64, 635, 32)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Mobile block 1</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 32, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(32, 318, 32)</td>
</tr>
<tr>
<td align="left">Mobile block 2</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 48, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(16, 159, 48)</td>
</tr>
<tr>
<td align="left">Mobile block 3</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 64, 3&#x2009;&#x00D7;&#x2009;3</td>
<td align="left">(8, 80, 64)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 32, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 80, 64)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; ReLU</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 72, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 80, 72)</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;2, 10, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(8, 80, 10)</td>
</tr>
<tr>
<td align="left">BatchNorm</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Split</td>
<td align="left"/>
<td align="left">(8, 40, 10)<break/>(8, 40, 10)</td>
</tr>
<tr>
<td align="left">Attention &#x002B; pooling &#x002B; softmax</td>
<td align="left"/>
<td align="left">(1, 10) &#x2192; output<sub>A</sub></td>
</tr>
<tr>
<td align="left">Pooling &#x002B; softmax</td>
<td align="left"/>
<td align="left">(1, 10) &#x2192; output<sub>B</sub></td>
</tr>
<tr>
<td align="left">Late fusion</td>
<td align="left"><inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03BB;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext>outputA</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2217;</mml:mo><mml:mrow><mml:mtext>outputB</mml:mtext></mml:mrow></mml:math></inline-formula></td>
<td align="left">(1, 10)</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="fig-1"><label>Figure 1</label><caption><title>Structure of the proposed TP-MobNet</title></caption><graphic mimetype="image" mime-subtype="png" xlink:href="CMC_26259-fig-1.png"/></fig>
<p>The two fusions are applied for the following reasons. The output of a convolution is computed over only a small window, and depending on the values within that window, overfitting may occur; early fusion is therefore applied to reduce overfitting caused by the window. In addition, coordinate attention has limited capacity for modeling channel relationships, so late fusion is applied to capture channel relationships in the final output.</p>
<p>The experiments showed that combining early fusion and late fusion has an effect similar to an ensemble. Several strides were tested for the first convolution: with stride (2, 1) the features can only be fused along the time axis, whereas with stride (2, 2) they can be fused along either axis. It was confirmed that proper performance is achieved only when the split operation uses the same axis as the early fusion.</p>
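<p>The following is a minimal sketch of the two fusions in Tab. 5, assuming TensorFlow/Keras. The mobile-block backbone and the coordinate attention block are abbreviated to plain placeholder convolutions so that only the fusion mechanics are shown; the interpolation weight follows the value of 0.5 used in our experiments.</p>
<preformat preformat-type="code">
# Hedged sketch of early and late fusion (Tab. 5); backbone layers are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

lam = 0.5  # late-fusion interpolation weight (lambda in Tab. 5)

inp = layers.Input(shape=(128, 423, 3))

# Early fusion: two first convolutions with different strides, concatenated
# along the time axis, matching the (64, 635, 32) output in Tab. 5.
a = layers.Conv2D(32, 3, strides=(2, 1), padding="same")(inp)   # (64, 423, 32)
b = layers.Conv2D(32, 3, strides=(2, 2), padding="same")(inp)   # (64, 212, 32)
x = layers.Concatenate(axis=2)([a, b])                          # (64, 635, 32)
x = layers.ReLU()(layers.BatchNormalization()(x))

# Placeholders for the three stride-2 mobile blocks and the 1x1 convolutions
# of Tab. 5, reducing the map to (8, 80, 10) with 10 class channels.
x = layers.Conv2D(32, 3, strides=2, padding="same")(x)          # (32, 318, 32)
x = layers.Conv2D(48, 3, strides=2, padding="same")(x)          # (16, 159, 48)
x = layers.Conv2D(64, 3, strides=2, padding="same")(x)          # (8, 80, 64)
x = layers.Conv2D(10, 1)(x)                                     # (8, 80, 10)

# Split the final feature map in half along the time axis.
x_a, x_b = tf.split(x, 2, axis=2)

# Pass A would apply coordinate attention (Section 3.3) before pooling;
# pass B applies pooling and softmax only.
out_a = layers.Softmax()(layers.GlobalAveragePooling2D()(x_a))
out_b = layers.Softmax()(layers.GlobalAveragePooling2D()(x_b))

# Late fusion: (1 - lambda) * output_A + lambda * output_B.
out = (1.0 - lam) * out_a + lam * out_b
model = tf.keras.Model(inp, out)
</preformat>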
</sec>
<sec id="s3_3"><label>3.3</label><title>Coordinate Attention</title>
<p>We utilize coordinate attention [<xref ref-type="bibr" rid="ref-21">21</xref>], a recent method of embedding positional information into channel attention for mobile networks. In contrast to squeeze-and-excitation channel attention [<xref ref-type="bibr" rid="ref-22">22</xref>], coordinate attention decomposes the feature map into two feature encodings by average pooling along each of the two spatial directions. It can thus learn long-range dependencies as well as precise position information in feature maps.</p>
<p>As shown in <xref ref-type="table" rid="table-6">Tab. 6</xref>, two two-dimensional average poolings are used for the X and Y axes. Thereafter, after the output features have been concatenated, the number of channels is adjusted based on the reduction ratio <italic>r</italic>. Following BN, a swish activation and ReLU6 function are used as activation. For each attention weight, the data is divided into X and Y axes. The attention weights are applied to the input features by multiplying them.</p>
<table-wrap id="table-6"><label>Table 6</label><caption><title>Structure of coordinate attention</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Layer name</th>
<th align="left">Layer config</th>
<th align="left">Output feature size</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Input</td>
<td align="left"/>
<td align="left">(H, W, C)</td>
</tr>
<tr>
<td align="left">Pooling</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, 1 &#x00D7; W<break/>stride&#x2009;&#x003D;&#x2009;1, H &#x00D7; 1</td>
<td align="left">(1, W, C)<break/>(H, 1, C)</td>
</tr>
<tr>
<td align="left">Concatenation</td>
<td align="left"></td>
<td align="left">(H &#x002B; W, 1, C)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C/<italic>r</italic>, 1&#x2009;&#x00D7;&#x2009;1</td>
<td align="left">(H &#x002B; W, 1, C/<italic>r</italic>)</td>
</tr>
<tr>
<td align="left">BatchNorm &#x002B; activation</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Divide</td>
<td align="left"/>
<td align="left">(1, W, C/<italic>r</italic>)<break/>(H, 1, C/<italic>r</italic>)</td>
</tr>
<tr>
<td align="left">Convolution</td>
<td align="left">stride&#x2009;&#x003D;&#x2009;1, C, 1&#x2009;&#x00D7;&#x2009;1,</td>
<td align="left">(1, W, C)<break/>(H, 1, C)</td>
</tr>
<tr>
<td align="left">Sigmoid</td>
<td align="left"/>
<td align="left">attention weights</td>
</tr>
<tr>
<td align="left">Multiplication</td>
<td align="left">input &#x002A; attention weights</td>
<td align="left">(H, W, C)</td>
</tr>
</tbody>
</table>
</table-wrap>
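<p>A hedged Keras sketch of the coordinate attention block in Tab. 6 follows. The activation after batch normalization is written here as a hard-swish built from ReLU6, which is one reading of the activation described above, and the reduction ratio r is a free hyperparameter; both are assumptions rather than the exact implementation.</p>
<preformat preformat-type="code">
# Hedged sketch of coordinate attention (Tab. 6); fixed spatial dims assumed.
import tensorflow as tf
from tensorflow.keras import layers


def coordinate_attention(x, r=8):
    h, w, c = x.shape[1], x.shape[2], x.shape[3]

    # Direction-wise average pooling: one descriptor per row and per column.
    pool_h = layers.AveragePooling2D(pool_size=(1, w))(x)          # (H, 1, C)
    pool_w = layers.AveragePooling2D(pool_size=(h, 1))(x)          # (1, W, C)
    pool_w = layers.Permute((2, 1, 3))(pool_w)                     # (W, 1, C)

    # Concatenate along the spatial axis and reduce channels by r.
    y = layers.Concatenate(axis=1)([pool_h, pool_w])               # (H+W, 1, C)
    y = layers.Conv2D(c // r, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = y * tf.nn.relu6(y + 3.0) / 6.0                             # hard-swish

    # Split back into the two directions and restore the channel count.
    y_h, y_w = tf.split(y, [h, w], axis=1)
    y_w = layers.Permute((2, 1, 3))(y_w)                           # (1, W, C/r)
    a_h = tf.sigmoid(layers.Conv2D(c, 1)(y_h))                     # (H, 1, C)
    a_w = tf.sigmoid(layers.Conv2D(c, 1)(y_w))                     # (1, W, C)

    # Attention weights broadcast-multiplied with the input features.
    return x * a_h * a_w


# Example usage on the feature-map size used by the baseline (Tab. 1).
inp = layers.Input(shape=(8, 27, 64))
attended = coordinate_attention(inp)
</preformat>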
</sec>
<sec id="s3_4"><label>3.4</label><title>Weight Quantization</title>
<p>Our trained model is quantized for efficient integer-arithmetic-only inference [<xref ref-type="bibr" rid="ref-23">23</xref>]. Converting 32-bit floating-point operations to low-precision 8-bit operations can speed up a CNN model while reducing its size [<xref ref-type="bibr" rid="ref-3">3</xref>]. The TensorFlow Lite converter [<xref ref-type="bibr" rid="ref-24">24</xref>] supports this 8-bit quantization technique.</p>
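<p>As a sketch, post-training quantization with the TensorFlow Lite converter can be applied to a trained Keras model as follows; the representative-dataset generator and the variable names are assumptions, and full integer quantization of activations is optional on top of the 8-bit weight quantization.</p>
<preformat preformat-type="code">
# Hedged post-training quantization sketch with the TensorFlow Lite converter [24].
import tensorflow as tf

# `model` is the trained Keras model; `calibration_batches` is a hypothetical
# iterable of a few log-Mel input batches used only for calibration.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # 8-bit weight quantization

def representative_data_gen():
    # Optional: supply sample inputs so activations can also be calibrated.
    for features in calibration_batches:
        yield [tf.cast(features, tf.float32)]

converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()

with open("tp_mobnet_int8.tflite", "wb") as f:
    f.write(tflite_model)
</preformat>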
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experiments</title>
<p>We evaluated the proposed TP-MobNet using the TAU Urban Acoustic Scenes 2020 Mobile dataset. Sections 4.1 and 4.2 describe the specifications and processing of this dataset, Section 4.3 describes the training in detail, and Section 4.4 presents the experimental results.</p>
<sec id="s4_1"><label>4.1</label><title>Dataset</title>
<p>The TAU Urban Acoustic Scenes 2020 Mobile dataset consists of a development dataset and an evaluation dataset (the evaluation dataset is not published). The ten acoustic scene classes are airport, shopping mall, metro station, street pedestrian, public square, street traffic, tram, bus, metro, and park. As shown in <xref ref-type="table" rid="table-7">Tab. 7</xref>, the development dataset comprises 10-s segments recorded with three real devices (A&#x2013;C) and six simulated devices (S1&#x2013;S6), totaling 64 h of audio in 23,040 segments.</p>
<table-wrap id="table-7"><label>Table 7</label><caption><title>Description of the TAU urban acoustic scenes 2020 mobile dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">Num. of devices</th>
<th align="left">Num. of segments</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Development dataset (full)</td>
<td align="left">9</td>
<td align="left">23,040</td>
</tr>
<tr>
<td align="left">Development dataset (for cross-validation, training)</td>
<td align="left">6</td>
<td align="left">13,965</td>
</tr>
<tr>
<td align="left">Development dataset (for cross-validation, test)</td>
<td align="left">9</td>
<td align="left">2,970</td>
</tr>
<tr>
<td align="left">Evaluation dataset</td>
<td align="left">11</td>
<td align="left">7,920</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For the cross-validation setup, the development dataset is separated into 70 percent training and 30 percent test data; several segments are left unused so that the test set remains balanced. In addition, three of the simulated devices (S4&#x2013;S6) appear only in the test set. The training and test sets contain 13,965 and 2,970 segments, respectively. The evaluation dataset consists of 10-s segments recorded with 11 devices, including a previously unseen real device (D) and unseen simulated devices (S7&#x2013;S11), for a total of 7,920 segments.</p>
</sec>
<sec id="s4_2"><label>4.2</label><title>Data Preprocessing and Augmentations</title>
<p>All of the audio segments were recorded in mono at a sampling rate of 44.1&#x2005;kHz with a 24-bit resolution per sample. For each 10-s input segment, a 2048-point FFT was computed every 1024 samples and the power spectrum was derived, yielding 431 frames per segment. Log-Mel filterbank features with 128 frequency bins were then extracted, and mean and variance normalization was applied to each frequency bin. Delta and delta-delta features were computed from the normalized log-Mel filterbank features and stacked along the channel axis. As a result, each input feature had a shape of 128&#x2009;&#x00D7;&#x2009;423&#x2009;&#x00D7;&#x2009;3.</p>
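<p>The following is a hedged sketch of this feature pipeline using librosa; the exact normalization statistics and any frame trimming that yields 423 frames are implementation details not fully specified here.</p>
<preformat preformat-type="code">
# Hedged log-Mel + delta feature extraction sketch (librosa assumed).
import librosa
import numpy as np

# Load a 10-s segment at its native sampling rate.
y, sr = librosa.load("segment.wav", sr=None, mono=True)

# Power spectrogram: 2048-point FFT every 1024 samples, 128 Mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=1024, n_mels=128, power=2.0)
log_mel = librosa.power_to_db(mel)                          # (128, T)

# Per-frequency-bin mean and variance normalization.
log_mel = (log_mel - log_mel.mean(axis=1, keepdims=True)) / \
          (log_mel.std(axis=1, keepdims=True) + 1e-8)

# Delta and delta-delta stacked on the channel axis: (128, T, 3).
delta = librosa.feature.delta(log_mel)
delta2 = librosa.feature.delta(log_mel, order=2)
feature = np.stack([log_mel, delta, delta2], axis=-1)
</preformat>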
<p>Mixup [<xref ref-type="bibr" rid="ref-25">25</xref>], spectrum augmentation [<xref ref-type="bibr" rid="ref-26">26</xref>], spectrum correction [<xref ref-type="bibr" rid="ref-14">14</xref>], pitch shifting, speed change, and audio mixing were used as data augmentation methods for the features. Mixup and spectrum augmentation were applied during training: mixup with an alpha value of 0.4 was applied to each mini-batch of input features, which were also randomly masked along the time and frequency axes.</p>
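<p>As an illustration of the in-training augmentation, the following is a minimal mixup sketch with alpha set to 0.4; the time and frequency masking of spectrum augmentation is omitted for brevity, and the batch variables are assumptions.</p>
<preformat preformat-type="code">
# Minimal mixup sketch (alpha = 0.4) applied to one mini-batch.
import numpy as np

def mixup_batch(x, y, alpha=0.4):
    """Mix a batch of features x and one-hot labels y with a Beta(alpha, alpha) weight."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[idx]
    y_mix = lam * y + (1.0 - lam) * y[idx]
    return x_mix, y_mix
</preformat>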
<p>The remaining augmentation methods, spectrum correction, pitch shifting, speed change, and audio mixing, were applied before training. For spectrum correction, a reference device spectrum was generated by averaging the spectra of all training devices except device A, and this reference spectrum was used to adjust the spectra of device A. In addition, the acoustic signals in the training data were padded and cropped to randomly shift the pitch and change the speed, and acoustic signals from the same class were mixed at random. As shown in <xref ref-type="table" rid="table-8">Tab. 8</xref>, these data augmentations increased the total amount of training data.</p>
<table-wrap id="table-8"><label>Table 8</label><caption><title>Comparison of data amounts using data augmentations</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">Num. of devices</th>
<th align="left">Num. of segments</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Development dataset (full)</td>
<td align="left">9</td>
<td align="left">106,560</td>
</tr>
<tr>
<td align="left">Development dataset (for cross-validation, training)</td>
<td align="left">6</td>
<td align="left">66,075</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_3"><label>4.3</label><title>Training Details</title>
<p>TensorFlow 2.0 and Keras were used in all of the experiments presented here. The optimizer was stochastic gradient descent with a momentum of 0.9 and a weight decay of 10<sup>&#x2212;6</sup>, and categorical cross-entropy was used as the loss. All of our models were trained for 256 epochs with a batch size of 32. The learning rate was initially set to 0.1 and was reset at epochs 3, 7, 15, 31, 127, and 255. We selected the checkpoint with the highest validation accuracy as the best model. The late-fusion interpolation value &#x03BB; was set to 0.5.</p>
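<p>A sketch of this training configuration is given below, assuming the TensorFlow 2.0-era Keras API; the warm-restart callback is an assumption about how the listed reset epochs were implemented, and the training arrays are placeholders.</p>
<preformat preformat-type="code">
# Hedged training configuration sketch (TensorFlow 2.0 / Keras assumed).
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9,
                                    decay=1e-6)   # `decay` as in the TF 2.0 SGD API
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

RESET_EPOCHS = {3, 7, 15, 31, 127, 255}

def reset_lr(epoch, lr):
    # Restart the learning rate at the listed epochs, otherwise keep it unchanged.
    return 0.1 if epoch in RESET_EPOCHS else lr

# x_train, y_train, x_val, y_val are placeholder arrays of features and labels.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=256, batch_size=32,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(reset_lr)])
</preformat>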
</sec>
<sec id="s4_4"><label>4.4</label><title>Experimental Results</title>
<p><xref ref-type="table" rid="table-9 table-10 table-11 table-12 table-13 table-14">Tabs. 9&#x2013;14</xref> provide the experimental results and details. <xref ref-type="table" rid="table-9">Tab. 9</xref> shows the experimental findings broken down by model type and data normalization. As a starting point, we constructed two models. Small FCNN [<xref ref-type="bibr" rid="ref-17">17</xref>] and MobNet [<xref ref-type="bibr" rid="ref-17">17</xref>] are two of them. Small FCNN had an accuracy of 64.04&#x0025; and 66.09&#x0025;, depending on whether or not data normalization was used. The accuracy standards for the proposed MobNet, on the other hand, were 60.57&#x0025; and 67.24&#x0025;. When both models were normalized, it was confirmed that they performed better.</p>
<table-wrap id="table-9"><label>Table 9</label><caption><title>Experimental results according to model types and applying data normalization</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">w/norm</th>
<th align="left">w/data augs.</th>
<th align="left">Params.</th>
<th align="left">Size [kB]</th>
<th align="left">Acc. [&#x0025;]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">Small FCNN</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">117,169</td>
<td align="left">117.1</td>
<td align="left">64.04</td>
</tr>
<tr>
<td align="left">&#x221A;-</td>
<td align="left">&#x221A;</td>
<td align="left">117,169</td>
<td align="left">117.1</td>
<td align="left">66.09</td>
</tr>
<tr>
<td align="left" rowspan="2">MobNet</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">38,780</td>
<td align="left">124.5</td>
<td align="left">60.57</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">38,780</td>
<td align="left">124.5</td>
<td align="left"><bold>67.24</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-10"><label>Table 10</label><caption><title>Experimental results according to data augmentations</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">w/norm.</th>
<th align="left">w/spec corr.</th>
<th align="left">w/pitch</th>
<th align="left">w/speed</th>
<th align="left">w/noise</th>
<th align="left">w/mix</th>
<th align="left">Params.</th>
<th align="left">Size [kB]</th>
<th align="left">Acc. [&#x0025;]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="8">MobNet</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left" rowspan="8">38,780</td>
<td align="left" rowspan="8">124.5</td>
<td align="left">67.24</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">67.81</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">67.91</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">66.77</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">67.27</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left"><bold>70.30</bold></td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">66.90</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">68.42</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-11"><label>Table 11</label><caption><title>Experimental results according to model hyperparameters</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">No. of filters</th>
<th align="left">Spectrogram split</th>
<th align="left">Params.</th>
<th align="left">Size [kB]</th>
<th align="left">Acc. [&#x0025;]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="5">MobNet</td>
<td align="left">12</td>
<td align="left">&#x221A;</td>
<td align="left">38,780</td>
<td align="left">124.5</td>
<td align="left">70.30</td>
</tr>
<tr>
<td align="left">12</td>
<td align="left"></td>
<td align="left">19,873</td>
<td align="left">64.23</td>
<td align="left">67.58</td>
</tr>
<tr>
<td align="left">24</td>
<td align="left"></td>
<td align="left">69,579</td>
<td align="left">95.87</td>
<td align="left">70.74</td>
</tr>
<tr>
<td align="left">32</td>
<td align="left"></td>
<td align="left">70,634</td>
<td align="left">97.42</td>
<td align="left">70.81</td>
</tr>
<tr>
<td align="left">32&#x002A;</td>
<td align="left"></td>
<td align="left">97,820</td>
<td align="left">121.1</td>
<td align="left"><bold>71.45</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-12"><label>Table 12</label><caption><title>Experimental results according to the proposed methods</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">No.<break/>of<break/>filters</th>
<th align="left">w/coordinate<break/>attention</th>
<th align="left">w/two-pass<break/>methods</th>
<th align="left">Stride<break/>in early<break/>fusion</th>
<th align="left">w/ensemble</th>
<th align="left">Params.</th>
<th align="left">Size<break/>[kB]</th>
<th align="left">Acc.<break/>                          [&#x0025;]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="4">MobNet</td>
<td align="left" rowspan="2">32</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">70,634</td>
<td align="left">97.42</td>
<td align="left">70.81</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">70,883</td>
<td align="left">98.47</td>
<td align="left">71.25</td>
</tr>
<tr>
<td align="left" rowspan="2">32&#x002A;</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">97,820</td>
<td align="left">121.1</td>
<td align="left">71.45</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left">98,053</td>
<td align="left">122.1</td>
<td align="left">71.82</td>
</tr>
<tr>
<td align="left" rowspan="4">TP-MobNet</td>
<td align="left" rowspan="4">32&#x002A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x007B;2, 1&#x007D;</td>
<td align="left"></td>
<td align="left">99,557</td>
<td align="left">126.5</td>
<td align="left"><bold>72.59</bold></td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x007B;2, 2&#x007D;</td>
<td align="left"></td>
<td align="left">99,614</td>
<td align="left">126.6</td>
<td align="left">72.09</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x007B;1, 2&#x007D;</td>
<td align="left"></td>
<td align="left">99,603</td>
<td align="left">126.5</td>
<td align="left">72.56</td>
</tr>
<tr>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">&#x007B;2, 1&#x007D;</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">219.6</td>
<td align="left"><bold>73.94</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-13"><label>Table 13</label><caption><title>Device-wise and class-wise accuracies of the proposed methods</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Class, Device/Models</th>
<th align="left">Baseline</th>
<th align="left">w/coordinate attention</th>
<th align="left">w/two-pass methods</th>
<th align="left">w/ensemble</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">A</td>
<td align="left">74.85</td>
<td align="left"><bold>76.36</bold></td>
<td align="left">74.85</td>
<td align="left">75.76</td>
</tr>
<tr>
<td align="left">B, C</td>
<td align="left">71.97</td>
<td align="left">70.76</td>
<td align="left"><bold>72.88</bold></td>
<td align="left">72.72</td>
</tr>
<tr>
<td align="left">S1&#x2013;S3</td>
<td align="left">69.60</td>
<td align="left">71.62</td>
<td align="left">72.22</td>
<td align="left"><bold>73.33</bold></td>
</tr>
<tr>
<td align="left">S4&#x2013;S6</td>
<td align="left">71.82</td>
<td align="left">71.21</td>
<td align="left">72.02</td>
<td align="left"><bold>74.75</bold></td>
</tr>
<tr>
<td align="left">Airport</td>
<td align="left">68.35</td>
<td align="left">60.94</td>
<td align="left"><bold>73.06</bold></td>
<td align="left"><bold>73.06</bold></td>
</tr>
<tr>
<td align="left">Bus</td>
<td align="left">81.48</td>
<td align="left">83.16</td>
<td align="left"><bold>85.86</bold></td>
<td align="left">85.52</td>
</tr>
<tr>
<td align="left">Metro</td>
<td align="left">76.77</td>
<td align="left">74.75</td>
<td align="left">76.09</td>
<td align="left"><bold>80.13</bold></td>
</tr>
<tr>
<td align="left">Metro station</td>
<td align="left"><bold>81.48</bold></td>
<td align="left">73.74</td>
<td align="left">69.36</td>
<td align="left">76.09</td>
</tr>
<tr>
<td align="left">Park</td>
<td align="left">82.49</td>
<td align="left"><bold>85.19</bold></td>
<td align="left">82.83</td>
<td align="left">83.50</td>
</tr>
<tr>
<td align="left">Public square</td>
<td align="left">56.90</td>
<td align="left">57.24</td>
<td align="left">62.63</td>
<td align="left"><bold>63.64</bold></td>
</tr>
<tr>
<td align="left">Shopping mall</td>
<td align="left">63.30</td>
<td align="left">68.01</td>
<td align="left"><bold>70.37</bold></td>
<td align="left">66.67</td>
</tr>
<tr>
<td align="left">Street pedestrian</td>
<td align="left">52.19</td>
<td align="left">52.53</td>
<td align="left"><bold>53.87</bold></td>
<td align="left">50.17</td>
</tr>
<tr>
<td align="left">Street traffic</td>
<td align="left">86.53</td>
<td align="left"><bold>88.55</bold></td>
<td align="left">80.81</td>
<td align="left">84.18</td>
</tr>
<tr>
<td align="left">Tram</td>
<td align="left">64.98</td>
<td align="left">74.07</td>
<td align="left">71.04</td>
<td align="left"><bold>76.43</bold></td>
</tr>
<tr>
<td align="left">Overall accuracy</td>
<td align="left">71.45</td>
<td align="left">71.82</td>
<td align="left">72.59</td>
<td align="left"><bold>73.94</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="table-14"><label>Table 14</label><caption><title>Performance comparison between the proposed model and previous CNN models</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Description</th>
<th align="left">w/data aug.</th>
<th align="left">w/weight quant.</th>
<th align="left">Params [K].</th>
<th align="left">Size [kB]</th>
<th align="left">Acc. [&#x0025;]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">DCASE 2021 baseline [<xref ref-type="bibr" rid="ref-27">27</xref>]</td>
<td align="left"></td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">90.3</td>
<td align="left">47.7</td>
</tr>
<tr>
<td align="left">EfficientNet-V2 [<xref ref-type="bibr" rid="ref-28">28</xref>]</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">62</td>
<td align="left">121.8</td>
<td align="left">70.5</td>
</tr>
<tr>
<td align="left">SE-ResNet [<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">113</td>
<td align="left">127.6</td>
<td align="left">70.2</td>
</tr>
<tr>
<td align="left">RF-regularized CNN [<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">64</td>
<td align="left">126.2</td>
<td align="left">69.5</td>
</tr>
<tr>
<td align="left">Shallow conformer [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td align="left">&#x221A;</td>
<td align="left"></td>
<td align="left">34</td>
<td align="left"></td>
<td align="left">61.25</td>
</tr>
<tr>
<td align="left">TP-MobNet (proposed)</td>
<td align="left">&#x221A;</td>
<td align="left">&#x221A;</td>
<td align="left">99</td>
<td align="left">126.5</td>
<td align="left"><bold>72.59</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The experimental results are presented in <xref ref-type="table" rid="table-10">Tab. 10</xref> by the method of data augmentation. Ablation studies were carried out using five distinct augmentation strategies, with data normalization used as a default. As a result of the experiment, the best performance was achieved when simply noise was excepted, with an accuracy of 70.30&#x0025;.</p>
<p>The experimental results are presented in <xref ref-type="table" rid="table-11">Tab. 11</xref> according to the MobNet hyperparameters. The performance of the mobile block was confirmed in particular based on the difference in the number of initial filters. To begin, the process of separating the spectrogram into two splits and feeding it to the network was eliminated to decrease parameters. The accuracy climbed to 70.81&#x0025; Tab when the number of filters was raised to 32. It was validated that the accuracy standard with 121.1 KB was increased to 71.45&#x0025; using the same number of filters (displayed as 32&#x002A;) as the model proposed in 1.</p>
<p>The performance evaluation of the proposed strategies is presented in <xref ref-type="table" rid="table-12">Tab. 12</xref>. Adding coordinate attention to MobNet (32&#x002A;) raised the accuracy to 71.82&#x0025;, and the proposed TP-MobNet with the two-pass methods and an early-fusion stride of (2, 1) reached 72.59&#x0025;. This was the best performance among the proposed single models, with 99,557 parameters and a size of 126.5 kB. Combining two models trained with different strides in an ensemble yielded an accuracy of 73.94&#x0025;.</p>
<p><xref ref-type="table" rid="table-13">Tab. 13</xref> shows performance device-wise and class-wise. The majority of the proposed approaches&#x2019; performances were confirmed to be better than the baseline. It was confirmed that the ensemble model performed better for S4&#x2013;S6, which is a previously unknown device.</p>
<p>Finally, <xref ref-type="table" rid="table-14">Tab. 14</xref> compares the proposed TP-MobNet with previous CNN models. The size of the single model is similar to that of the previous CNN models, while its accuracy is somewhat higher.</p>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusions</title>
<p>In this paper, we propose a two-pass mobile network for low-complexity classification of the acoustic scene, named TP-MobNet. The proposed TP-MobNet includes two-pass fusion techniques in a single model, as well as coordinate attention and weight quantization. Experiments on the TAU Urban Acoustic Scenes 2020 Mobile development set confirmed that our model, with a model size of 219.6 kB, obtained an accuracy of 73.94&#x0025;.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="other"><p><bold>Funding Statement:</bold> This work was supported by Institute of Information &#x0026; communications Technology Planning &#x0026; Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-0268, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]</p></fn>
<fn fn-type="conflict"><p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p></fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Sophiya</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Jothilakshmi</surname></string-name></person-group>, &#x201C;<article-title>Deep learning based audio scene classification</article-title>,&#x201D; in <conf-name>Proc. of Int. Conf. on Computational Intelligence, Cyber Security, and Computational Models (ICC3)</conf-name>, <conf-loc>Coimbatore, India</conf-loc>, pp. <fpage>98</fpage>&#x2013;<lpage>109</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Petetin</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Laroche</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Mayoue</surname></string-name></person-group>, &#x201C;<article-title>Deep neural networks for audio scene recognition</article-title>,&#x201D; in <conf-name>Proc. of European Signal Processing Conf. (EUSIPCO)</conf-name>, <conf-loc>Nice, France</conf-loc>, pp. <fpage>125</fpage>&#x2013;<lpage>129</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Vanhoucke</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Senior</surname></string-name> and <string-name><given-names>M. Z.</given-names> <surname>Mao</surname></string-name></person-group>, &#x201C;<article-title>Improving the speed of neural networks on CPUs</article-title>,&#x201D; in <conf-name>Proc. of Deep Learning and Unsupervised Feature Learning NIPS Workshop</conf-name>, <conf-loc>Granada, Spain</conf-loc>, vol. <volume>1</volume>, pp. <fpage>4</fpage>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Mu</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Zeng</surname></string-name></person-group>, &#x201C;<article-title>A review of deep learning research</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>13</volume>, no. <issue>4</issue>, pp. <fpage>1738</fpage>&#x2013;<lpage>1764</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K. J.</given-names> <surname>Piczak</surname></string-name></person-group>, &#x201C;<article-title>Environmental sound classification with convolutional neural networks</article-title>,&#x201D; in <conf-name>Proc. of 25th Int. Workshop on Machine Learning for Signal Processing (MLSP)</conf-name>, <conf-loc>Boston, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S. H.</given-names> <surname>Bae</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Choi</surname></string-name> and <string-name><given-names>N. S.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Acoustic scene classification using parallel combination of LSTM and CNN</article-title>,&#x201D; in <conf-name>Proc. of Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2016)</conf-name>, <conf-loc>Budapest, Hungary</conf-loc>, pp. <fpage>11</fpage>&#x2013;<lpage>15</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Jallet</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Cak&#x0131;r</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Virtanen</surname></string-name></person-group>, &#x201C;<article-title>Acoustic scene classification using convolutional recurrent neural networks</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2017)</conf-name>, Virtual, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Lim</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Park</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Kang</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Oh</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Convolutional neural network based audio event classification</article-title>,&#x201D; <source>KSII Transactions on Internet and Information Systems</source>, vol. <volume>12</volume>, no. <issue>6</issue>, pp. <fpage>2748</fpage>&#x2013;<lpage>2760</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Valenti</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Diment</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Parascandolo</surname></string-name></person-group>, &#x201C;<article-title>DCASE 2016 acoustic scene classification using convolutional neural networks</article-title>,&#x201D; in <conf-name>Proc. of Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2016)</conf-name>, <conf-loc>Budapest, Hungary</conf-loc>, pp. <fpage>95</fpage>&#x2013;<lpage>99</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Han</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Park</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name></person-group>, &#x201C;<article-title>Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2017)</conf-name>, Virtual, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A. G.</given-names> <surname>Howard</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Kalenicheonko</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>MobileNets: Efficient convolutional neural networks for mobile vision applications</article-title>,&#x201D; <comment>arXiv preprint arXiv:1704.04861</comment>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Sandler</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Howard</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Zhmoginov</surname></string-name> and <string-name><given-names>L. -C.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, Utah</conf-loc>, pp. <fpage>4510</fpage>&#x2013;<lpage>4520</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lin</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>ShuffleNet: An extremely efficient convolutional neural network for mobile devices</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, Utah</conf-loc>, pp. <fpage>6848</fpage>&#x2013;<lpage>6856</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>C. -H. H.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Xia</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Bai</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Device-robust acoustic scene classification based on two-stage categorization and data augmentation</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2020)</conf-name>, Virtual, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Heittola</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mesaros</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Virtanen</surname></string-name></person-group>, &#x201C;<article-title>Acoustic scene classification in DCASE 2020 challenge: Generalization across devices and low complexity solutions</article-title>,&#x201D; in <conf-name>Proc. of Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020)</conf-name>, <conf-loc>Tokyo, Japan</conf-loc>, pp. <fpage>56</fpage>&#x2013;<lpage>60</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Mesaros</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Heittola</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Virtanen</surname></string-name></person-group>, &#x201C;<article-title>A multi-device dataset for urban acoustic scene classification</article-title>,&#x201D; in <conf-name>Proc. of Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2018)</conf-name>, <conf-loc>Surrey, UK</conf-loc>, pp. <fpage>9</fpage>&#x2013;<lpage>13</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Valenti</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Squartini</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Diment</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Parascandolo</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Virtanen</surname></string-name></person-group>, &#x201C;<article-title>A convolutional neural network approach for acoustic scene classification</article-title>,&#x201D; in <conf-name>Proc. of Int. Joint Conf. on Neural Networks (IJCNN)</conf-name>, <conf-loc>Anchorage, Alaska</conf-loc>, pp. <fpage>1547</fpage>&#x2013;<lpage>1554</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Dorfer</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Lehner</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Eghbal-zadeh</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Christop</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Fabian</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Acoustic scene classification with fully convolutional neural network and i-vectors</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2018)</conf-name>, Virtual, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>Q.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Acoustic scene classification using lightweight ResNet with attention</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2021)</conf-name>, Virtual, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>McDonnell</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Gao</surname></string-name></person-group>, &#x201C;<article-title>Acoustic scene classification using deep residual network with late fusion of separated high and low frequency paths</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2019)</conf-name>, Virtual, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Q.</given-names> <surname>Hou</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhou</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Feng</surname></string-name></person-group>, &#x201C;<article-title>Coordinate attention for efficient mobile network design</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Nashville, USA</conf-loc>, pp. <fpage>13713</fpage>&#x2013;<lpage>13722</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Shen</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Squeeze-and-excitation networks</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, Utah</conf-loc>, pp. <fpage>7132</fpage>&#x2013;<lpage>7141</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Jacob</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Klgys</surname></string-name>, <string-name><given-names>B.</given-names> <surname>chen</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>tang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Quantization and training of neural networks for efficient integer-arithmetic-only inference</article-title>,&#x201D; in <conf-name>Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Salt Lake City, Utah</conf-loc>, pp. <fpage>2704</fpage>&#x2013;<lpage>2713</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="web"><comment>Google Inc. TensorFlow Lite. [Online].</comment> Available: <uri xlink:href="https://www. tensorflow.org/mobile/tflite">https://www. tensorflow.org/mobile/tflite</uri>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Cisse</surname></string-name>, <string-name><given-names>Y. N.</given-names> <surname>Dauphin</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Lopez-Paz</surname></string-name></person-group>, &#x201C;<article-title>Mixup: Beyond empirical risk minimization</article-title>,&#x201D; <comment>arXiv preprint arXiv:1710.09412</comment>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D. S.</given-names> <surname>Park</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Chan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Chiu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zoph</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>SpecAugment: A simple data augmentation method for automatic speech recognition</article-title>,&#x201D; in <conf-name>Proc. of ISCA Interspeech</conf-name>, <conf-loc>Graz, Austria</conf-loc>, pp. <fpage>2019</fpage>&#x2013;<lpage>2680</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Martin-Morato</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Heittola</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mesaros</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Virtanen</surname></string-name></person-group>, &#x201C;<article-title>Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems</article-title>,&#x201D; <comment>arXiv preprint arXiv:2105.13734</comment>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Verbitskiy</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Vyshegorodtsev</surname></string-name></person-group>, &#x201C;<article-title>Low-complexity acoustic scene classification using mobile inverted bottleneck blocks</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2021)</conf-name>, Virtual, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Byttebier</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Desplanques</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Thienpondt</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Demuynck</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Small-footprint acoustic scene classification through 8-bit quantization-aware training and pruning of ResNet models</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2021)</conf-name>, Virtual, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Koutini</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Jan</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Widmer</surname></string-name></person-group>, &#x201C;<article-title>Cpjku submission to decase21: Cross-device audio scene classification with wide sparse frequency-damped cnns</article-title>,&#x201D; <conf-name>Detection and Classification of Acoustic Scenes and Events (DCASE2021)</conf-name>, Virtual, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Seo</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Lee</surname></string-name> and <string-name><given-names>J. -H.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Shallow convolution-augmented transformer with differentiable neural computer for low-complexity classification of variable-length acoustic scene</article-title>,&#x201D; in <conf-name>Proc. of ISCA Interspeech</conf-name>, <conf-loc>Brno, Czech Republic</conf-loc>, pp. <fpage>576</fpage>&#x2013;<lpage>580</lpage>, <year>2021</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>