<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">73201</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.073201</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Design, Realization, and Evaluation of Faster End-to-End Data Transmission over Voice Channels</article-title>
<alt-title alt-title-type="left-running-head">Design, Realization, and Evaluation of Faster End-to-End Data Transmission over Voice Channels</alt-title>
<alt-title alt-title-type="right-running-head">Design, Realization, and Evaluation of Faster End-to-End Data Transmission over Voice Channels</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Huang</surname><given-names>Jian</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Mingwei</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Tian</surname><given-names>Yulong</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Yao</surname><given-names>Yi</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Han</surname><given-names>Hao</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><email>hhan@nuaa.edu.cn</email></contrib>
<aff id="aff-1"><label>1</label><institution>The College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics</institution>, <addr-line>Nanjing, 211106</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>The College of Command and Control Engineering, Army Engineering University of PLA</institution>, <addr-line>Nanjing, 210042</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Hao Han. Email: <email>hhan@nuaa.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>10</day><month>2</month><year>2026</year>
</pub-date>
<volume>87</volume>
<issue>1</issue>
<elocation-id>69</elocation-id>
<history>
<date date-type="received">
<day>12</day>
<month>09</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>12</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2026 The Authors.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_73201.pdf"></self-uri>
<abstract>
<p>With the popularization of new technologies, telephone fraud has become the main means of stealing money and personal identity information. Taking inspiration from the website authentication mechanism, we propose an end-to-end data modem scheme that transmits the caller&#x2019;s digital certificates through a voice channel for the recipient to verify the caller&#x2019;s identity. Encoding useful information through voice channels is very difficult without the assistance of telecommunications providers. For example, speech activity detection may quickly classify encoded signals as non-speech signals and reject input waveforms. To address this issue, we propose a novel modulation method based on linear frequency modulation that encodes 3 bits per symbol by varying its frequency, shape, and phase, alongside a lightweight MobileNetV3-Small-based demodulator for efficient and accurate signal decoding on resource-constrained devices. This method leverages the unique characteristics of linear frequency modulation signals, making them more easily transmitted and decoded in speech channels. To ensure reliable data delivery over unstable voice links, we further introduce a robust framing scheme with delimiter-based synchronization, a sample-level position remedying algorithm, and a feedback-driven retransmission mechanism. We have validated the feasibility and performance of our system through expanded real-world evaluations, demonstrating that it outperforms existing advanced methods in terms of robustness and data transfer rate. This technology establishes the foundational infrastructure for reliable certificate delivery over voice channels, which is crucial for achieving strong caller authentication and preventing telephone fraud at its root cause.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Deep learning</kwd>
<kwd>modulation</kwd>
<kwd>chirp</kwd>
<kwd>data over voice</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Phone scams and voice phishing (a.k.a. vishing) have seen a significant rise in recent years, largely fueled by the rapid advancement of evolutionary AI technologies and the increasing reliance on digital communication channels. Several studies have documented that complex phishing activities utilizing synthetic speech and social engineering are evolving [<xref ref-type="bibr" rid="ref-1">1</xref>,<xref ref-type="bibr" rid="ref-2">2</xref>]. As illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, scammers and fraudsters exploit emerging technologies and public trust in institutions to deceive individuals into believing they are interacting with legitimate organizations, financial entities, or service providers. A critical tactic employed by phone scammers is the use of fake caller IDs [<xref ref-type="bibr" rid="ref-3">3</xref>], which manipulate the displayed incoming number to mimic known and trusted sources, thereby increasing the likelihood of the call being answered. Once connected, scammers utilize sophisticated social engineering scripts to extract sensitive personal information or financial details. Furthermore, advancements in artificial intelligence have enabled the use of AI-driven chatbots [<xref ref-type="bibr" rid="ref-4">4</xref>] and highly convincing deepfake voice synthesis [<xref ref-type="bibr" rid="ref-5">5</xref>] to impersonate familiar contacts or authority figures. The accessibility of caller ID spoofing tools, such as those referenced in [<xref ref-type="bibr" rid="ref-6">6</xref>,<xref ref-type="bibr" rid="ref-7">7</xref>], allows malicious actors to easily disguise their identity by displaying any chosen number&#x2014;including emergency lines like 911&#x2014;further eroding trust in telephonic communication systems. Analysis of phishing trends indicates that traditional authentication mechanisms are insufficient against these emerging threats [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Motivation of our idea to prevent phone fraud</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-1.tif"/>
</fig>
<p>To stop phone scams, we believe it is essential to authenticate conversation parties over traditional telephone networks. This is similar to the Internet: when a user visits a website, the Secure Sockets Layer (SSL) certificate plays an important role in ensuring the authenticity of the website. However, the modern telephony infrastructure provides no means for a callee to reason about the caller&#x2019;s identity except the caller ID, which can be spoofed. We need <italic>the caller to have the capability to transmit its digital certificate to the callee for authentication</italic>. However, several challenges prevent us from achieving this goal, including:
<list list-type="bullet">
<list-item>
<p><bold>C1: Without the support of cellular carriers.</bold> The transmission of digital certificates should be end-to-end and compatible with existing infrastructure without relying on cellular carriers. For example, dial-up modems have been available for decades to transmit data through telephone lines. However, this approach does not work for mobile phones. That is because the baseband in a smartphone that helps convert digital data into radio frequency signals (and vice versa) is a black box for end users. It is challenging for users to implement their own dial-up modem on their smartphones without the support of vendors.</p></list-item>
<list-item>
<p><bold>C2: In case of NO Internet.</bold> Although mobile data offers an alternative solution to transfer data over cellular networks such as 4G/5G, it will incur an extra financial cost. GSMA Research [<xref ref-type="bibr" rid="ref-9">9</xref>] shows that 3.4 billion mobile consumers still cannot pay for the Internet despite living in areas with mobile data coverage. Also, data plans are unavailable in certain areas with weak signals.</p></list-item>
</list></p>
<p>In particular, enabling the caller to transmit its digital certificate directly over the voice channel provides a practical defense mechanism when Internet connectivity is absent. Unlike approaches that rely on mobile data or Wi-Fi, our vision is that the certificate can be embedded as digital data within the ongoing call itself. This allows the callee to authenticate the caller in real time, even in areas with limited network coverage or among populations unable to afford data plans. By reusing the ubiquitous voice channel as a carrier of authentication information, our system strengthens trust in telephony without imposing additional infrastructure requirements or financial burdens on end users.</p>
<p>Some studies have proposed approaches such as Hermes [<xref ref-type="bibr" rid="ref-10">10</xref>] and Authloop [<xref ref-type="bibr" rid="ref-11">11</xref>] to enable data transmission over the voice channel of cellular networks. However, those works perform poorly in our experiments with China Mobile networks and cannot reach a fast enough data rate to transmit digital certificates. After analyzing the collected signals, we found that those approaches stopped working after a short time (see <xref ref-type="sec" rid="s2">Section 2</xref> for details). A possible reason is that if we carry data into acoustic signals, those non-speech-like signals may be rejected by the Voice Activity Detector (VAD) used in the Discontinuous Transmission (DTX) within cellular telecommunication systems. In addition, the complex network infrastructure will distort signals transmitted from one subsystem to another, and the voice/speech codec may severely distort the encoded signals.</p>
<p>To this end, we revisit the long-standing idea of transmitting digital data over voice channels and present Fast Data Over Voice (FastDOV)&#x2014;a fast and reliable data transmission mechanism that enables the exchange of digital certificates between callers and callees, even in the absence of Internet connectivity. By empowering each party to authenticate the other during a phone call, FastDOV directly addresses the root cause of many phone scams: the inability of users to verify caller identities. Technically, FastDOV employs a chirp-based modulation/demodulation scheme that is resilient to distortions in complex telecommunication infrastructures, as chirp signals are well known for their robustness against channel noise. To further enhance decoding accuracy in weak signal environments, we integrate deep learning (DL) models that recover distorted chirp signals. In addition, we design a dedicated data link protocol incorporating stop/resume, time synchronization, and retransmission mechanisms to mitigate the impact of Voice Activity Detectors (VAD) and Discontinuous Transmission (DTX). We implemented a prototype of FastDOV on Commercial Off-The-Shelf (COTS) smartphones and evaluated it through extensive experiments. Results demonstrate that FastDOV achieves an average goodput of 1291.0 bit/s over diverse mobile networks, outperforming state-of-the-art approaches, while making real-time certificate transmission practical for reducing phone fraud.</p>
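To make the modulation idea concrete, the following toy encoder sketches how 3 bits could select a chirp symbol's frequency band, sweep direction (shape), and initial phase. All numeric values (sample rate, symbol length, bands) are our illustrative assumptions, not FastDOV's actual parameters.

```python
import numpy as np

# Illustrative 3-bit chirp symbol: bit 0 picks the frequency band,
# bit 1 picks the sweep direction (up/down "shape"), bit 2 picks the
# initial phase (0 or pi). All constants are assumed values chosen to
# fit the telephony band (roughly 300-3400 Hz), not the paper's.
FS = 8000                                      # sample rate (Hz)
T_SYM = 0.04                                   # symbol duration (s)
BANDS = [(500.0, 1500.0), (1800.0, 2800.0)]    # assumed low/high bands

def chirp_symbol(bits):
    """bits = (band_bit, direction_bit, phase_bit) -> one chirp symbol."""
    f_lo, f_hi = BANDS[bits[0]]
    if bits[1]:                  # down-chirp: swap the sweep endpoints
        f_lo, f_hi = f_hi, f_lo
    phase = np.pi * bits[2]      # initial phase 0 or pi
    t = np.arange(int(FS * T_SYM)) / FS
    # linear chirp: phase integral of instantaneous frequency
    inst = f_lo * t + (f_hi - f_lo) * t**2 / (2 * T_SYM)
    return np.sin(2 * np.pi * inst + phase)
```

Flipping only the phase bit negates the waveform, which is what makes the phase dimension separable at the demodulator.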
<p>The contributions of this work are summarized as follows:
<list list-type="bullet">
<list-item>
<p>We propose a novel DL-based acoustic scheme that can transmit data over mobile voice channels. The demodulation is robust to distortions and interruptions in cellular voice channels by modulating the frequency, shape, and phase of chirp signals.</p></list-item>
<list-item>
<p>Since the caller and the callee do not synchronize over voice channels, we design a cross-correlation-based method with a remedying algorithm to accurately determine when the data transmission begins and ends without using an external clock.</p></list-item>
<list-item>
<p>We present a working prototype of FastDOV and an extensive empirical study under various environmental factors such as weather conditions and noise impact at the transmitter and receiver. The evaluation results show that FastDOV achieves higher goodput than state-of-the-art approaches.</p></list-item>
</list></p>
<p>The rest of this paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> introduces the background and challenges. <xref ref-type="sec" rid="s3">Section 3</xref> reviews existing related work. <xref ref-type="sec" rid="s4">Section 4</xref> provides a detailed description of FastDOV&#x2019;s design. <xref ref-type="sec" rid="s5">Section 5</xref> presents the implementation and evaluation results with practical use cases. Finally, <xref ref-type="sec" rid="s7">Section 7</xref> concludes the paper.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Background and Challenges</title>
<p>In this section, we identify several challenges to achieving fast data transmission over a mobile voice channel and provide some background.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Rejection of Non-Voice-Like Signals</title>
<p>In cellular networks such as Global System for Mobile Communications (GSM) and Code-division Multiple Access (CDMA), Discontinuous Transmission (DTX) technology is widely used to stop transmitting signals when there is no voice signal transmission to reduce interference and improve the system&#x2019;s efficiency. Voice activity detection (VAD) is a technology to detect whether a voice signal exists. The presence of VAD/DTX can be crucial for transmitting any modulated signals over voice channels. Once the audio is determined as non-speech, it will be removed from the transmission, leading to the loss of the signal and a deterioration in the transmission performance [<xref ref-type="bibr" rid="ref-12">12</xref>].</p>
<p>To demonstrate, we implemented several existing approaches for data transmission over the voice channel, including Hermes [<xref ref-type="bibr" rid="ref-10">10</xref>] and Authloop [<xref ref-type="bibr" rid="ref-11">11</xref>], and tested them on China Mobile/Telecom/Unicom networks. The basic idea is to modulate a 0 or 1 bit by decrementing or incrementing the base frequency by a fixed delta and to transmit a sinusoid of these frequencies for 15 s. We observed that the audio signals received at the callee side were significantly weakened after 1&#x2013;2 s of transmission time. This phenomenon can be seen from the comparison between the spectrogram of the original signal and that of the signal at the receiver, as shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. When playing a human speech sound instead, we did not observe any filtering of the received signal.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Spectrogram comparison between modulated signals and normal human voice, where modulated signals were filtered out after 1-2 s by cellular networks</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-2.tif"/>
</fig>
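For reference, the baseline binary-FSK scheme we re-implemented for this test can be sketched as follows. The base frequency, frequency delta, and symbol duration are illustrative assumptions, not the exact values used by Hermes or Authloop.

```python
import numpy as np

# Sketch of the baseline scheme: a 0 or 1 bit maps to a sinusoid at
# F_BASE - DELTA or F_BASE + DELTA. All constants are assumed values.
FS = 8000          # telephony-band sample rate (Hz)
F_BASE = 1000.0    # assumed base frequency (Hz)
DELTA = 200.0      # assumed frequency offset per bit (Hz)
T_SYM = 0.05       # assumed symbol duration (s)

def fsk_modulate(bits):
    """Map each bit to one sinusoidal tone and concatenate them."""
    t = np.arange(int(FS * T_SYM)) / FS
    tones = [np.sin(2 * np.pi * (F_BASE + (DELTA if b else -DELTA)) * t)
             for b in bits]
    return np.concatenate(tones)

def fsk_demodulate(signal):
    """Decide each symbol by comparing FFT energy at the two candidate tones."""
    n = int(FS * T_SYM)
    bits = []
    for k in range(len(signal) // n):
        seg = signal[k * n:(k + 1) * n]
        spec = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(n, 1 / FS)
        e0 = spec[np.argmin(np.abs(freqs - (F_BASE - DELTA)))]
        e1 = spec[np.argmin(np.abs(freqs - (F_BASE + DELTA)))]
        bits.append(1 if e1 > e0 else 0)
    return bits
```

Over a clean channel this round-trips correctly; the point of Fig. 2 is that the cellular VAD/DTX path suppresses exactly such long constant-frequency tones.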
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Limited Frequency Band and Codec Effect</title>
<p>As mentioned in previous studies, not all frequencies in the voice channel of GSM and its successors, 4G/5G, are suitable for modulating data. Since these channels are designed to carry speech only, the energy that reflects the main characteristics of human speech is mainly concentrated in the range of 0.3&#x2013;3.4 kHz. The audio signal is band-pass filtered, and components outside this frequency range are automatically removed.</p>
<p>In addition, audio and speech codecs widely used in the voice channel will also distort the modulated audio signals that carry information significantly. Previous work [<xref ref-type="bibr" rid="ref-13">13</xref>] provides the reasons for the unpredictable distortions produced by speech codecs. Linear predictive coding (LPC) is used in audio/speech codecs to digitize voice. However, LPC is a lossy compression technique. Thus, the synthesized voice differs considerably from the original waveform, introducing additional distortion. In addition, linear prediction and differential coding for extracting speech parameters assume a high correlation between adjacent input samples. This assumption applies to speech but not to conventional data signals.</p>
<p>To demonstrate, we generated a chirp signal sweeping linearly from 0 to 24 kHz in our experiment. After measurements over cellular networks, we found that frequencies within the human speech range experience different levels of attenuation.</p>
<p><xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the attenuation ratio (<inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>u</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi></mml:math></inline-formula>) of the FFT (fast Fourier transform) magnitude at the receiver during mobile phone calls and air transmission. It can be seen that (1) the attenuation magnitude drops to 0 above approximately 4 kHz, which means the receiver can barely receive the signal, and (2) the red line in the figure indicates that sound above 3400 Hz can still propagate through the air.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>FFT of original/received signals</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-3.tif"/>
</fig>
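The measurement behind Fig. 3 can be reproduced in outline as follows. The band-limit function is a crude stand-in for the real voice channel, and all numeric parameters are simplifying assumptions for illustration.

```python
import numpy as np

# Sketch of the Section 2.2 measurement: send a linear chirp, then
# compare per-bin FFT magnitudes of the sent vs. received audio.
FS = 48000        # assumed sample rate covering the 0-24 kHz sweep
DURATION = 2.0    # assumed sweep length (s)

def linear_chirp(f0=0.0, f1=24000.0, fs=FS, dur=DURATION):
    """Linear chirp whose instantaneous frequency rises from f0 to f1."""
    t = np.arange(int(fs * dur)) / fs
    # phase(t) = 2*pi*(f0*t + (f1 - f0)*t^2 / (2*dur))
    return np.sin(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * dur)))

def attenuation_ratio(sent, received):
    """Per-bin attenuation = |FFT(received)| / |FFT(sent)|."""
    s = np.abs(np.fft.rfft(sent))
    r = np.abs(np.fft.rfft(received))
    return r / np.maximum(s, 1e-12)   # guard against division by zero

def toy_voice_channel(x, fs=FS, lo=300.0, hi=3400.0):
    """Toy stand-in for the channel: zero components outside 300-3400 Hz."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    X[(f < lo) | (f > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))
```

With the real channel in place of `toy_voice_channel`, plotting `attenuation_ratio` against frequency yields curves like those in Fig. 3.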
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Heterogeneous Telephony Infrastructure</title>
<p>Circuit-switched networks [<xref ref-type="bibr" rid="ref-14">14</xref>] were used in the first generation of wireless mobile communication in the 1980s. In the 2.5G era, the General Packet Radio Service (GPRS) [<xref ref-type="bibr" rid="ref-15">15</xref>] was added as a packet-based data service. In the 4G era, LTE (Long Term Evolution) emerged [<xref ref-type="bibr" rid="ref-16">16</xref>], with the core network EPC (Evolved Packet Core) eliminating the circuit domain. To support all these generations, the telephony architecture has become increasingly complex, with a considerable number of subsystems. The voice has to travel across multiple hops from the caller to the callee, and each intermediate link may use different protocols. Thus, the voice may be converted from analog to digital, then to analog, and back to digital; this process may be repeated at every hop until the destination is reached, resulting in unpredictable distortions.</p>
<p>We conducted experiments by transmitting the aforementioned chirp signal through one carrier (e.g., inside China Mobile) and multiple carriers (e.g., Mobile <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mo stretchy="false">&#x2192;</mml:mo></mml:math></inline-formula> Telecom), respectively. As shown by the FFT curves in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, the blue line (Mobile <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mo stretchy="false">&#x2192;</mml:mo></mml:math></inline-formula> Mobile) has the highest energy at all frequencies, indicating that the fewer carriers involved, the higher the signal quality.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Effect of heterogeneous telephony infrastructure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-4.tif"/>
</fig>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Related Work</title>
<sec id="s3_1">
<label>3.1</label>
<title>Acoustic Data Transmission in the Air</title>
<p>Data transmission using acoustic signals in the air typically uses near-ultrasonic sound waves or sometimes audible sound waves that exceed the voice channel&#x2019;s frequency range. Previous work [<xref ref-type="bibr" rid="ref-17">17</xref>] proposed a wireless keyboard link using binary Frequency-Shift Keying (FSK)-modulated ultrasonic signals. Work [<xref ref-type="bibr" rid="ref-18">18</xref>] implemented OOK (On-Off Keying) and binary FSK modulation schemes with wireless synchronization. Hush [<xref ref-type="bibr" rid="ref-19">19</xref>] used Orthogonal Frequency Division Multiplexing (OFDM) to send data over a 5&#x2013;20 cm distance between commercial smart mobile devices at frequencies of 16&#x2013;20 kHz. Work [<xref ref-type="bibr" rid="ref-20">20</xref>] used data sequence control and error correction algorithms to communicate high-frequency audio data. Work [<xref ref-type="bibr" rid="ref-21">21</xref>] proposed a method similar to FSK with 3 kHz as the base frequency to broadcast Wi-Fi information to customers through sound signals at a supported distance of 10 to 100 cm.</p>
<p>Existing work in this category mainly uses ultrasonic signals for data transmission without considering the constraints of mobile communication channels, such as narrow frequency bands and signal distortions caused by heterogeneous networks and speech codecs. Hence, these approaches are not suitable for data transmission over voice channels.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Data Transmission over Voice Channels</title>
<p>Data over voice is a technique based on encoding data signals into speech-like parameters, codebook training, or modulation techniques [<xref ref-type="bibr" rid="ref-22">22</xref>]. Previous work [<xref ref-type="bibr" rid="ref-23">23</xref>] implemented a real-time prototype that maps the input data onto line spectral frequencies, pitch frequency, and speech frame energy, which can facilitate encrypted data transmission using unknown encryption methods. Work [<xref ref-type="bibr" rid="ref-24">24</xref>] used a similar mapping but applied the FR (Full Rate) vocoder, which has a short time delay and good interoperability. Work [<xref ref-type="bibr" rid="ref-25">25</xref>] encoded the data as a set of predefined signals called &#x201C;symbols&#x201D;, which are synthesized by genetic algorithms. Work [<xref ref-type="bibr" rid="ref-26">26</xref>] proposed PCCD-OFDM-ASK (Phase-Continuous Context-Dependent Orthogonal Frequency-Division Multiplexing Amplitude Shift Keying), combining phase-continuous context-dependent coding with OFDM to achieve robust and reliable data transmission over GSM or CDMA (Code Division Multiple Access) speech channels. Hermes [<xref ref-type="bibr" rid="ref-10">10</xref>] can transmit data over unknown voice channels using the idea of frequency-shift keying in the voice channel. Work [<xref ref-type="bibr" rid="ref-27">27</xref>] used a single codebook to transmit the voice and designed an efficient low-bit-rate speech coder. Work [<xref ref-type="bibr" rid="ref-28">28</xref>] synthesized waveform symbols using sinusoidal signals and designed a learning algorithm that obtains demodulation codebooks online, constructing a general implementation scheme for DoV; it also proposes an offline analysis algorithm based on surface packaging to optimize the modulation codebook, achieving good symbol error rate (SER) performance. Work [<xref ref-type="bibr" rid="ref-22">22</xref>] proposed a DoV technology based on a short harmonic waveform codebook, which relies on linear predictive coding (LPC) speech compression and offers a high transmission rate and robustness. AuthLoop [<xref ref-type="bibr" rid="ref-11">11</xref>] provided a strong cryptographic authentication protocol inspired by Transport Layer Security (TLS) 1.2 to determine the identity of the entity at the other end of the call (i.e., the caller ID). Work [<xref ref-type="bibr" rid="ref-29">29</xref>] proposed a modulation algorithm based on Frequency Modulation (FM) that can convert encrypted voice data into a waveform conforming to GSM voice channel specifications; however, its actual modulator bit rate does not allow real-time communication. The overview in [<xref ref-type="bibr" rid="ref-30">30</xref>] provides a more detailed summary of improvements and designs of modulation, dividing the methods into three categories: parameter mapping, codebook optimization, and modulation optimization.</p>
<p>It is also worth distinguishing our work from another class of solutions that focus on post-answer detection by analyzing speech content using on-device models [<xref ref-type="bibr" rid="ref-31">31</xref>] or large language models [<xref ref-type="bibr" rid="ref-32">32</xref>]. While effective for content analysis, these methods intervene <italic>after</italic> the call has been connected and trust established. In contrast, FastDOV aims to provide pre-answer authentication by transmitting a verifiable certificate before the conversation begins, addressing the threat at an earlier stage.</p>
<p>Compared to the above work, most approaches require the help of telephony service providers, while our work is a pure end-to-end solution. Our FastDOV also incorporates deep learning technology into modulation and demodulation to improve the robustness of data transmission.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>System Design</title>
<p>To address the challenges mentioned in <xref ref-type="sec" rid="s2">Section 2</xref>, we propose FastDOV, an end-to-end data modem on top of existing (unmodified) cellular voice channels. The overview of FastDOV is presented in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>, illustrating the workflow between a transmitter and a receiver. The transmitter first divides the data under transmission into multiple frames according to a pre-defined size. A specially designed delimiter is inserted at the beginning of each frame and the end of the last frame. Each frame is appended with an error correction code and converted into an audio signal using a chirp-based modulation. The modulated chirps are then transmitted over the cellular voice channel.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>System framework</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-5.tif"/>
</fig>
<p>When the phone call is answered, the receiver relies on the delimiter to locate each frame&#x2019;s beginning and end within the received audio stream. These frames are demodulated and decoded by a Deep Learning (DL) model, followed by the error correction check. If any error is found, a feedback mechanism is proposed to notify the sender to retransmit the frames that are in error. Lastly, all successful frames are combined to restore the original data.</p>
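The receiver-side check-and-retransmit step can be sketched as follows, using a simple even-parity check as a hypothetical stand-in for the actual error correction code; `check_frames` and the frame layout are our illustrative assumptions.

```python
# Hedged sketch of the receiver loop: verify each decoded frame and
# collect the indices of bad frames to request for retransmission.
# The even-parity scheme below is an assumed stand-in, not FastDOV's
# actual error correction code.
def check_frames(frames):
    """frames: {index: bit_string ending in one parity bit}.
    Return (good_payloads, retransmit_indices) under even parity."""
    good, retransmit = {}, []
    for idx, bits in frames.items():
        payload, parity = bits[:-1], bits[-1]
        if payload.count("1") % 2 == int(parity):   # even parity holds
            good[idx] = payload
        else:
            retransmit.append(idx)                  # feedback: resend this frame
    return good, sorted(retransmit)
```

The sorted retransmit list would be sent back to the transmitter over the same voice channel, after which the restored payloads are concatenated in index order.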
<sec id="s4_1">
<label>4.1</label>
<title>Framing Scheme</title>
<p>In FastDOV, data are divided into multiple <italic>frames</italic> and transmitted over a voice channel. The format of a data frame is presented in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>. A specially designed signal is inserted to separate each data frame, and we call it a <italic>delimiter</italic>. A data frame comprises multiple <italic>symbol groups</italic> (SG), each of which contains several <italic>symbols</italic> carrying a fixed length of data bits and parity bits for error correction. To avoid the impact of VAD and DTX mentioned in <xref ref-type="sec" rid="s2">Section 2</xref>, we add a <italic>gap</italic> between consecutive SGs. Such a gap is an empty signal without carrying any information, aiming to remove the memory of the VAD/DTX algorithm.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>The data format of frames and symbols</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-6.tif"/>
</fig>
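Assuming illustrative sizes, the framing layout of Fig. 6 can be sketched symbolically; the frame and symbol-group sizes, delimiter placeholder, and gap placeholder below are all our assumptions.

```python
# Symbolic sketch of the framing scheme: each frame is
# [delimiter][SG gap SG gap ...], and a trailing delimiter marks the
# end of the last frame. FRAME_BITS/SG_BITS are made-up sizes.
FRAME_BITS = 48      # assumed payload bits per frame
SG_BITS = 12         # assumed data bits per symbol group

def build_frames(bits, delimiter="DLM", gap="_"):
    """Split a bit string into delimiter-prefixed frames of symbol groups."""
    frames = []
    for i in range(0, len(bits), FRAME_BITS):
        payload = bits[i:i + FRAME_BITS]
        sgs = [payload[j:j + SG_BITS] for j in range(0, len(payload), SG_BITS)]
        # the gap between consecutive SGs is an empty signal intended to
        # reset the memory of the VAD/DTX algorithm (Section 4.1)
        frames.append(delimiter + gap.join(sgs))
    return "".join(frames) + delimiter   # trailing delimiter ends transmission
```

In the real system the delimiter and gap are waveforms rather than strings, but the layout is the same.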
<p><bold>Determining the start/end of transmission.</bold> We designed a unique chirp as a delimiter, inserted at the start of every frame and appended to the last frame to indicate when the data frames begin and end. The goal is to ensure synchronization between the receiver and transmitter. Furthermore, the inserted delimiter is also treated as a guard to protect our data signals. That is because the sudden energy change on a voice channel may cause signal fluctuation, so the frontmost and backmost data signals may not be received completely. Adding delimiters can reduce the loss of data frames.</p>
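A minimal sketch of this delimiter search scans a sliding window over the received stream and scores each offset with the normalized (Pearson) correlation against the known delimiter; the helper name `find_delimiter` and the threshold value are our assumptions.

```python
import numpy as np

# Hedged sketch of delimiter localization: slide the known delimiter
# over the received stream, compute the sample correlation coefficient
# in each window, and take the peak as the delimiter position.
def find_delimiter(stream, delimiter, threshold=0.5):
    """Return the sample offset with the highest correlation, or -1."""
    m = len(delimiter)
    v = delimiter - delimiter.mean()
    v_norm = np.sqrt((v * v).sum())
    best_r, best_pos = 0.0, -1
    for pos in range(len(stream) - m + 1):
        u = stream[pos:pos + m]
        u = u - u.mean()
        denom = np.sqrt((u * u).sum()) * v_norm
        if denom == 0:
            continue                       # silent window: no correlation
        r = (u * v).sum() / denom
        if r > best_r:
            best_r, best_pos = r, pos
    return best_pos if best_r >= threshold else -1
```

A production version would use FFT-based correlation rather than this O(n&#x00B7;m) loop, but the peak-picking logic is the same.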
<p>To detect the exact position of a delimiter on the receiver side, we adopt a cross-correlation-based method, where the known delimiter signal is correlated with the received audio stream in a sliding window. Let the received signal stream as <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> where <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula> and each <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is an audio sample. The delimiter emitted by the sender is denoted as <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> where <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mi>n</mml:mi><mml:mo>&#x003E;&#x003E;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>. 
A sliding window with a length equal to <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:mi>m</mml:mi></mml:math></inline-formula> is extracted from <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> using matched filtering. The sample correlation coefficient <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>r</mml:mi></mml:math></inline-formula> is computed as follows:
<disp-formula id="ueqn-1"><mml:math id="mml-ueqn-1" display="block"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msqrt><mml:msqrt><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></disp-formula>where <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> are the sample means of the sliding window and <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, respectively. For each <italic>r</italic>, Welford&#x2019;s one-pass algorithm achieves a computational complexity of <italic>O</italic>(<italic>m</italic>). As the sliding window moves from the beginning to the end of the audio stream sample by sample, computing all the coefficients takes <italic>O</italic>(<italic>nm</italic>) in total. A large value of <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:mi>r</mml:mi></mml:math></inline-formula> indicates a high similarity between the two sequences, so the position of the delimiter is found at the maximal peak of these coefficients.</p>
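<p>The sliding-window correlation search above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper&#x2019;s implementation; the function name and its parameters are ours:</p>

```python
import numpy as np

def correlation_peak(stream, delimiter):
    """Slide the known delimiter over the received stream and return the
    offset with the largest sample correlation coefficient r.

    A minimal O(n*m) sketch of the search described above; a real receiver
    would combine this with Welford-style running sums and multithreading.
    """
    n, m = len(stream), len(delimiter)
    v = delimiter - delimiter.mean()
    v_norm = np.sqrt(np.sum(v * v))
    best_r, best_pos = -1.0, 0
    for start in range(n - m + 1):
        u = stream[start:start + m]
        u = u - u.mean()
        denom = np.sqrt(np.sum(u * u)) * v_norm
        if denom == 0.0:
            continue  # silent window: correlation undefined, skip it
        r = np.sum(u * v) / denom
        if r > best_r:
            best_r, best_pos = r, start
    return best_pos, best_r
```

<p>As a quick self-check, embedding a chirp delimiter at a known offset inside Gaussian noise and running <monospace>correlation_peak</monospace> recovers that offset with a coefficient close to 1.</p>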
<p>To reduce the computation time, we first locate a delimiter&#x2019;s approximate position <italic>c</italic> by computing the coefficient window by window. Within the range <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mo stretchy="false">[</mml:mo><mml:mi>c</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>+</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, we then perform a fine-grained correlation with the sliding step set to 1 to locate the exact position. Welford&#x2019;s one-pass algorithm ensures numerical stability during this computation (see <xref ref-type="sec" rid="s9">Appendix A.2</xref> for details). Additionally, we use multiple threads to compute the coefficients in parallel.</p>
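<p>The two-stage search can be sketched as follows. This toy version assumes the delimiter is strong enough to register in the coarse, window-by-window pass; the names and structure are illustrative, not taken from the paper&#x2019;s code:</p>

```python
import numpy as np

def coarse_to_fine_locate(stream, delimiter):
    """Two-stage delimiter search: a coarse pass hops window by window
    (stride m) to find an approximate position c, then a fine pass slides
    sample by sample only inside [c - m, c + m]."""
    m = len(delimiter)
    v = delimiter - delimiter.mean()
    v_norm = np.sqrt(np.sum(v * v))

    def r_at(start):
        u = stream[start:start + m]
        u = u - u.mean()
        denom = np.sqrt(np.sum(u * u)) * v_norm
        return np.sum(u * v) / denom if denom else -1.0

    # Stage 1: one correlation per non-overlapping window.
    c = max(range(0, len(stream) - m + 1, m), key=r_at)
    # Stage 2: stride-1 refinement around the coarse hit.
    lo, hi = max(0, c - m), min(len(stream) - m, c + m)
    return max(range(lo, hi + 1), key=r_at)
```
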
<p><bold>Remedying inaccurate frame positions.</bold> Due to unpredictable channel noise, cross-correlation sometimes may not locate delimiters accurately, so the derived data frame does not have the pre-defined length. Suppose each data frame contains <italic>k</italic> audio samples at a certain sampling rate, and two delimiters surrounding the frame are positioned at sample indices <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>. Ideally, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula> should equal <italic>k</italic>, where <italic>m</italic> is the length of the delimiter. Our experiments confirm that <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>m</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> where <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> may not always equal to zero. 
Hence, we need to adjust <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> to remedy this <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula>.</p>
<p>Algorithm 1 presents our remedying algorithm. First, we set selected symbols in the data frame to a fixed value so the receiver knows these symbols beforehand. Next, we try every possible adjustment from <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo>+</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> for <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and select one that provides the highest accuracy of decoding selected symbols. Thus, FastDOV can tolerate signal distortion to some extent. If <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> is found beyond a threshold, the receiver discards this frame and asks for retransmission. 
The detailed procedure of decoding a symbol is presented in <xref ref-type="sec" rid="s4_3">Section 4.3</xref>.</p>
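<p>The remedying step of Algorithm 1 can be sketched as follows. Here <monospace>decode_symbol(frame, idx)</monospace> and <monospace>pilots</monospace> (a map from symbol index to the pre-agreed fixed value) are placeholders for the paper&#x2019;s decoder and selected symbols, and <monospace>max_delta</monospace> is an illustrative retransmission threshold:</p>

```python
def remedy_frame(samples, d1, d2, m, k, decode_symbol, pilots, max_delta=8):
    """Try every candidate alignment from (d1, d2 - delta) to (d1 + delta, d2)
    and keep the frame whose known pilot symbols decode best.

    `decode_symbol` and `pilots` stand in for the paper's symbol decoder and
    fixed symbols; they are hypothetical names, not FastDOV's API.
    """
    delta = (d2 - d1 - m) - k
    if abs(delta) > max_delta:
        return None  # frame too distorted: discard and request retransmission
    best_frame, best_score = None, -1
    for i in range(abs(delta) + 1):
        start = d1 + m + (i if delta >= 0 else -i)
        frame = samples[start:start + k]
        score = sum(decode_symbol(frame, idx) == val for idx, val in pilots.items())
        if score > best_score:
            best_frame, best_score = frame, score
    return best_frame
```
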
<fig id="fig-12">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-12.tif"/>
</fig>
<p><bold>Adding error correction code.</bold> Since some bits may be transmitted incorrectly, FastDOV uses a Reed-Solomon (RS) code [<xref ref-type="bibr" rid="ref-33">33</xref>] for error correction, whose parity symbols are appended at the end of each symbol group.</p>
<p>We employ RS codes with parameters <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>255</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>223</mml:mn><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>16</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> to provide robust error correction capability, enabling the recovery of up to 16 symbol errors. The encoded data maintains compatibility with voice channel characteristics while ensuring reliable transmission. Detailed mathematical derivations are provided in <xref ref-type="sec" rid="s8">Appendix A.1</xref>.</p>
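<p>The bookkeeping implied by these parameters can be sketched as follows (the GF(2<sup>8</sup>) encoder itself is derived in Appendix A.1; this fragment only shows how the block parameters relate):</p>

```python
def rs_block_layout(payload_len, n=255, k=223):
    """Number of RS(n, k) blocks needed for a payload and the total encoded
    size; each block carries k data bytes plus n - k parity bytes."""
    blocks = -(-payload_len // k)  # ceiling division
    return blocks, blocks * n

# n - k = 32 parity symbols correct up to t = (n - k) // 2 = 16 symbol errors.
t = (255 - 223) // 2
blocks, on_air = rs_block_layout(1000)  # e.g., a 1000-byte payload
```

<p>The code rate is 223/255 &#x2248; 0.87, i.e., roughly 12.5% of the on-air bytes are parity overhead.</p>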
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Chirp-Based Modulation Scheme</title>
<p>Modulation and demodulation refer to the process of altering the carrier signal to contain information to be transmitted and vice versa. There are common modulation methods widely used in practice, including On-Off Keying (OOK), Amplitude-Shift Keying (ASK), Frequency-Shift Keying (FSK), and Phase-Shift Keying (PSK). In OOK, a 0/1 bit is defined by the presence or absence of the carrier signal. ASK changes the amplitude of a carrier signal to encode different bits of information. Since OOK and ASK can be easily affected by signal distortion, they are unsuitable for data modulation over voice channels. In FSK, the frequency of a carrier signal with constant amplitude switches as the input bitstream changes. PSK is a phase modulation method that switches the carrier phase between different values based on the level of a digital baseband signal. Due to the tendency of FSK and PSK to cause discontinuity in the input signal, they do not resemble sound. The codec can severely distort such signals, making them unsuitable for data modulation over voice channels.</p>
<p>In FastDOV, we adopt a chirp signal rather than sinusoidal waves used in [<xref ref-type="bibr" rid="ref-10">10</xref>] and [<xref ref-type="bibr" rid="ref-28">28</xref>] for modulation. Due to its strong anti-interference feature, Chirp has a wide range of applications in communication, sonar, radar, and other fields. Our experiments show that a linear-frequency chirp is enough to tolerate noise and signal distortion on the cellular voice channel. In a linear chirp, the instantaneous frequency <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> varies strictly linearly with time <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>t</mml:mi></mml:math></inline-formula> where <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>c</mml:mi></mml:math></inline-formula> is the chirp rate, and <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the starting frequency. The corresponding time-domain function for a linear chirp is expressed as follows in radians:
<disp-formula id="ueqn-2"><mml:math id="mml-ueqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the initial phase at the time <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. The complete mathematical analysis of chirp signals, including frequency modulation characteristics and spectral properties, is provided in <xref ref-type="sec" rid="s10">Appendix A.3</xref>. Based on the above equation, we empirically modulate 3-bit information by varying the frequency, shape, and phase of our chirp signal. It should be noted that we intentionally do not choose amplitude for modulation because previous work [<xref ref-type="bibr" rid="ref-10">10</xref>] has already demonstrated that the amplitude of the received signal might be quite different from that of the input and vary unpredictably. 
Our modulation is not limited to 1 bit each for frequency, shape, and phase; however, our real-world experiments show that this setting works best in practice.</p>
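<p>Sampling the time-domain equation above is straightforward. The following sketch generates a linear chirp; the 8 kHz sampling rate and 25 ms symbol duration are illustrative values of ours, not parameters stated in this section:</p>

```python
import numpy as np

def linear_chirp(f0, f1, duration, fs, phi0=0.0):
    """Sample x(t) = sin(phi0 + 2*pi*((c/2)*t**2 + f0*t)) with chirp rate
    c = (f1 - f0) / duration, following the time-domain equation above."""
    t = np.arange(int(round(duration * fs))) / fs
    c = (f1 - f0) / duration
    return np.sin(phi0 + 2 * np.pi * (0.5 * c * t**2 + f0 * t))
```

<p>For example, <monospace>linear_chirp(400.0, 1400.0, 0.025, 8000)</monospace> produces a 200-sample symbol whose instantaneous frequency rises linearly from 0.4 to 1.4 kHz.</p>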
<p><bold>Frequency.</bold> As mentioned in <xref ref-type="sec" rid="s2_2">Section 2.2</xref>, the voice channel responds differently to the signals in certain frequencies. We propose to use different frequency ranges to encode bits 0 and 1, respectively. As shown in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, bit 0 is represented by the frequency range of <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mo stretchy="false">[</mml:mo><mml:mn>0.4</mml:mn><mml:mo>,</mml:mo><mml:mn>1.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (i.e., <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:math></inline-formula>), and bit 1 is modulated by the range of <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mo stretchy="false">[</mml:mo><mml:mn>2.4</mml:mn><mml:mo>,</mml:mo><mml:mn>3.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> (i.e., <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>2.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:math></inline-formula>). 
These two ranges are chosen because they cover the first and second formants and the fourth formant of speech [<xref ref-type="bibr" rid="ref-24">24</xref>], respectively.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thinmathspace" /><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>2.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><bold>Shape.</bold> In either frequency range, we can switch the starting frequency <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> and finish frequency <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> to encode additional 1-bit information. In other words, we change the slope of the chirp without changing its frequency band. The up-chirp (e.g., <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:mn>0.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>1.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mn>2.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">&#x2192;</mml:mo><mml:mn>3.4</mml:mn><mml:mspace width="thinmathspace" /><mml:mrow><mml:mrow><mml:mi mathvariant="normal">k</mml:mi><mml:mi mathvariant="normal">H</mml:mi><mml:mi mathvariant="normal">z</mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>) represents bit 0, and the down-chirp is for bit 1. The equation for our shape modulation is shown in <xref ref-type="disp-formula" rid="eqn-2">Eq. 2</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>c</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><bold>Phase.</bold> Without changing the frequency and shape of a chirp signal, we can modulate an extra bit by using different initial phases. According to <xref ref-type="disp-formula" rid="eqn-3">Eq. 3</xref>, whether the signal carries 0 or 1 depends on whether the initial phase <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> equals zero. Thanks to the proposed DL-based demodulation scheme, we can reliably detect the phase difference of each symbol.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>&#x2260;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
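<p>Putting Eqs. (1)&#x2013;(3) together, a 3-bit symbol maps to one chirp as sketched below. The 25 ms duration and the choice of &#x3C0; as the nonzero initial phase are illustrative assumptions of ours; Eq. (3) only requires a nonzero phase for bit 1:</p>

```python
import numpy as np

def modulate_symbol(bits, fs=8000, dur=0.025):
    """Map one 3-bit symbol to a chirp: the first bit picks the band
    (Eq. 1), the second the slope sign (Eq. 2), the third the initial
    phase (Eq. 3)."""
    b_freq, b_shape, b_phase = bits
    f_lo, f_hi = (400.0, 1400.0) if b_freq == 0 else (2400.0, 3400.0)  # Eq. (1)
    f0, f1 = (f_lo, f_hi) if b_shape == 0 else (f_hi, f_lo)            # Eq. (2)
    phi0 = 0.0 if b_phase == 0 else np.pi                              # Eq. (3)
    t = np.arange(int(round(dur * fs))) / fs
    c = (f1 - f0) / dur
    return np.sin(phi0 + 2 * np.pi * (0.5 * c * t**2 + f0 * t))
```
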
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>DL-Based Demodulation Scheme</title>
<p>Since our modulation scheme is based on frequency, shape, and phase, a straightforward method is to demodulate every bit of the symbol separately. For example, we can calculate the symbol&#x2019;s FFT to determine its frequency range. We can compare the FFTs of the signal&#x2019;s first and second halves to distinguish an up-chirp from a down-chirp: if the peak frequency of the first half is lower than that of the second half, the chirp is up; otherwise, it is down. To compare initial phases, we can compute the complete FFT with its real and imaginary parts. However, this approach is too idealized and does not work well in practice due to signal distortion in voice channels. Instead, we propose using deep learning (DL) to improve demodulation accuracy.</p>
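<p>The &#x201C;straightforward&#x201D; demodulator just described can be sketched as a baseline for the frequency and shape bits (FastDOV replaces it with the DL model below; the band edges follow Eq. (1), everything else is our illustration):</p>

```python
import numpy as np

def naive_demod(symbol, fs=8000):
    """Baseline per-bit demodulator: band energy decides the frequency bit,
    comparing the peak bins of the two halves decides the shape bit."""
    spec = np.abs(np.fft.rfft(symbol))
    freqs = np.fft.rfftfreq(len(symbol), 1.0 / fs)
    low = spec[(freqs >= 400) & (freqs < 1400)].sum()
    high = spec[(freqs >= 2400) & (freqs < 3400)].sum()
    b_freq = int(high > low)
    half = len(symbol) // 2
    first = np.abs(np.fft.rfft(symbol[:half])).argmax()
    second = np.abs(np.fft.rfft(symbol[half:])).argmax()
    b_shape = int(first > second)  # down-chirp: the first half sits higher
    return b_freq, b_shape
```
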
<p><bold>Feature extraction.</bold> The features selected by FastDOV include time-domain, frequency-domain, and phase-angle features of signals. Given a sampling rate <italic>Fs</italic>, a symbol duration <italic>t</italic>, and a received symbol <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>t</mml:mi></mml:math></inline-formula>, the feature vector <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula> used by our demodulation model is calculated as follows:
<list list-type="bullet">
<list-item>
<p><italic>Time-domain features</italic> refer to signal characteristics that change over time. For these features, we directly use each sample value in the time domain.
<disp-formula id="ueqn-6"><mml:math id="mml-ueqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x22EF;</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x22EF;</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
</list-item>
<list-item>
<p><italic>Frequency-domain features</italic> utilize the output of the Fourier transform. The first part contains the magnitude of each frequency bin of the entire signal after the <italic>FFT</italic>. The second and third parts are calculated similarly but independently from the first and second halves of the signal. Each part contains <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>n</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> values, where <italic>n</italic> is the number of time-domain points fed into the <italic>FFT</italic>.
<disp-formula id="ueqn-7"><mml:math id="mml-ueqn-7" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>T</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>T</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd
/><mml:mtd><mml:mi></mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>T</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo fence="false" stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
</list-item>
<list-item>
<p><italic>Phase angle features</italic> refer to the angle difference between a particular moment in a cycle and a reference moment. In signal processing, the phase angle describes the phase difference of a signal. We calculate the phase feature using the four-quadrant arctangent function <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, defined in <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref>.
<disp-formula id="ueqn-8"><mml:math id="mml-ueqn-8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi></mml:mi><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mrow><mml:mn>3</mml:mn><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mtext>&#xA0;</mml:mtext><mml:mo 
stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
</list-item>
</list>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left" rowspacing=".2em" columnspacing="1em" displaystyle="false"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd><mml:mtd><mml:mi>b</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x03C0;</mml:mi></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x2265;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>a</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>b</mml:mi><mml:mo 
stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C0;</mml:mi></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mtd><mml:mtd><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math></disp-formula></p>
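The piecewise definition in Eq. (4) coincides with the standard four-quadrant arctangent. As a quick sanity check (the helper name `atan2_piecewise` is ours, not the paper's), the definition can be compared against Python's library implementation:

```python
import math

def atan2_piecewise(a, b):
    """Four-quadrant arctangent following the piecewise definition of Eq. (4)."""
    if b > 0:
        return math.atan(a / b)
    if a >= 0 and b < 0:
        return math.atan(a / b) + math.pi
    if a < 0 and b < 0:
        return math.atan(a / b) - math.pi
    if a > 0 and b == 0:
        return math.pi / 2
    if a < 0 and b == 0:
        return -math.pi / 2
    raise ValueError("atan2(0, 0) is undefined")

# Cross-check against the library implementation in every quadrant.
for a, b in [(1, 1), (1, -1), (-1, -1), (-1, 1), (2, 0), (-2, 0), (0, -1)]:
    assert math.isclose(atan2_piecewise(a, b), math.atan2(a, b))
```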
<p><bold>Data preprocessing.</bold> In linear regression models, it is generally required that the linear correlation between features is low and the number of features is less than the sample size. Otherwise, it may lead to problems such as increased variance and overfitting of parameter estimates. Therefore, our model uses PCA (Principal Component Analysis) [<xref ref-type="bibr" rid="ref-34">34</xref>] to process the data inputs. PCA is a commonly used linear dimensionality reduction method that maps n-dimensional features onto lower-dimensional k-dimensions through a certain linear projection. We selected PCA because it can not only mitigate the overfitting problem but also reduce the computational cost.</p>
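As an illustration, the PCA step can be sketched with a plain SVD; the helper name `pca_reduce` and the example feature dimensions are ours (the target dimensionality of 100 follows the model parameters given below), so this is a sketch of the idea rather than the authors' exact pipeline:

```python
import numpy as np

def pca_reduce(X, k):
    """Project n-dimensional feature rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # (samples, k) low-dimensional features

# Example: reduce 524-dimensional symbol features to 100 dimensions.
X = np.random.randn(1000, 524)
Z = pca_reduce(X, 100)
assert Z.shape == (1000, 100)
```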
<p><bold>DL-based demodulator.</bold> We use the MobileNetV3-Small [<xref ref-type="bibr" rid="ref-35">35</xref>] model to demodulate the received signal in FastDOV. The MobileNetV3-Small model is a lightweight deep neural network proposed by Google for embedded devices such as smartphones [<xref ref-type="bibr" rid="ref-36">36</xref>], aiming to improve mobile device efficiency and performance. In our design, MobileNetV3-Small will take the signals processed by PCA as the input and generate classification results (the eight classes in <xref ref-type="table" rid="table-1">Table 1</xref>).</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Symbol modulation and encoding table</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Symbol</th>
<th>Frequency (<italic>f</italic>)</th>
<th>Shape</th>
<th>Phase (<inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>)</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td><inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>400</mml:mn><mml:mo>,</mml:mo><mml:mn>1400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Up</td>
<td><inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>001</td>
<td><inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>400</mml:mn><mml:mo>,</mml:mo><mml:mn>1400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Up</td>
<td><inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>187</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>010</td>
<td><inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1400</mml:mn><mml:mo>,</mml:mo><mml:mn>400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Down</td>
<td><inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>011</td>
<td><inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>1400</mml:mn><mml:mo>,</mml:mo><mml:mn>400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Down</td>
<td><inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>86</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>100</td>
<td><inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>2400</mml:mn><mml:mo>,</mml:mo><mml:mn>3400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Up</td>
<td><inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>101</td>
<td><inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>2400</mml:mn><mml:mo>,</mml:mo><mml:mn>3400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Up</td>
<td><inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>271</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>110</td>
<td><inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>3400</mml:mn><mml:mo>,</mml:mo><mml:mn>2400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Down</td>
<td><inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></td>
</tr>
<tr>
<td>111</td>
<td><inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>3400</mml:mn><mml:mo>,</mml:mo><mml:mn>2400</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula></td>
<td>Down</td>
<td><inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mn>90</mml:mn></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
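As an illustrative sketch of the modulation in Table 1 (not the authors' exact waveform code), each 3-bit symbol can be generated as a chirp with the listed sweep direction and initial phase. The 48 kHz sampling rate and 0.001 s symbol length follow the experimental settings; treating the sweeps as linear chirps and the phase values as degrees are our assumptions:

```python
import numpy as np

FS = 48_000          # audio sampling rate (Hz), per the experimental setup
T_SYM = 0.001        # symbol length (s) -> 48 samples per symbol

# (f_start, f_end, phase_deg) for each 3-bit symbol, following Table 1.
SYMBOLS = {
    "000": (400, 1400, 0),   "001": (400, 1400, 187),
    "010": (1400, 400, 0),   "011": (1400, 400, 86),
    "100": (2400, 3400, 0),  "101": (2400, 3400, 271),
    "110": (3400, 2400, 0),  "111": (3400, 2400, 90),
}

def modulate(bits):
    """Generate one chirp symbol (assumed linear sweep) for a 3-bit string."""
    f0, f1, phi0 = SYMBOLS[bits]
    t = np.arange(int(FS * T_SYM)) / FS
    # Instantaneous phase of a linear chirp from f0 to f1 over T_SYM seconds.
    phase = np.deg2rad(phi0) + 2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * T_SYM))
    return np.cos(phase)

sym = modulate("101")
assert sym.shape == (48,)
```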
<p>We chose MobileNetV3-Small for its ability to achieve excellent accuracy with minimal complexity. It is a lightweight CNN model that can effectively capture spatial information in images and process features in a translation-invariant manner. This translation invariance also enables CNNs to handle signal data well: like image data, signal data exhibits spatial and local correlations, and the convolution and pooling operations of a CNN can extract local features from it, making the CNN a general-purpose structure for processing signals. Besides, the structure in <xref ref-type="fig" rid="fig-7">Fig. 7</xref> ensures that MobileNetV3-Small achieves higher accuracy with lower complexity, which makes it well suited to running on mobile phones.</p>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>The architecture of MobileNetV3-Small</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-7.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="fig-7">Fig. 7</xref>, MobileNetV3-Small consists of 11 instances of the network&#x2019;s basic building block, bneck. Each bneck includes a depthwise separable convolution, a squeeze-and-excitation (SE) module, and an inverted residual module. The lightweight depthwise separable convolution decomposes the standard convolution into two steps: a per-channel (depthwise) convolution followed by a pointwise channel-mixing convolution; the SE module introduces a global average pooling layer and models channel importance with a pair of fully connected layers; and the inverted residual module connects two basic convolution modules and uses shortcuts for cross-layer connections to pass information across layers. Together, these modules enhance the model&#x2019;s perception of important features while reducing computational complexity.</p>
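The computational saving of the depthwise separable convolution can be made concrete by counting multiplications; the layer sizes below are hypothetical examples, not taken from the model:

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplications of a standard k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def separable_mults(h, w, c_in, c_out, k):
    """Depthwise k x k pass per input channel, then a 1 x 1 pointwise mix."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 10 x 10 map, 24 -> 48 channels, 3 x 3 kernel.
std = conv_mults(10, 10, 24, 48, 3)
sep = separable_mults(10, 10, 24, 48, 3)
# Reduction factor of depthwise separable convolution: 1/c_out + 1/k^2.
assert abs(sep / std - (1 / 48 + 1 / 9)) < 1e-12
```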

<p><bold>Model parameters.</bold> In FastDOV, the PCA model reduces the dimensionality of any input to 100, since the raw features can be high-dimensional (if a symbol has 240 sampling points, its feature dimension will be 524). We use MobileNetV3-Small to demodulate symbols. In this model, each symbol is first reshaped into a 10 <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mo>&#x00D7;</mml:mo></mml:math></inline-formula> 10 single-channel image, from which the DL model extracts features. Since there are 8 different symbols in this experiment, the output size of the fully connected layer in MobileNetV3-Small is 8.</p>
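A minimal sketch of this reshaping step, assuming a NumPy feature vector and the (N, C, H, W) layout common to DL frameworks (the layout choice is our assumption):

```python
import numpy as np

# One PCA-reduced symbol: a 100-dimensional feature vector.
features = np.random.randn(100)

# Reshape into a 10 x 10 single-channel "image" for the CNN demodulator,
# with batch and channel axes as most DL frameworks expect (N, C, H, W).
image = features.reshape(1, 1, 10, 10)
assert image.shape == (1, 1, 10, 10)
```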
<p><bold>Preparing training dataset.</bold> The training data are self-collected by callers and callees over China Mobile networks. On the caller side, we randomly generate a large number of symbols and transmit them to the receiver using our modulation scheme. Upon receiving these symbols, the callee knows their corresponding labels. However, as mentioned in <xref ref-type="sec" rid="s4_1">Section 4.1</xref>, the position of each delimiter is difficult to determine accurately, so some received frames may be extracted incorrectly. To reduce the negative effect of low-quality training data on our model, we reuse Algorithm 1 to remedy <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula>, but with self-training and testing: we split the symbols for each candidate adjustment into training and testing sets, use the testing results to evaluate whether an adjustment gives the highest classification accuracy, and pick the adjustment with the highest accuracy to separate data frames. This step improves the quality of training data in practice.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Retransmission Mechanism</title>
<p>Although our MobileNet-based demodulator significantly improves symbol recognition, it still cannot guarantee 100% accuracy due to various reasons (for example, signal distortion, channel noise, or interference). Thus, we propose a retransmission mechanism to provide reliable communication when a non-recoverable error occurs.</p>
<p>The details of the retransmission mechanism are shown in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>. Upon receiving a symbol group, the receiver performs error checking and correction based on the parity bits received at the end of the group. If this succeeds, the data have been received without errors. If the correction fails, errors are present in the data, and the receiver sends a feedback pulse in the upcoming gap. Once the pulse is detected anywhere in the gap, the transmitter retransmits the symbol group just sent. Ideally, damaged or erroneous symbols would be retransmitted immediately after the receiver provides feedback in the gap. A practical alternative, however, is to delay the acknowledgment to the next data frame: for ease of implementation, we use gaps in data frame <italic>k</italic> to acknowledge symbol groups in data frame <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Retransmission mechanism</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-8.tif"/>
</fig>
<p>If the receiver cannot successfully receive some symbol groups in the last data frame, it sends an individual Acknowledgement (ACK) frame to notify the transmitter of the missing symbol groups. Like a normal data frame, the ACK frame comprises delimiters, symbol groups, and gaps. However, the data delivered by the symbol groups in an ACK frame are the indices of the errors. Once this ACK frame is received, the transmitter retransmits all damaged or missing symbol groups. Note that if the receiver has sent out an ACK frame but does not receive any retransmitted data, the ACK frame has been lost, and the receiver sends a new one.</p>
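The receiver-side bookkeeping for the ACK frame can be sketched as follows. The XOR parity here is a toy stand-in for the Reed-Solomon check, and all function names are ours, so this illustrates the control flow rather than the actual decoder:

```python
def parity(symbols):
    """Toy XOR parity over 3-bit symbols (stand-in for Reed-Solomon decoding)."""
    p = 0
    for s in symbols:
        p ^= s
    return p

def check_group(symbols, expected_parity):
    """Placeholder integrity check; the real system uses RS parity symbols."""
    return parity(symbols) == expected_parity

def receive_frame(groups):
    """Return indices of damaged symbol groups; these form the ACK-frame payload."""
    damaged = []
    for idx, (symbols, expected_parity) in enumerate(groups):
        if not check_group(symbols, expected_parity):
            damaged.append(idx)
    return damaged

# Two good groups and one corrupted group -> only index 1 is re-requested.
good = ([1, 2, 3], 0)   # 1 ^ 2 ^ 3 == 0
bad = ([1, 2, 3], 7)    # wrong parity simulates a damaged group
assert receive_frame([good, bad, good]) == [1]
```

On the transmitter side, the returned index list is exactly what an ACK frame would carry back, triggering retransmission of those groups.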
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Evaluation</title>
<sec id="s5_1">
<label>5.1</label>
<title>Experimental Methodology</title>
<p><bold>Experimental settings.</bold> As shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, our experiments involve two smartphones, which communicate through cellular networks, and two laptops used to modulate/demodulate the data. Due to strict permission controls in smartphone operating systems, ordinary applications cannot directly implement underlying modem functions. Thus, we cannot directly implement FastDOV on COTS smartphones without rooting the phone. Instead, we use a PC to emulate a Bluetooth speakerphone of the caller&#x2019;s device so that we can manipulate the calling voice on the PC. This is much easier to implement than rooting the phones. Similarly, we leverage another PC connected to the callee to demodulate the received audio data. It should be noted that if the phone manufacturers provide support in the future, we could easily adapt the implementation into a PC-free scheme.</p>
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Experimental settings with two smartphones and laptops</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-9.tif"/>
</fig>
<p>In our experiments, we use a Lenovo Y7000 laptop (Intel i7-9750H CPU) as an emulated Bluetooth speakerphone of a Huawei P60 (Snapdragon 8&#x002B; 4G Platform, EMUI 13.1 system), leveraging the Hands-free profile protocol [<xref ref-type="bibr" rid="ref-37">37</xref>]. The signal modulation part of FastDOV is implemented on the laptop in Python. During a phone call, the voice signals will be first modulated on the laptop and then fed into the emulated Bluetooth speakerphone. A Xiaomi Redmi Note 7 phone acts as a receiver to record phone calls, which will be demodulated by another Lenovo Y7000 laptop. As for the cellular networks, we use the telecommunication networks provided by the three major carriers in China, including China Mobile, China Telecom, and China Unicom.</p>
<p><bold>Model training.</bold> To train our MobileNet-based demodulation model, we collected a dataset of 28,800 modulation symbols with a 70/30 split for training and testing sets. In the data collection stage, we used an audio sampling rate of 48 kHz and a Reed-Solomon code ratio of 3 data bits to 1 parity bit. We utilized an interleaved structure for the symbol groups, where each group contains 600 symbols (with ten symbols used for error correction) and is separated by a 0.5 s gap. The length of each symbol is 0.001 s, resulting in 48 samples per symbol at the 48 kHz sampling rate. We used a 0.1 s linear chirp as the delimiter. All the collected modulation symbols are labeled into eight classes. <xref ref-type="table" rid="table-1">Table 1</xref> summarizes the symbol modulation and encoding scheme.</p>
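A quick sanity check of the quantities implied by these framing parameters:

```python
# Sanity check of the quantities implied by the framing parameters above.
FS = 48_000          # audio sampling rate (Hz)
SYMBOL_LEN = 0.001   # s per symbol
GROUP_LEN = 0.6      # s per symbol group
BITS_PER_SYMBOL = 3  # eight symbol classes (Table 1)

samples_per_symbol = round(FS * SYMBOL_LEN)           # 48 samples
symbols_per_group = round(GROUP_LEN / SYMBOL_LEN)     # 600 symbols
bits_per_group = BITS_PER_SYMBOL * symbols_per_group  # 1800 raw bits per group

assert samples_per_symbol == 48
assert symbols_per_group == 600
assert bits_per_group == 1800
```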

<p>The MobileNetV3-Small model used for demodulation is first initialized with parameters pretrained on the ILSVRC2012 dataset [<xref ref-type="bibr" rid="ref-38">38</xref>] and then fine-tuned on the modulation symbols collected in our experiments. We use the cross-entropy loss and the SGD (stochastic gradient descent) optimizer with the weight decay set to 0.8 during model training. We fine-tune the model for 40 epochs with a batch size of 64. The learning rate is 0.0005 for the first 30 epochs and decreases to 0.00005 for the last 10 epochs. During training, to improve model performance, we add noise sampled from a uniform distribution (<inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mo stretchy="false">[</mml:mo><mml:mn>0.0</mml:mn><mml:mo>,</mml:mo><mml:mn>1.0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>) to each model input.</p>
<p><bold>Performance metrics.</bold> We choose three widely accepted performance metrics, including accuracy, throughput, and goodput, to evaluate the prototype of FastDOV.</p>
<p><italic>Accuracy</italic> considered in this paper is twofold. First is the accuracy of locating the positions of start and end delimiters in received audio streams. It indicates how well our approach is able to separate modulated symbols. Second is the accuracy of our DL-based demodulator to classify symbols correctly.</p>
<p><italic>Throughput</italic> is the number of all received bits per unit of time, whether useful or not, including protocol overhead bits and duplicated bits.</p>
<p><italic>Goodput</italic> is the amount of useful information delivered to the receiver per unit of time. We calculate the goodput by dividing the data size by the finish time to transfer all data successfully. In practice, some communication systems may have large throughput but suffer from low goodput due to the overhead bits transferred between parties and a large number of retransmissions.</p>
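A toy example of the throughput/goodput distinction, with hypothetical numbers:

```python
# Toy illustration: 10,000 useful bits delivered in 8 s, but 14,000 bits were
# actually received on the channel (protocol overhead plus retransmitted duplicates).
useful_bits = 10_000
total_bits_received = 14_000
finish_time_s = 8.0

throughput = total_bits_received / finish_time_s  # counts every received bit
goodput = useful_bits / finish_time_s             # counts only useful data

assert throughput == 1750.0
assert goodput == 1250.0
```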
<p><bold>Experimental consistency.</bold> All comparative experiments are conducted under identical conditions: the same hardware platform (Huawei P60 and Xiaomi Redmi Note 7 smartphones with Lenovo Y7000 laptops), the same test dataset, and the same evaluation metrics.</p>
<p><bold>Evaluation goals.</bold> With the above experiment settings and performance metrics, we evaluate the performance of FastDOV by answering the following questions:
<list list-type="order">
<list-item><p><italic>RQ1</italic>: How do parameters such as delimiter length, symbol length, and gap length affect the performance of FastDOV in practice?</p></list-item>
<list-item><p><italic>RQ2</italic>: How do deep learning models affect the accuracy of FastDOV?</p></list-item>
<list-item><p><italic>RQ3</italic>: How is the performance of FastDOV affected by environmental factors such as cellular signal strengths and phone brands?</p></list-item>
<list-item><p><italic>RQ4</italic>: Can FastDOV improve the state-of-the-art (SOTA) data transmission systems over voice channels?</p></list-item>
</list></p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Parameter Study</title>
<p>To answer <italic>RQ1</italic>, we conducted experiments to study the impacts of various parameters, including the <italic>delimiter pattern/length</italic>, <italic>symbol length</italic>, and <italic>duration of the gap</italic>, by changing the value of each parameter with the others fixed. All experiments in this section were conducted in environments with strong cellular signals to achieve stable and reliable results; the next section considers the scenario with weak signal strength. We tested FastDOV with all cellular carriers but only present the results for China Mobile, as they all show similar performance.</p>
<p><bold>Delimiter pattern and length.</bold> The positioning of the delimiter is a prerequisite for FastDOV to demodulate the received symbols correctly. Different delimiter patterns (e.g., random sine signals and various types of chirps) and their lengths may affect the accuracy of separating modulated symbols in FastDOV. It is necessary to understand their performance and choose the best value of this parameter for practical use.</p>
<p>We tested different delimiter patterns, including the linear chirp, quadratic chirp, hyperbolic chirp, logarithmic chirp, and random values, each at lengths of 0.1, 0.2, and 0.3 s. <xref ref-type="table" rid="table-2">Table 2</xref> shows the similarity (in terms of correlation coefficients) between the sent and received delimiters, where a value closer to 1 means higher similarity. The table indicates that chirp signals of any shape (similarity above 0.98) are superior to random sine signals (similarity only 0.2&#x2013;0.3). Therefore, we choose the chirp signal as the delimiter.</p>
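The similarity measurement can be sketched as follows, assuming a linear chirp delimiter and additive Gaussian channel noise; the 400&#x2013;3400 Hz sweep and the noise level are illustrative choices of ours, not values from the experiments:

```python
import numpy as np

FS = 48_000  # audio sampling rate (Hz)

def linear_chirp(f0, f1, duration):
    """Linear frequency sweep from f0 to f1 Hz over `duration` seconds."""
    t = np.arange(int(FS * duration)) / FS
    return np.cos(2 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2 * duration)))

rng = np.random.default_rng(0)
sent = linear_chirp(400, 3400, 0.1)
received = sent + 0.1 * rng.standard_normal(sent.size)  # mild channel noise

# Pearson correlation between sent and received delimiters (cf. Table 2).
r = np.corrcoef(sent, received)[0, 1]
assert r > 0.95
```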
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Correlation coefficients of sent and received delimiters</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Length</th>
<th>Linear</th>
<th>Quadratic</th>
<th>Hyperbolic</th>
<th>Logarithmic</th>
<th>Random</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.1 s</td>
<td>0.986</td>
<td>0.987</td>
<td>0.987</td>
<td>0.987</td>
<td>0.2495</td>
</tr>
<tr>
<td>0.2 s</td>
<td>0.988</td>
<td>0.983</td>
<td>0.984</td>
<td>0.987</td>
<td>0.3398</td>
</tr>
<tr>
<td>0.3 s</td>
<td>0.990</td>
<td>0.985</td>
<td>0.985</td>
<td>0.989</td>
<td>0.2876</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Besides, <xref ref-type="table" rid="table-3">Table 3</xref> presents the demodulation accuracy of our approach with these delimiter patterns and lengths. Regardless of how the delimiter changes, the correlation is high enough to determine symbol positions accurately, and our demodulator&#x2019;s accuracy is above 0.98. After Reed-Solomon (RS) error correction, the accuracy even increases to 100% without retransmission. Since there are no big differences among the chirp-based delimiters, we choose a 0.1 s linear chirp in practice to reduce computational complexity.</p>
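The cross-correlation positioning step can be sketched with a synthetic noisy stream and a known true delimiter offset; all numeric values here are illustrative, and the chirp parameters are our assumptions:

```python
import numpy as np

FS = 48_000
rng = np.random.default_rng(1)

# A 0.1-s linear chirp delimiter (illustrative 400 -> 3400 Hz sweep).
t = np.arange(int(FS * 0.1)) / FS
delimiter = np.cos(2 * np.pi * (400 * t + 3000 * t**2 / (2 * 0.1)))

# Embed the delimiter at a known offset inside 1 s of noisy received audio.
stream = 0.05 * rng.standard_normal(FS)
true_start = 12_345
stream[true_start:true_start + delimiter.size] += delimiter

# Cross-correlate and take the peak as the estimated delimiter position.
corr = np.correlate(stream, delimiter, mode="valid")
est_start = int(np.argmax(corr))
assert abs(est_start - true_start) < 10  # small position error, cf. Fig. 10a
```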
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Accuracy (%) of our DL-based demodulation with different delimiter patterns and lengths</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Length</th>
<th>Linear</th>
<th>Quadratic</th>
<th>Hyperbolic</th>
<th>Logarithmic</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.1 s</td>
<td>99.0</td>
<td>99.1</td>
<td>98.4</td>
<td>98.9</td>
</tr>
<tr>
<td>0.2 s</td>
<td>98.8</td>
<td>98.7</td>
<td>98.7</td>
<td>98.9</td>
</tr>
<tr>
<td>0.3 s</td>
<td>98.8</td>
<td>98.2</td>
<td>98.3</td>
<td>99.0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>With the above delimiter, we studied <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> (i.e., the difference between the derived and actual number of samples in a data frame) calculated by our cross-correlation method and compared the demodulation accuracy obtained before and after our position remedying algorithm in practice. <xref ref-type="fig" rid="fig-10">Fig. 10a</xref> shows the Cumulative Distribution Function (CDF) of <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula> in 100 experiments. It is seen that <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mrow><mml:mi>&#x03B4;</mml:mi></mml:mrow><mml:mspace width="negativethinmathspace" /><mml:mo>&#x2223;</mml:mo><mml:mspace width="thinmathspace" /><mml:mo>&#x003C;</mml:mo><mml:mn>10</mml:mn></mml:math></inline-formula> accounted for 95% of the total experiments. Our cross-correlation method indeed achieved relatively high accuracy according to our experiments.</p>
<fig id="fig-10">
<label>Figure 10</label>
<caption>
<title>Performance of our cross-correlation method and position remedying algorithm in FastDOV</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-10.tif"/>
</fig>
<p>Taking <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>&#x03B4;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>19</mml:mn><mml:mo>,</mml:mo><mml:mn>72</mml:mn></mml:math></inline-formula> to denote a small, medium, and large error occurring in practice, we compared the accuracy of symbol demodulation before and after applying the position remedying algorithm. <xref ref-type="fig" rid="fig-10">Fig. 10b</xref> shows that demodulation based on the adjusted delimiter position achieves higher accuracy than that based on the originally predicted position, indicating that our proposed solution works well to improve accuracy.</p>

<p>As shown in <xref ref-type="fig" rid="fig-10">Fig. 10c</xref>, the RS code we use can fully correct all symbols when the demodulation accuracy exceeds 96%, as well as in some cases where the accuracy is 94%&#x0223C;96%; no retransmission is required in these cases. Otherwise, the retransmission mechanism is triggered to fetch all missing symbols.</p>

<p><bold>Symbol length.</bold> Intuitively, a longer symbol is more likely to be recognized correctly (due to a larger feature space), but at the cost of a longer transmission time. To investigate the impact of symbol length on the system&#x2019;s reliability, we tested five symbol lengths: 0.001, 0.002, 0.004, 0.005, and 0.01 s, while the symbol group and gap were fixed to last 0.6 and 0.5 s, respectively.</p>
<p>In <xref ref-type="fig" rid="fig-11">Fig. 11a</xref>, we see that longer symbols have higher demodulation accuracy as expected, but the improvement is marginal. After error correction, all symbols (100%) can be correctly restored without retransmission, regardless of the symbol length. However, when the signal strength is not strong enough, increasing the symbol length may not help improve the demodulation accuracy; the details can be found in the next section.</p>
<fig id="fig-11">
<label>Figure 11</label>
<caption>
<title>Accuracy of DL-based demodulation before and after Reed&#x2013;Solomon error correction in different conditions of signal strengths</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73201-fig-11.tif"/>
</fig>
<p>However, as reported in <xref ref-type="table" rid="table-4">Table 4</xref>, both the throughput and goodput decrease as the symbol length increases. When the symbol length is set to 0.001 s, our system achieves an excellent goodput of 1338.8 bit/s when the signal is strong and 1201.3 bit/s when the signal is weak. Hence, we use the symbol length of 0.001 s (i.e., 48 samples at the sampling rate of 48 kHz) for practical use.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Throughput/goodput with different symbol lengths (bit/s)</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Symbol length</th>
<th>0.001 s</th>
<th>0.002 s</th>
<th>0.004 s</th>
<th>0.005 s</th>
<th>0.01 s</th>
</tr>
</thead>
<tbody>
<tr>
<td>Throughput</td>
<td>1785.1</td>
<td>853.8</td>
<td>427.7</td>
<td>347.8</td>
<td>173.5</td>
</tr>
<tr>
<td>Goodput (Strong)</td>
<td>1338.8</td>
<td>640.3</td>
<td>320.8</td>
<td>260.9</td>
<td>130.1</td>
</tr>
<tr>
<td>Goodput (Fair)</td>
<td>1332.95</td>
<td>629.5</td>
<td>320.8</td>
<td>255.1</td>
<td>130.1</td>
</tr>
<tr>
<td>Goodput (Poor)</td>
<td>1201.3</td>
<td>568.7</td>
<td>291.9</td>
<td>223.2</td>
<td>115.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>SG and gap lengths.</bold> The lengths of the symbol group (SG) and the gap determine whether the transmitted signal can avoid being suppressed by VAD/DTX. We explore the impact of these two parameters and show the results in <xref ref-type="table" rid="table-5">Table 5</xref>.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Impact of the lengths of the symbol group and gap</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>SG</th>
<th>Gap</th>
<th>VAD triggered?</th>
<th>DL acc (%)</th>
<th>DL&#x002B;RS acc (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5">0.6 s</td>
<td>0.1 s</td>
<td>Yes</td>
<td>60.5</td>
<td>63.9</td>
</tr>
<tr>
<td>0.2 s</td>
<td>Yes</td>
<td>64.2</td>
<td>66.7</td>
</tr>
<tr>
<td>0.3 s</td>
<td>Yes</td>
<td>72.5</td>
<td>75.4</td>
</tr>
<tr>
<td>0.4 s</td>
<td>No</td>
<td>97.2</td>
<td>100</td>
</tr>
<tr>
<td>0.5 s</td>
<td>No</td>
<td>98.7</td>
<td>100</td>
</tr>
<tr>
<td rowspan="5">0.8 s</td>
<td>0.1 s</td>
<td>No</td>
<td>54.6</td>
<td>55.7</td>
</tr>
<tr>
<td>0.2 s</td>
<td>Yes</td>
<td>56.4</td>
<td>58.3</td>
</tr>
<tr>
<td>0.3 s</td>
<td>Yes</td>
<td>61.7</td>
<td>64.2</td>
</tr>
<tr>
<td>0.4 s</td>
<td>No</td>
<td>89.5</td>
<td>93.2</td>
</tr>
<tr>
<td>0.5 s</td>
<td>No</td>
<td>97.1</td>
<td>100</td>
</tr>
<tr>
<td rowspan="5">1.0 s</td>
<td>0.1 s</td>
<td>Yes</td>
<td>60.5</td>
<td>63.9</td>
</tr>
<tr>
<td>0.2 s</td>
<td>Yes</td>
<td>51.7</td>
<td>54.3</td>
</tr>
<tr>
<td>0.3 s</td>
<td>Yes</td>
<td>55.3</td>
<td>57.9</td>
</tr>
<tr>
<td>0.4 s</td>
<td>Yes</td>
<td>82.6</td>
<td>85.7</td>
</tr>
<tr>
<td>0.5 s</td>
<td>No</td>
<td>89.5</td>
<td>93.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As the SG length decreases and the gap length increases, received signals are less affected by VAD/DTX, which increases the demodulation accuracy. With an SG length of 0.6 s, FastDOV achieves the best demodulation accuracy compared to the other lengths of 0.8 and 1 s. Additionally, with the SG duration fixed at 0.6 s, FastDOV demonstrates perfect (100%) accuracy following Reed-Solomon (RS) error correction for gap lengths of 0.4 s or more. Therefore, we empirically use a gap length of 0.5 s and a symbol group duration of 0.6 s to achieve the optimal performance.</p>
</sec>
<sec id="s5_3">
<label>5.3</label>
<title>Effects of Deep Learning Models</title>
<p>In our preliminary study [<xref ref-type="bibr" rid="ref-39">39</xref>], we used a tailored ResNet model for symbol demodulation, while we propose to use a lightweight MobileNet model in this work. To compare their performance differences, we conducted experiments about accuracy and computational cost with the gap length of 0.5 s and symbol group duration of 0.6 s.</p>
<p>As shown in <xref ref-type="table" rid="table-6">Table 6</xref>, the proposed MobileNetV3-Small-based solution requires only 0.0029 GFLOPs to receive 3 bits of data, while ResNet34 requires 0.051 GFLOPs. Both models have high demodulation accuracy and achieve 100% demodulation after RS error correction. Since mobile platforms (e.g., iPhone and Qualcomm chipsets) commonly provide computational capabilities exceeding 1000 GFLOPS, our solution is highly practical on modern mobile devices.</p>
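A back-of-envelope compute budget, assuming one network inference per symbol and the framing parameters from Section 5.1 (this duty-cycle assumption is ours):

```python
# Back-of-envelope sustained compute for real-time demodulation.
GFLOPS_PER_SYMBOL = 0.0029     # cost of one MobileNetV3-Small inference (Table 6)
GROUP_LEN, GAP_LEN = 0.6, 0.5  # s; one 600-symbol group per group+gap period

symbols_per_second = 600 / (GROUP_LEN + GAP_LEN)          # ~545 symbols/s
required_gflops = GFLOPS_PER_SYMBOL * symbols_per_second  # ~1.6 GFLOPS sustained

# Well below the >1000 GFLOPS offered by modern mobile platforms.
assert 1.0 < required_gflops < 2.0
```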
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Comparison of accuracy (%) and computational cost (giga floating-point operations, GFLOPs) between ResNet34 and MobileNetV3-Small</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Deep learning model</th>
<th>DL accuracy</th>
<th>DL &#x002B; RS accuracy</th>
<th>Computational cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>ResNet34</td>
<td>99.0</td>
<td>100</td>
<td>0.051</td>
</tr>
<tr>
<td>MobileNetV3-Small</td>
<td>98.7</td>
<td>100</td>
<td>0.00296</td>
</tr>
</tbody>
</table>
</table-wrap>
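<p>To make the practicality claim concrete, one can estimate what fraction of a device's floating-point budget the demodulator consumes at the reported goodput. A rough sketch (the 1000 GFLOPS budget is the ballpark figure cited above, not a measured device specification):</p>

```python
def demod_load_fraction(gflops_per_symbol, bits_per_symbol,
                        goodput_bps, device_gflops):
    """Sustained compute load of symbol demodulation as a fraction of the
    device's floating-point budget (inputs are reported/assumed values)."""
    symbols_per_s = goodput_bps / bits_per_symbol
    required_gflops_per_s = gflops_per_symbol * symbols_per_s
    return required_gflops_per_s / device_gflops

# MobileNetV3-Small: 0.0029 GFLOPs per 3-bit symbol at 1291 bit/s goodput,
# against an assumed 1000 GFLOPS device budget
load = demod_load_fraction(0.0029, 3, 1291.0, 1000.0)
```

<p>Under these assumptions the demodulator consumes well under 1% of the device budget, consistent with the feasibility argument above.</p>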
<p>While our MobileNetV3-Small-based design already demonstrates strong performance in terms of both model performance (with an accuracy <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:mo>&#x003E;</mml:mo><mml:mspace width="negativethinmathspace" /><mml:mn>97</mml:mn><mml:mi mathvariant="normal">&#x0025;</mml:mi></mml:math></inline-formula> and an average goodput of 1291.0 bit/s) and low computational cost, we sought to further study the relationship between these two factors. Specifically, we conduct experiments that iteratively reduce the number of layers/modules in the MobileNetV3-Small network (which consists of two CNN layers and 11 bneck modules) and report the demodulation accuracy and computational cost of each shrunken model.</p>
<p><xref ref-type="table" rid="table-7">Table 7</xref> summarizes the experimental results. Even when retaining only a few bneck modules from the original MobileNetV3-Small architecture, demodulation accuracy remains high; however, using only the CNN layers leads to a sharp decline in performance. Specifically, models with at least two bneck modules consistently achieve demodulation accuracy &#x003E;97%, nearly matching the full 11-module model (first row of <xref ref-type="table" rid="table-7">Table 7</xref>). With a single bneck module, accuracy remains relatively high at 94.7%. In contrast, CNN-only models with one or two layers perform unsatisfactorily (accuracy &#x003C;82%). These results suggest that FastDOV&#x2019;s computational efficiency can be improved further without sacrificing accuracy. We leave the exploration of extremely compact architectures to future work, noting that our proposed design already achieves an effective balance.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>Accuracy (%) and computational cost (GFLOPs) of different shrunken models</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Shrunken model</th>
<th>Accuracy</th>
<th>Computational cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>2 CNN &#x002B; 11 bneck</td>
<td>97.6</td>
<td>0.00296</td>
</tr>
<tr>
<td>2 CNN &#x002B; 10 bneck</td>
<td>97.7</td>
<td>0.00241</td>
</tr>
<tr>
<td>2 CNN &#x002B; 9 bneck</td>
<td>97.8</td>
<td>0.00186</td>
</tr>
<tr>
<td>2 CNN &#x002B; 8 bneck</td>
<td>97.8</td>
<td>0.00163</td>
</tr>
<tr>
<td>2 CNN &#x002B; 7 bneck</td>
<td>97.9</td>
<td>0.00158</td>
</tr>
<tr>
<td>2 CNN &#x002B; 6 bneck</td>
<td>97.9</td>
<td>0.00154</td>
</tr>
<tr>
<td>2 CNN &#x002B; 5 bneck</td>
<td>97.9</td>
<td>0.00141</td>
</tr>
<tr>
<td>2 CNN &#x002B; 4 bneck</td>
<td>98.1</td>
<td>0.00138</td>
</tr>
<tr>
<td>2 CNN &#x002B; 3 bneck</td>
<td>97.8</td>
<td>0.00133</td>
</tr>
<tr>
<td>2 CNN &#x002B; 2 bneck</td>
<td>97.7</td>
<td>0.00129</td>
</tr>
<tr>
<td>2 CNN &#x002B; 1 bneck</td>
<td>94.7</td>
<td>0.00125</td>
</tr>
<tr>
<td>2 CNN</td>
<td>80.5</td>
<td>0.000375</td>
</tr>
<tr>
<td>1 CNN</td>
<td>81.6</td>
<td>0.000041</td>
</tr>
</tbody>
</table>
</table-wrap>
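<p>The shrinking procedure itself amounts to bookkeeping over an ordered list of stages: keep the two CNN layers, truncate trailing bneck modules, and sum the per-stage costs. A sketch with hypothetical per-stage GFLOPs (Table 7 reports only totals, so the individual values below are illustrative, not the actual per-module breakdown):</p>

```python
# Hypothetical per-stage costs in GFLOPs (illustrative values only).
CNN_IN, CNN_OUT = 0.0003, 0.0006
BNECKS = [0.0009, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001,
          0.0001, 0.0001, 0.0001, 0.0001, 0.0001]  # 11 bneck modules

def shrunken_cost(n_bneck):
    """Total cost of a variant keeping both CNN layers and the
    first n_bneck bneck modules."""
    return CNN_IN + CNN_OUT + sum(BNECKS[:n_bneck])

# Costs for the variants of Table 7, from 11 bnecks down to 1
costs = [shrunken_cost(n) for n in range(11, 0, -1)]
```

<p>Pairing each such cost with the measured accuracy of the corresponding variant yields the accuracy/cost trade-off curve discussed above.</p>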
</sec>
<sec id="s5_4">
<label>5.4</label>
<title>Performance by Environmental Factors</title>
<p>To answer RQ3, we conducted experiments across different cellular signal strengths, carriers, and phone brands. In the following experiments, we use the optimal parameters for the best data rate: a 0.1 s linear chirp as the delimiter, a symbol length of 0.001 s, a symbol group duration of 0.6 s, and a gap length of 0.5 s.</p>
<p><bold>Cellular signal strengths.</bold> As shown in <xref ref-type="table" rid="table-8">Table 8</xref>, we tested the performance of FastDOV in three scenarios with different average signal strengths. Signal strength was measured by Reference Signal Received Power (RSRP), the average power received from a single reference signal in decibel-milliwatts (dBm), and Signal-to-Interference-plus-Noise Ratio (SINR), the ratio of the received signal power to the interference-plus-noise power in decibels (dB).</p>
<table-wrap id="table-8">
<label>Table 8</label>
<caption>
<title>Signal strength reference in terms of RSRP and SINR</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Scenario</th>
<th>Signal strength</th>
<th>RSRP (dBm)</th>
<th>SINR (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Open area</td>
<td>Strong</td>
<td>&#x2212;90<inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>&#x2212;85</td>
<td>15<inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>20</td>
</tr>
<tr>
<td>Inside building</td>
<td>Fair</td>
<td>&#x2212;100<inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>&#x2212;90</td>
<td>11<inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>15</td>
</tr>
<tr>
<td>Basement</td>
<td>Poor</td>
<td>&#x2212;106<inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>&#x2212;100</td>
<td>0<inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mo>&#x223C;</mml:mo></mml:math></inline-formula>6</td>
</tr>
</tbody>
</table>
</table-wrap>
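<p>The scenario labels above can be reproduced by thresholding measured RSRP/SINR values. A small sketch following the ranges in Table 8 (the handling of range boundaries is our own choice, since the table leaves edges unspecified):</p>

```python
def classify_signal(rsrp_dbm, sinr_db):
    """Map a measured (RSRP, SINR) pair to the scenario labels of Table 8:
    strong (open area), fair (inside building), or poor (basement)."""
    if rsrp_dbm >= -90 and sinr_db >= 15:
        return "strong"
    if rsrp_dbm >= -100 and sinr_db >= 11:
        return "fair"
    return "poor"
```

<p>Such a classifier is useful for binning field measurements before aggregating accuracy and goodput per scenario.</p>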
<p><xref ref-type="fig" rid="fig-11">Fig. 11</xref> presents the results of our DL-based demodulation scheme. FastDOV achieves high accuracy in most cases. When the signal strength is strong, the demodulation accuracy exceeds 98% (the blue curve in <xref ref-type="fig" rid="fig-11">Fig. 11a</xref>) and reaches 100% after Reed&#x2013;Solomon error correction, indicating that our approach needs no retransmission in the strong-signal case. When the signal quality is fair, the accuracy drops to 95%, and the remaining errors cannot always be corrected by the RS code. When the quality is poor, the channel becomes unstable and the demodulation accuracy is only slightly above 83%. In these cases, retransmission is needed to achieve reliable data communication.</p>

<p><xref ref-type="table" rid="table-4">Table 4</xref> shows the throughput and goodput of the system. As expected, the goodput drops from 1338 bit/s to 1201 bit/s as the signal strength weakens. With the help of retransmission, performance remains acceptable even when the signal strength is poor. Overall, across the three signal-strength environments and with a symbol length of 0.001 s, the average goodput of FastDOV is 1291.0 bit/s, which is a favorable result.</p>

<p><bold>Cellular carriers.</bold> We tested FastDOV on cellular networks provided by the three major carriers in China, namely China Mobile, China Telecom, and China Unicom. The experiments demonstrate consistent performance, with goodput exceeding 1260 bit/s across all networks and varying within a narrow 10 bit/s range (<xref ref-type="table" rid="table-9">Table 9</xref>). These results highlight the robustness of FastDOV under different cellular network conditions and its adaptability across carriers, further establishing it as a reliable communication solution.</p>
<table-wrap id="table-9">
<label>Table 9</label>
<caption>
<title>Goodput (bit/s) in different cellular networks</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Receiver/Sender</th>
<th>China Mobile</th>
<th>China Telecom</th>
<th>China Unicom</th>
</tr>
</thead>
<tbody>
<tr>
<td>China Mobile</td>
<td>1270.1</td>
<td>1266.5</td>
<td>1267.3</td>
</tr>
<tr>
<td>China Telecom</td>
<td>1266.2</td>
<td>1270.9</td>
<td>1261.1</td>
</tr>
<tr>
<td>China Unicom</td>
<td>1268.3</td>
<td>1263.1</td>
<td>1269.8</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Phone brands/models.</bold> To evaluate robustness across different phone models, we tested FastDOV&#x2019;s performance on three devices: Huawei P60, Vivo Y30, and Honor 7X. Results in <xref ref-type="table" rid="table-10">Table 10</xref> show that the performance of FastDOV is not affected by the phone model. Specifically, FastDOV demonstrates consistent demodulation accuracy above 96% across all devices before Reed-Solomon error correction. After correction, FastDOV achieves 100% accuracy and 1338.8 bit/s goodput regardless of the phone model.</p>
<table-wrap id="table-10">
<label>Table 10</label>
<caption>
<title>Accuracy (%) and goodput (bit/s) when using different phone models</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Phone</th>
<th>Huawei P60</th>
<th>Vivo Y30</th>
<th>Honor 7X</th>
</tr>
</thead>
<tbody>
<tr>
<td>DL accuracy</td>
<td>98.8</td>
<td>97.7</td>
<td>96.8</td>
</tr>
<tr>
<td>RS accuracy</td>
<td>100.0</td>
<td>100.0</td>
<td>100.0</td>
</tr>
<tr>
<td>Goodput</td>
<td>1338.8</td>
<td>1338.8</td>
<td>1338.8</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5_5">
<label>5.5</label>
<title>Comparison to SOTA Approaches</title>
<p>We compare FastDOV with representative methods that share the same operational constraint: <italic>no modification to telecommunication infrastructure</italic>. The selected baselines&#x2014;Hermes [<xref ref-type="bibr" rid="ref-10">10</xref>], Authloop [<xref ref-type="bibr" rid="ref-11">11</xref>], and others&#x2014;represent the state-of-the-art approaches for end-to-end data transmission over unmodified voice channels, making them ideal references for evaluating our system&#x2019;s performance.</p>
<p>To answer RQ4, we compare our proposed FastDOV with other representative data-transmission-over-voice-channel methods, including Hermes [<xref ref-type="bibr" rid="ref-10">10</xref>] and Authloop [<xref ref-type="bibr" rid="ref-11">11</xref>]. As the payload, we transmit an existing web certificate (that of Bilibili, 2.1 kB). This certificate serves only to test the performance of FastDOV; we will design a phone certificate suitable for telephone transmission in future work.</p>
<p><xref ref-type="table" rid="table-11">Table 11</xref> shows the comparison. Our proposed FastDOV outperforms the other methods by a large margin. The throughput of FastDOV is 1785.1 bit/s, whereas Hermes and Xu et al. [<xref ref-type="bibr" rid="ref-40">40</xref>] achieve throughputs of only 1200 and 1330 bit/s, respectively. The average goodput of FastDOV is 1291.0 bit/s (with a symbol length of 0.001 s), whereas Authloop achieves a goodput of only 500 bit/s.</p>
<table-wrap id="table-11">
<label>Table 11</label>
<caption>
<title>Performance comparison between various methods</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Method</th>
<th>Throughput/bps</th>
<th>Goodput/bps</th>
<th>Modulator</th>
<th>Synchronization</th>
<th>Retransmission</th>
</tr>
</thead>
<tbody>
<tr>
<td>FastDOV</td>
<td>1785.1</td>
<td>1291.0</td>
<td>Hybrid</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
</tr>
<tr>
<td>Hermes</td>
<td>1200</td>
<td>NA</td>
<td>FKS</td>
<td>&#x2717;</td>
<td>&#x2717;</td>
</tr>
<tr>
<td>Authloop</td>
<td>NA</td>
<td>500</td>
<td>3-FKS</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
</tr>
<tr>
<td>Rashidi et al. [<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>800</td>
<td>NA</td>
<td>Hybrid</td>
<td>&#x2717;</td>
<td>&#x2717;</td>
</tr>
<tr>
<td>Rashidi et al. [<xref ref-type="bibr" rid="ref-24">24</xref>]</td>
<td>1150</td>
<td>NA</td>
<td>Formants-based</td>
<td>&#x2713;</td>
<td>&#x2717;</td>
</tr>
<tr>
<td>Xu et al. [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>1330</td>
<td>NA</td>
<td>PSK</td>
<td>&#x2717;</td>
<td>&#x2717;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Additionally, FastDOV and Authloop include synchronization and retransmission designs that ensure the received signal is complete and correct, whereas the other methods omit these designs (Rashidi et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] includes synchronization, but it is not detailed).</p>
</sec>
</sec>
<sec id="s6">
<label>6</label>
<title>Discussion</title>
<p>The practical deployment of the proposed authentication system faces challenges in security, practicality, and generalization. Security-wise, beyond common replay attacks, the system is vulnerable to adversarial audio injections that may deceive the deep learning model. Practically, the model must achieve low inference latency in real calling environments while handling diverse acoustic conditions and user dialects. Furthermore, the current training and evaluation are based primarily on data from Chinese telecom networks, limiting the system&#x2019;s direct applicability to international networks with different codecs and infrastructure standards.</p>
<p>To address these limitations, future work will focus on three key directions. Security enhancements will include dynamic challenge-response mechanisms and adversarial training to improve robustness. Performance optimization will involve model lightweighting for mobile devices and self-supervised learning for better generalization. For broader applicability, we will investigate cross-domain adaptation techniques to make the system compatible with diverse international telecom environments. These efforts will help bridge the gap between the current prototype and practical deployment.</p>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>We propose a deep-learning-based acoustic modulation scheme named FastDOV to transmit data over mobile voice channels without infrastructure support. FastDOV consists of a novel chirp-based modulation (mixing three signal characteristics) and a tailored MobileNetV3-Small model for demodulation. The demodulation model requires only 0.0029 GFLOPs to receive 3 bits of data and can easily run on mobile phones. FastDOV achieves 1291.0 bit/s goodput on average, outperforming existing approaches, and works on the three major cellular networks in China. To further improve running speed, we scaled down the MobileNetV3 model; our experiments show that variants retaining two or more bneck modules reduce computational cost while maintaining accuracy. In addition, we demonstrate that our system can transfer digital certificates over voice channels to prevent telecom fraud.</p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>Jian Huang and Hao Han conceived the research idea and designed the methodology. Mingwei Li conducted the majority of the experiments and data analysis. Yulong Tian and Yi Yao reviewed, edited, and provided critical feedback on the manuscript. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>Not applicable.</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<app-group id="appg-1">
<app id="app-1">
<title>Appendix A Mathematical Derivations</title>
<sec id="s8">
<title>Appendix A.1 Reed-Solomon Code Mathematical Formulation</title>
<p>The Reed-Solomon (RS) code employed in FastDOV uses parameters <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>255</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>223</mml:mn><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>16</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which can correct up to 16 symbol errors. The mathematical formulation is as follows:</p>
<p>Let <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">F</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula> be a finite field with <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msup><mml:mn>2</mml:mn><mml:mi>m</mml:mi></mml:msup></mml:math></inline-formula> elements, where <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula> in our implementation. The generator polynomial <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> for a <italic>t</italic>-error-correcting RS code is defined as:
<disp-formula id="eqn-A1"><label>(A1)</label><mml:math id="mml-eqn-A1" display="block"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>t</mml:mi></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is a primitive element in <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mrow><mml:mi mathvariant="double-struck">F</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula>. For our parameters <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mo stretchy="false">(</mml:mo><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>255</mml:mn><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>223</mml:mn><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>16</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the generator polynomial becomes:
<disp-formula id="eqn-A2"><label>(A2)</label><mml:math id="mml-eqn-A2" display="block"><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
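<p>The generator polynomial in Eq. (A2) can be constructed by repeated polynomial multiplication over GF(2^8). A minimal sketch in Python (assuming the common primitive field polynomial 0x11D and primitive element alpha = 2, which the text above does not specify):</p>

```python
# GF(2^8) antilog/log tables built from the primitive polynomial 0x11D
# (an assumption: the paper fixes m = 8 but not the field polynomial).
GF_EXP, GF_LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x *= 2              # multiply by alpha = 2
    if x > 255:
        x ^= 0x11D      # reduce modulo the field polynomial
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]   # wrap so exponent sums need no mod

def gf_mul(a, b):
    return 0 if 0 in (a, b) else GF_EXP[GF_LOG[a] + GF_LOG[b]]

def gf_poly_eval(p, x):
    """Horner evaluation; addition in GF(2^m) is XOR."""
    y = p[0]
    for c in p[1:]:
        y = gf_mul(y, x) ^ c
    return y

def rs_generator_poly(t):
    """g(x) = prod_{i=1}^{2t} (x - alpha^i), coefficients high to low degree.
    Subtraction equals addition (XOR) in GF(2^m)."""
    g = [1]
    for i in range(1, 2 * t + 1):
        root = GF_EXP[i]                    # alpha^i
        out = [0] * (len(g) + 1)
        for j, c in enumerate(g):
            out[j] ^= c                     # c * x term
            out[j + 1] ^= gf_mul(c, root)   # c * alpha^i term
        g = out
    return g

g = rs_generator_poly(16)   # degree 32, matching Eq. (A2)
```

<p>By construction, each alpha^i for i = 1, ..., 32 is a root of g(x), which is exactly the property the decoder's syndrome computation relies on.</p>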
<p>The encoding process transforms a message polynomial <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of degree <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> into a codeword polynomial <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>c</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of degree <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>:
<disp-formula id="eqn-A3"><label>(A3)</label><mml:math id="mml-eqn-A3" display="block"><mml:mi>c</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the remainder polynomial satisfying:
<disp-formula id="eqn-A4"><label>(A4)</label><mml:math id="mml-eqn-A4" display="block"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2261;</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="1em" /><mml:mi>mod</mml:mi><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
<p>The Berlekamp-Massey algorithm is used for decoding, which involves solving the key equation:
<disp-formula id="eqn-A5"><label>(A5)</label><mml:math id="mml-eqn-A5" display="block"><mml:mi mathvariant="normal">&#x039B;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2261;</mml:mo><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="1em" /><mml:mi>mod</mml:mi><mml:mspace width="thinmathspace" /><mml:mspace width="thinmathspace" /><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></disp-formula>where <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi mathvariant="normal">&#x039B;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the error locator polynomial, <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the syndrome polynomial, and <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the error evaluator polynomial. The Chien search and Forney algorithm are then applied to locate and evaluate errors.</p>
</sec>
<sec id="s9">
<title>Appendix A.2 Welford&#x2019;s One-Pass Algorithm Details</title>
<p>Welford&#x2019;s algorithm provides a numerically stable method for computing variance and correlation in a single pass. For the cross-correlation computation in delimiter detection, the algorithm proceeds as follows:</p>
<p>Let <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> be sequences of samples, with means <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover></mml:math></inline-formula>, respectively. The sample correlation coefficient <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:mi>r</mml:mi></mml:math></inline-formula> is computed using:
<disp-formula id="eqn-A6"><label>(A6)</label><mml:math id="mml-eqn-A6" display="block"><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>u</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt><mml:msqrt><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mover><mml:mi>v</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt></mml:mrow></mml:mfrac></mml:math></disp-formula>
<disp-formula id="eqn-A7"><label>(A7)</label><mml:math id="mml-eqn-A7" display="block"><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x2211;</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mo>&#x2211;</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2211;</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mi>m</mml:mi><mml:mo>&#x2211;</mml:mo><mml:msubsup><mml:mi>u</mml:mi><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2211;</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt><mml:msqrt><mml:mi>m</mml:mi><mml:mo>&#x2211;</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2211;</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>The online computation uses the following recurrence relations:
<disp-formula id="eqn-A8"><label>(A8)</label><mml:math id="mml-eqn-A8" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-A9"><label>(A9)</label><mml:math id="mml-eqn-A9" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-A10"><label>(A10)</label><mml:math id="mml-eqn-A10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-A11"><label>(A11)</label><mml:math id="mml-eqn-A11" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-A12"><label>(A12)</label><mml:math id="mml-eqn-A12" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mi>C</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msub><mml:mi>M</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> represents the mean after <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:mi>k</mml:mi></mml:math></inline-formula> samples, <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msub><mml:mi>S</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> the running sum of squared deviations, and <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msub><mml:mi>C</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> the running co-moment from which the covariance is obtained.</p>
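<p>The recurrences above map directly onto a single streaming loop. The following is a minimal pure-Python sketch of the one-pass statistics (running means, sums of squared deviations, and co-moment), mirroring the mixed-mean pattern of (A9) and (A11); the class name and interface are illustrative and not taken from the FastDOV implementation:</p>

```python
import math

class RunningStats:
    """One-pass running mean, squared-deviation sum, and co-moment
    for a stream of sample pairs (u_k, v_k), following (A8)-(A12)."""

    def __init__(self):
        self.k = 0
        self.mean_u = 0.0   # M_{k,u}
        self.mean_v = 0.0   # M_{k,v}
        self.s_u = 0.0      # S_{k,u}: sum of squared deviations of u
        self.s_v = 0.0      # S_{k,v}: sum of squared deviations of v
        self.c = 0.0        # C_k: co-moment of (u, v)

    def update(self, u, v):
        self.k += 1
        du = u - self.mean_u            # u_k minus M_{k-1,u}
        dv = v - self.mean_v            # v_k minus M_{k-1,v}
        self.mean_u += du / self.k      # new mean M_{k,u}
        self.mean_v += dv / self.k      # new mean M_{k,v}
        self.s_u += du * (u - self.mean_u)  # (A9): one old-mean, one new-mean factor
        self.s_v += dv * (v - self.mean_v)  # (A11)
        self.c += du * (v - self.mean_v)    # co-moment update, same mixed pattern

    def covariance(self):
        # sample covariance; valid once k is at least 2
        return self.c / (self.k - 1)
```

The mixed old-mean/new-mean factors are what make the updates numerically stable and exact in one pass.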
</sec>
<sec id="s10">
<title>Appendix A.3 Chirp Signal Mathematical Analysis</title>
<p>The linear chirp signal used in FastDOV&#x2019;s modulation scheme admits a more detailed mathematical formulation. The instantaneous phase <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> of a linear chirp is:
<disp-formula id="eqn-A13"><label>(A13)</label><mml:math id="mml-eqn-A13" display="block"><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>c</mml:mi></mml:math></inline-formula> is the chirp rate, <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the starting frequency, and <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the initial phase. The instantaneous frequency <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is the derivative of the phase:
<disp-formula id="eqn-A14"><label>(A14)</label><mml:math id="mml-eqn-A14" display="block"><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi></mml:mrow></mml:mfrac><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>&#x03D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></disp-formula></p>
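<p>Because the phase in Eq. (A13) is quadratic in t, the derivative relation of Eq. (A14) can be verified numerically with a central finite difference. The sketch below does this in pure Python; the chirp rate and starting frequency are illustrative values, not parameters taken from the paper:</p>

```python
import math

def chirp_phase(t, c, f0, phi0=0.0):
    """Instantaneous phase of a linear chirp, Eq. (A13)."""
    return 2.0 * math.pi * (0.5 * c * t * t + f0 * t) + phi0

def instantaneous_frequency(t, c, f0, dt=1e-6):
    """Approximate f(t) = (1 / 2pi) dphi/dt by a central difference.
    Exact up to rounding here, since the phase is quadratic in t."""
    dphi = chirp_phase(t + dt, c, f0) - chirp_phase(t - dt, c, f0)
    return dphi / (2.0 * math.pi * 2.0 * dt)

# Illustrative example: sweep starting at f0 = 1000 Hz, rate c = 4000 Hz/s.
c, f0 = 4000.0, 1000.0
for t in (0.0, 0.05, 0.1):
    # Closed form of Eq. (A14): f(t) = c*t + f0
    assert math.isclose(instantaneous_frequency(t, c, f0), c * t + f0, rel_tol=1e-6)
```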
<p>For the up-chirp and down-chirp modulation, the frequency ranges are defined as:
<disp-formula id="eqn-A15"><label>(A15)</label><mml:math id="mml-eqn-A15" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>Up-chirp:&#xA0;</mml:mtext></mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="eqn-A16"><label>(A16)</label><mml:math id="mml-eqn-A16" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mtext>Down-chirp:&#xA0;</mml:mtext></mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mi>T</mml:mi></mml:mfrac><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mspace width="1em" /><mml:mi>t</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>T</italic> is the symbol duration, <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> is the starting frequency, and <inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:msub><mml:mi>f</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> is the ending frequency.</p>
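<p>A quick sanity check of Eqs. (A15) and (A16) is that the two sweeps are mirror images of each other, so their instantaneous frequencies sum to the constant f0 + f1 at every instant. The following pure-Python sketch evaluates both trajectories; the voice-band parameter values are illustrative, not taken from the paper:</p>

```python
import math

def f_up(t, f0, f1, T):
    """Up-chirp instantaneous frequency, Eq. (A15)."""
    return f0 + (f1 - f0) * t / T

def f_down(t, f0, f1, T):
    """Down-chirp instantaneous frequency, Eq. (A16)."""
    return f1 - (f1 - f0) * t / T

# Illustrative voice-band sweep: 500 Hz to 3400 Hz over a 10 ms symbol.
f0, f1, T = 500.0, 3400.0, 0.01
for i in range(11):
    t = T * i / 10.0
    # Mirror property: the two sweeps always sum to f0 + f1.
    assert math.isclose(f_up(t, f0, f1, T) + f_down(t, f0, f1, T), f0 + f1)
# Endpoints: the up-chirp runs f0 to f1, the down-chirp the reverse.
assert math.isclose(f_up(0.0, f0, f1, T), f0)
assert math.isclose(f_up(T, f0, f1, T), f1)
assert math.isclose(f_down(0.0, f0, f1, T), f1)
assert math.isclose(f_down(T, f0, f1, T), f0)
```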
<p>The time-domain representation considering both frequency bands becomes:
<disp-formula id="eqn-A17"><label>(A17)</label><mml:math id="mml-eqn-A17" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>sin</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:mi>w</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <italic>A</italic> is the amplitude and <inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:mi>w</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is a window function (typically rectangular) that ensures the signal is limited to the symbol duration.</p>
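<p>Sampling Eq. (A17) at a fixed rate yields the transmitted symbol waveform; with a rectangular window, this simply means generating samples only on [0, T]. The following is a minimal pure-Python synthesis sketch, using an illustrative 8 kHz sampling rate rather than the configuration used by FastDOV:</p>

```python
import math

def chirp_samples(f0, f1, T, fs, amplitude=1.0, phi0=0.0):
    """Sample the windowed chirp x(t) of Eq. (A17) over one symbol [0, T].
    The rectangular window w(t) is implicit: samples exist only on [0, T].
    A down-chirp is obtained by swapping f0 and f1 (negative chirp rate)."""
    c = (f1 - f0) / T                  # chirp rate
    n = int(round(T * fs))             # samples per symbol
    out = []
    for i in range(n):
        t = i / fs
        phase = 2.0 * math.pi * (0.5 * c * t * t + f0 * t) + phi0
        out.append(amplitude * math.sin(phase))
    return out

# Up-chirp and down-chirp symbols over a 10 ms duration (illustrative values):
fs, T = 8000.0, 0.01
up = chirp_samples(500.0, 3400.0, T, fs)
down = chirp_samples(3400.0, 500.0, T, fs)
```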
<p>The Fourier transform of a linear chirp can be expressed in terms of Fresnel integrals:
<disp-formula id="eqn-A18"><label>(A18)</label><mml:math id="mml-eqn-A18" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mi>X</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mi>A</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:msqrt><mml:mfrac><mml:mi>&#x03C0;</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>c</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:msqrt><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>j</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mi>&#x03C0;</mml:mi><mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mi>c</mml:mi></mml:mfrac><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mi>&#x03C0;</mml:mi><mml:mn>4</mml:mn></mml:mfrac><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mrow><mml:mo>[</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>j</mml:mi><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo 
stretchy="false">)</mml:mo></mml:math></inline-formula> are the Fresnel cosine and sine integrals, and <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msqrt><mml:mfrac><mml:mn>2</mml:mn><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>c</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:msqrt><mml:mo stretchy="false">(</mml:mo><mml:mi>f</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>.</p>
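<p>The Fresnel integrals in Eq. (A18) have no elementary closed form but are straightforward to evaluate numerically. The sketch below computes C(x) and S(x) in the standard convention by composite trapezoidal quadrature; it is a pure-Python illustration only, and in practice a library routine such as scipy.special.fresnel would be used instead:</p>

```python
import math

def fresnel(x, n=10000):
    """Fresnel integrals in the standard convention:
    C(x) = integral of cos(pi u^2 / 2) du from 0 to x,
    S(x) = integral of sin(pi u^2 / 2) du from 0 to x,
    evaluated with the composite trapezoidal rule on n subintervals."""
    h = x / n
    # Endpoint terms: cos(0) = 1 and sin(0) = 0 at u = 0.
    c_sum = 0.5 * (1.0 + math.cos(0.5 * math.pi * x * x))
    s_sum = 0.5 * (0.0 + math.sin(0.5 * math.pi * x * x))
    for i in range(1, n):
        u = i * h
        arg = 0.5 * math.pi * u * u
        c_sum += math.cos(arg)
        s_sum += math.sin(arg)
    return c_sum * h, s_sum * h
```

Evaluating these at the scaled frequency offset x of Eq. (A18) gives the complex envelope factor C(x) + jS(x) of the chirp spectrum.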
</sec>
</app>
</app-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Gamage</surname> <given-names>P</given-names></string-name>, <string-name><surname>Dissanayake</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kumarasinghe</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ganegoda</surname> <given-names>GU</given-names></string-name></person-group>. <article-title>Acoustic signature analysis for distinguishing human vs. synthetic voices in vishing attacks</article-title>. In: <conf-name>2023 8th International Conference on Information Technology Research (ICITR)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2023</year>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Yal&#x00E7;&#x0131;n</surname> <given-names>N</given-names></string-name>, <string-name><surname>Lale</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Types of cyber-attacks with using voice</article-title>. <source>J Sci Rep A</source>. <year>2025</year>;<volume>61</volume>:<fpage>137</fpage>&#x2013;<lpage>65</lpage>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Mustafa</surname> <given-names>HA</given-names></string-name></person-group>. <article-title>Secure and reliable wireless communication through end-to-end-based solution [dissertation]</article-title>. <publisher-loc>Columbia, SC, USA</publisher-loc>: <publisher-name>University of South Carolina</publisher-name>; <year>2014</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Bhuiyan</surname> <given-names>MSI</given-names></string-name>, <string-name><surname>Razzak</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ferdous</surname> <given-names>MS</given-names></string-name>, <string-name><surname>Chowdhury</surname> <given-names>MJM</given-names></string-name>, <string-name><surname>Hoque</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Tarkoma</surname> <given-names>S</given-names></string-name></person-group>. <article-title>BONIK: a blockchain empowered chatbot for financial transactions</article-title>. In: <conf-name>2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2020</year>. p. <fpage>1079</fpage>&#x2013;<lpage>88</lpage>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tjon</surname> <given-names>E</given-names></string-name>, <string-name><surname>Moh</surname> <given-names>M</given-names></string-name>, <string-name><surname>Moh</surname> <given-names>TS</given-names></string-name></person-group>. <article-title>Eff-ynet: a dual task network for deepfake detection and segmentation</article-title>. In: <conf-name>2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>1</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Fake Caller ID</collab></person-group>. <article-title>Fake caller ID: caller ID faker &#x0026; spoof caller [Internet]</article-title>. <comment>[cited 2024 Sep 28]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://fakecallerid.io/">https://fakecallerid.io/</ext-link>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>myphonerobot</collab></person-group>. <article-title>Spoof call and caller ID faker [Internet]</article-title>. <comment>[cited 2024 Sep 28]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://myphonerobot.com/">https://myphonerobot.com/</ext-link>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Putra</surname> <given-names>FPE</given-names></string-name>, <string-name><surname>Ubaidi</surname> <given-names>U</given-names></string-name>, <string-name><surname>Zulfikri</surname> <given-names>A</given-names></string-name>, <string-name><surname>Arifin</surname> <given-names>G</given-names></string-name>, <string-name><surname>Ilhamsyah</surname> <given-names>RM</given-names></string-name></person-group>. <article-title>Analysis of phishing attack trends, impacts and prevention methods: literature study</article-title>. <source>Brill Res Artif Intell</source>. <year>2024</year>;<volume>4</volume>(<issue>1</issue>):<fpage>413</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.47709/brilliance.v4i1.4357</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>GSMA</collab></person-group>. <article-title>Over half world&#x2019;s population now using mobile internet [Internet]</article-title>. <year>2021 [cited 2024 Sep 28]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://www.gsma.com/newsroom/press-release/over-half-worlds-population-now-using-mobile-internet/">https://www.gsma.com/newsroom/press-release/over-half-worlds-population-now-using-mobile-internet/</ext-link>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dhananjay</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>A</given-names></string-name>, <string-name><surname>Paik</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kuppusamy</surname> <given-names>TK</given-names></string-name>, <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Hermes: data transmission over unknown voice channels</article-title>. In: <conf-name>Proceedings of the Sixteenth Annual International Conference on Mobile Computing and Networking</conf-name>. <publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>ACM</publisher-name>; <year>2010</year>. p. <fpage>113</fpage>&#x2013;<lpage>24</lpage>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Reaves</surname> <given-names>B</given-names></string-name>, <string-name><surname>Blue</surname> <given-names>L</given-names></string-name>, <string-name><surname>Traynor</surname> <given-names>P</given-names></string-name></person-group>. <article-title>AuthLoop: end-to-end cryptographic authentication for telephony over voice channels</article-title>. In: <conf-name>25th USENIX Security Symposium (USENIX Security 16)</conf-name>. <publisher-loc>Berkeley, CA, USA</publisher-loc>: <publisher-name>USENIX Association</publisher-name>; <year>2016</year>. p. <fpage>963</fpage>&#x2013;<lpage>78</lpage>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chmayssani</surname> <given-names>T</given-names></string-name>, <string-name><surname>Baudoin</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Data transmission over voice dedicated channels using digital modulations</article-title>. In: <conf-name>2008 18th International Conference Radioelektronika</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2008</year>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Boloursaz</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hadavi</surname> <given-names>AH</given-names></string-name>, <string-name><surname>Kazemi</surname> <given-names>R</given-names></string-name>, <string-name><surname>Behnia</surname> <given-names>F</given-names></string-name></person-group>. <article-title>Secure data communication through GSM adaptive multi rate voice channel</article-title>. In: <conf-name>6th International Symposium on Telecommunications (IST)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2012</year>. p. <fpage>1021</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wolkotte</surname> <given-names>PT</given-names></string-name>, <string-name><surname>Smit</surname> <given-names>GJM</given-names></string-name>, <string-name><surname>Rauwerda</surname> <given-names>GK</given-names></string-name>, <string-name><surname>Smit</surname> <given-names>LT</given-names></string-name></person-group>. <article-title>An energy-efficient reconfigurable circuit-switched network-on-chip</article-title>. In: <conf-name>19th IEEE International Parallel and Distributed Processing Symposium; 2005 Apr 4&#x2013;8; Denver, CO, USA</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2005</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Cai</surname> <given-names>J</given-names></string-name>, <string-name><surname>Goodman</surname> <given-names>DJ</given-names></string-name></person-group>. <article-title>General packet radio service in GSM</article-title>. <source>IEEE Commun Mag</source>. <year>1997</year>;<volume>35</volume>(<issue>10</issue>):<fpage>122</fpage>&#x2013;<lpage>31</lpage>. doi:<pub-id pub-id-type="doi">10.1109/35.623996</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Sesia</surname> <given-names>S</given-names></string-name>, <string-name><surname>Toufik</surname> <given-names>I</given-names></string-name>, <string-name><surname>Baker</surname> <given-names>M</given-names></string-name></person-group>. <source>LTE-the UMTS long term evolution: from theory to practice</source>. <publisher-loc>Hoboken, NJ, USA</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons</publisher-name>; <year>2011</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>C</given-names></string-name>, <string-name><surname>Hutchins</surname> <given-names>DA</given-names></string-name>, <string-name><surname>Green</surname> <given-names>RJ</given-names></string-name></person-group>. <article-title>Short-range ultrasonic digital communications in air</article-title>. <source>IEEE Trans Ultrason Ferroelectr Freq Control</source>. <year>2008</year>;<volume>55</volume>(<issue>4</issue>):<fpage>908</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tuffc.2008.726</pub-id>; <pub-id pub-id-type="pmid">18467236</pub-id></mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Jiang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wright</surname> <given-names>WMD</given-names></string-name></person-group>. <article-title>Progress in airborne ultrasonic data communications for indoor applications</article-title>. In: <conf-name>2016 IEEE 14th International Conference on Industrial Informatics (INDIN)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>322</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Novak</surname> <given-names>E</given-names></string-name>, <string-name><surname>Tang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Ultrasound proximity networking on smart mobile devices for IoT applications</article-title>. <source>IEEE Internet Things J</source>. <year>2018</year>;<volume>6</volume>(<issue>1</issue>):<fpage>399</fpage>&#x2013;<lpage>409</lpage>. doi:<pub-id pub-id-type="doi">10.1109/jiot.2018.2848099</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mun</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>A data-over-sound application: attendance book</article-title>. In: <conf-name>2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>No more free riders: sharing wifi secrets with acoustic signals</article-title>. In: <conf-name>2019 28th International Conference on Computer Communication and Networks (ICCCN)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2019</year>. p. <fpage>1</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Krasnowski</surname> <given-names>P</given-names></string-name>, <string-name><surname>Lebrun</surname> <given-names>J</given-names></string-name>, <string-name><surname>Martin</surname> <given-names>B</given-names></string-name></person-group>. <article-title>Introducing a novel data over voice technique for secure voice communication</article-title>. <source>Wirel Pers Commun</source>. <year>2022</year>;<volume>124</volume>(<issue>4</issue>):<fpage>3077</fpage>&#x2013;<lpage>103</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11277-022-09503-6</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Katugampala</surname> <given-names>NN</given-names></string-name>, <string-name><surname>Al-Naimi</surname> <given-names>KT</given-names></string-name>, <string-name><surname>Villette</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kondoz</surname> <given-names>AM</given-names></string-name></person-group>. <article-title>Real-time end-to-end secure voice communications over GSM voice channel</article-title>. In: <conf-name>2005 13th European Signal Processing Conference</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2005</year>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rashidi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sayadiyan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mowlaee</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Data mapping onto speech-like signal to transmission over the GSM voice channel</article-title>. In: <conf-name>2008 40th Southeastern Symposium on System Theory (SSST)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2008</year>. p. <fpage>54</fpage>&#x2013;<lpage>8</lpage>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ladue</surname> <given-names>CK</given-names></string-name>, <string-name><surname>Sapozhnykov</surname> <given-names>VV</given-names></string-name>, <string-name><surname>Fienberg</surname> <given-names>KS</given-names></string-name></person-group>. <article-title>A data modem for GSM voice channel</article-title>. <source>IEEE Trans Veh Technol</source>. <year>2008</year>;<volume>57</volume>(<issue>4</issue>):<fpage>2205</fpage>&#x2013;<lpage>18</lpage>. doi:<pub-id pub-id-type="doi">10.1109/tvt.2007.912322</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Mezgec</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Chowdhury</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kotnik</surname> <given-names>B</given-names></string-name>, <string-name><surname>Svecko</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Implementation of PCCD-OFDM-ASK robust data transmission over GSM speech channel</article-title>. <source>Informatica</source>. <year>2009</year>;<volume>20</volume>(<issue>1</issue>):<fpage>51</fpage>&#x2013;<lpage>78</lpage>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abro</surname> <given-names>FI</given-names></string-name>, <string-name><surname>Rauf</surname> <given-names>F</given-names></string-name>, <string-name><surname>Chowdhry</surname> <given-names>BS</given-names></string-name>, <string-name><surname>Rajarajan</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Towards security of GSM voice communication</article-title>. <source>Wirel Pers Commun</source>. <year>2019</year>;<volume>108</volume>(<issue>3</issue>):<fpage>1933</fpage>&#x2013;<lpage>55</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11277-019-06502-y</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>B</given-names></string-name></person-group>. <article-title>A universal data transfer technique over voice channels of cellular mobile communication networks</article-title>. <source>IET Commun</source>. <year>2021</year>;<volume>15</volume>(<issue>1</issue>):<fpage>22</fpage>&#x2013;<lpage>32</lpage>. doi:<pub-id pub-id-type="doi">10.1049/cmu2.12047</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Biancucci</surname> <given-names>G</given-names></string-name>, <string-name><surname>Claudi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Dragoni</surname> <given-names>AF</given-names></string-name></person-group>. <article-title>Secure data and voice transmission over GSM voice channel: applications for secure communications</article-title>. In: <conf-name>2013 4th International Conference on Intelligent Systems, Modelling and Simulation</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2013</year>. p. <fpage>230</fpage>&#x2013;<lpage>3</lpage>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Ntantogian</surname> <given-names>C</given-names></string-name>, <string-name><surname>Veroni</surname> <given-names>E</given-names></string-name>, <string-name><surname>Karopoulos</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xenakis</surname> <given-names>C</given-names></string-name></person-group>. <article-title>A survey of voice and communication protection solutions against wiretapping</article-title>. <source>Comput Electr Eng</source>. <year>2019</year>;<volume>77</volume>(<issue>4</issue>):<fpage>163</fpage>&#x2013;<lpage>78</lpage>. doi:<pub-id pub-id-type="doi">10.1016/j.compeleceng.2019.05.008</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Benny</surname> <given-names>A</given-names></string-name>, <string-name><surname>Saji</surname> <given-names>AM</given-names></string-name>, <string-name><surname>Joseph</surname> <given-names>CJ</given-names></string-name>, <string-name><surname>Christina</surname> <given-names>PB</given-names></string-name>, <string-name><surname>Antony</surname> <given-names>MA</given-names></string-name></person-group>. <article-title>Real-time voice phishing detection using BERT</article-title>. In: <conf-name>International Conference on Artificial Intelligence and Smart Energy</conf-name>. <publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>; <year>2025</year>. p. <fpage>410</fpage>&#x2013;<lpage>26</lpage>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Sim</surname> <given-names>JY</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>SH</given-names></string-name></person-group>. <article-title>Detecting voice phishing with precision: fine-tuning small language models</article-title>. <comment>arXiv:2506.06180. 2025</comment>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><surname>Peterson</surname> <given-names>WW</given-names></string-name>, <string-name><surname>Weldon</surname> <given-names>EJ</given-names></string-name></person-group>. <source>Error-correcting codes</source>. <publisher-loc>Cambridge, MA, USA</publisher-loc>: <publisher-name>MIT Press</publisher-name>; <year>1972</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Shabani</surname> <given-names>S</given-names></string-name>, <string-name><surname>Norouzi</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Speech recognition using principal components analysis and neural networks</article-title>. In: <conf-name>2016 IEEE 8th International Conference on Intelligent Systems (IS)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2016</year>. p. <fpage>90</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Warohma</surname> <given-names>AM</given-names></string-name>, <string-name><surname>Hindersah</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lestari</surname> <given-names>DP</given-names></string-name></person-group>. <article-title>Speaker recognition using MobileNetV3 for voice-based robot navigation</article-title>. In: <conf-name>2024 11th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA)</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2024</year>. p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Kavyashree</surname> <given-names>PSP</given-names></string-name>, <string-name><surname>El-Sharkawy</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Compressed MobileNet V3: a light weight variant for resource-constrained platforms</article-title>. In: <conf-name>2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC); 2021 Jan 27&#x2013;30; Virtual</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2021</year>. p. <fpage>0104</fpage>&#x2013;<lpage>7</lpage>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>SourceForge</collab></person-group>. <article-title>HFP for Linux [Internet]</article-title>. <comment>[cited 2024 Sep 28]</comment>. Available from: <ext-link ext-link-type="uri" xlink:href="https://nohands.sourceforge.net/">https://nohands.sourceforge.net/</ext-link>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>Stanford Vision Lab, Stanford University, Princeton University</collab></person-group>. <article-title>ImageNet large scale visual recognition challenge 2012 (ILSVRC2012) [Internet]</article-title>. <year>2020 [cited 2024 Sep 28]</year>. Available from: <ext-link ext-link-type="uri" xlink:href="https://www.image-net.org/challenges/LSVRC/2012/index.php">https://www.image-net.org/challenges/LSVRC/2012/index.php</ext-link>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhang</surname> <given-names>W</given-names></string-name>, <string-name><surname>Han</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>M</given-names></string-name>, <string-name><surname>Tian</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>Towards faster end-to-end data transmission over voice channels</article-title>. In: <conf-name>ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2024</year>. p. <fpage>9061</fpage>&#x2013;<lpage>5</lpage>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Xu</surname> <given-names>Z</given-names></string-name></person-group>. <article-title>A data transmission method based on pulse modulation over GSM voice channel</article-title>. <source>Telecommun Eng</source>. <year>2018</year>;<volume>58</volume>(<issue>2</issue>):<fpage>152</fpage>&#x2013;<lpage>6</lpage>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Rashidi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sayadiyan</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mowlaee</surname> <given-names>P</given-names></string-name></person-group>. <article-title>A harmonic approach to data transmission over GSM voice channel</article-title>. In: <conf-name>2008 3rd International Conference on Information and Communication Technologies: from Theory to Applications</conf-name>. <publisher-loc>Piscataway, NJ, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>; <year>2008</year>. p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>