<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">33910</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.033910</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Accelerating Falcon Post-Quantum Digital Signature Algorithm on Graphic Processing Units</article-title>
<alt-title alt-title-type="left-running-head">Accelerating Falcon Post-Quantum Digital Signature Algorithm on Graphic Processing Units</alt-title>
<alt-title alt-title-type="right-running-head">Accelerating Falcon Post-Quantum Digital Signature Algorithm on Graphic Processing Units</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Seo</surname><given-names>Seog Chung</given-names>
</name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>An</surname><given-names>Sang Woo</given-names>
</name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Choi</surname><given-names>Dooho</given-names>
</name><xref ref-type="aff" rid="aff-3">3</xref><email>doohochoi@korea.ac.kr</email></contrib>
<aff id="aff-1"><label>1</label><institution>Kookmin University</institution>, <addr-line>Seoul, 02707</addr-line>, <country>Korea</country></aff>
<aff id="aff-2"><label>2</label><institution>Telecommunications Technology Association (TTA)</institution>, <addr-line>Gyeonggi-do, 13591</addr-line>, <country>Korea</country></aff>
<aff id="aff-3"><label>3</label><institution>Korea University</institution>, <addr-line>Sejong, 30019</addr-line>, <country>Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Dooho Choi. Email: <email>doohochoi@korea.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>24</day><month>1</month><year>2023</year></pub-date>
<volume>75</volume>
<issue>1</issue>
<fpage>1963</fpage>
<lpage>1980</lpage>
<history>
<date date-type="received"><day>01</day><month>7</month><year>2022</year></date>
<date date-type="accepted"><day>09</day><month>11</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Seo, An and Choi</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Seo, An and Choi</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_33910.pdf"></self-uri>
<abstract><p>Since 2016, the National Institute of Standards and Technology (NIST) has been running a competition to standardize post-quantum cryptography (PQC). Although Falcon has been selected in the competition as one of the standard PQC algorithms because of its short key and signature sizes, its performance overhead is larger than that of other lattice-based cryptosystems. This study presents multiple methodologies to accelerate the performance of Falcon using graphics processing units (GPUs) for server-side use. Direct GPU porting significantly degrades performance because the Falcon reference code relies on recursive functions in its sampling process. Thus, an iterative sampling approach for efficient parallel processing is presented. In this study, the Falcon software applied a fine-grained execution model, and the optimal number of threads in a thread block was identified. Moreover, polynomial multiplication performance was optimized by parallelizing both the number-theoretic transform (NTT)-based and the fast Fourier transform (FFT)-based polynomial multiplications. Furthermore, dummy-based parallel execution methods are introduced to handle thread divergence effects. The presented Falcon software on an NVIDIA RTX 3090 GPU, based on the proposed methods, achieves 35.14, 28.84, and 34.64 times (Falcon-512) and 33.31, 27.45, and 34.40 times (Falcon-1024) higher throughput in key generation, signing, and verification, respectively, than the central processing unit (CPU) reference implementation using Advanced Vector Extensions 2 (AVX2) instructions on a Ryzen 9 5900X running at 3.7 GHz. Therefore, the proposed Falcon software can be used in servers managing multiple concurrent clients for efficient certificate verification and as an outsourced key generation and signature generation server for Signature as a Service (SaS).</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>DSA</kwd>
<kwd>Falcon</kwd>
<kwd>GPU</kwd>
<kwd>CUDA</kwd>
<kwd>software optimization</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label><title>Introduction</title>
<p>Shor&#x2019;s algorithm [<xref ref-type="bibr" rid="ref-1">1</xref>] running on a quantum computer can break the current public key cryptosystems, including Rivest&#x2013;Shamir&#x2013;Adleman (RSA), the digital signature algorithm (DSA), and elliptic curve Diffie&#x2013;Hellman (ECDH). Since 2016, the National Institute of Standards and Technology (NIST) has been organizing a competition to standardize post-quantum cryptography (PQC) to provide reasonable security in the era of quantum computing [<xref ref-type="bibr" rid="ref-2">2</xref>]. The Round 3 finalists comprised four algorithms (Classic McEliece [<xref ref-type="bibr" rid="ref-3">3</xref>], CRYSTALS-Kyber [<xref ref-type="bibr" rid="ref-4">4</xref>], NTRU [<xref ref-type="bibr" rid="ref-5">5</xref>], and Saber [<xref ref-type="bibr" rid="ref-6">6</xref>]) in the key encapsulation mechanism (KEM) category and three algorithms (CRYSTALS-Dilithium [<xref ref-type="bibr" rid="ref-7">7</xref>], Falcon [<xref ref-type="bibr" rid="ref-8">8</xref>], and Rainbow [<xref ref-type="bibr" rid="ref-9">9</xref>]) in the digital signature algorithm (DSA) category. Among the Round 3 DSA algorithms, Falcon has the advantages of the shortest key length and the fastest signature verification speed. Thus, Falcon can be seamlessly integrated into current security protocols (e.g., transport layer security (TLS) and domain name system security (DNSSEC)) and applications. Consequently, Falcon has recently been selected as one of the standard algorithms in the NIST competition. The advent of Internet of Things (IoT) and cloud environments has significantly increased the number of clients that servers must handle. Therefore, servers bear the burden of processing high-volume cryptographic operations or cryptographic protocol executions for secure communication with clients. For example, servers must concurrently confirm the authenticity of certificates from clients; in particular situations, they must generate multiple key pairs and sign messages for Signature as a Service (SaS) [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>]. Graphics processing units (GPUs) can be used as cryptographic accelerators. Many studies [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-12">12</xref>] have demonstrated that cryptographic software optimized for GPUs can achieve an impressive throughput enhancement compared with conventional software operating on the central processing unit (CPU). Several studies [<xref ref-type="bibr" rid="ref-13">13</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>] have also been conducted on improving PQC performance using GPUs.</p>
<p>This study presents the first Falcon software optimized for an NVIDIA GPU. Although the Falcon team [<xref ref-type="bibr" rid="ref-8">8</xref>] released source codes for CPU and embedded environments, it did not provide software for the GPU environment. Furthermore, the Falcon source code relies heavily on recursive functions, which are inefficient on the GPU. Thus, in this study, the gap is filled by developing efficient GPU Falcon software. When the Falcon source code written for CPU execution is converted into efficient GPU software, multiple limitations, such as warp divergence caused by branch instructions and the heavy use of recursive functions, must be resolved. First, a fine-grained execution model is applied to implement the introduced Falcon software, and the optimal number of threads in a thread block is identified. Second, dummy-based polynomial operation methods that alleviate the warp divergence effect and an iterative version of Falcon sampling are proposed. Furthermore, the introduced Falcon software is optimized by exploiting the GPU&#x2019;s fast on-chip memory (registers, shared memory, and constant memory) and by implementing polynomial multiplications using the number theoretic transform (NTT)-based and fast Fourier transform (FFT)-based methods.</p>
<p>The contributions of this study can be summarized as follows:
<list list-type="simple">
<list-item>
<p>&#x2022; This is the first study on Falcon implementation in a GPU environment</p></list-item>
</list></p>
<p>This study is the first to present GPU Falcon software, which was developed with a fine-grained execution model where <italic>n</italic> threads (<italic>n</italic> &#x003D; 32 is selected for optimal performance) cooperate to compute a Falcon operation: <italic>Keygen</italic> for generating a pair of public and private keys, <italic>Sign</italic> for generating a signature, and <italic>Verify</italic> for signature verification. Furthermore, the introduced Falcon software on an NVIDIA RTX 3090 GPU can execute 256 concurrent Falcon operations. Its throughput with the Falcon-512 and Falcon-1024 parameters is 35.14, 28.84, and 34.64 times and 33.31, 27.45, and 34.40 times higher, respectively, than that of the CPU reference implementation using the AVX2 instructions on a Ryzen 9 5900X CPU running at 3.7 GHz for <italic>Keygen</italic>, <italic>Sign</italic>, and <italic>Verify</italic>.
<list list-type="simple">
<list-item>
<p>&#x2022; Additional optimization methods are proposed</p></list-item>
</list></p>
<p>This study introduces a dummy-based parallel execution method to alleviate the divergence effect of branch instructions, as well as an effective, economical approach to convert the recursive version of <italic>ffSampling</italic> into an iterative one, because recursive function execution on the GPU is inefficient. Furthermore, the polynomial multiplication operations in the integer and complex number domains were optimized using the NTT-based and FFT-based methods, respectively; both methods were parallelized with the fine-grained execution model.</p>
<p>The remainder of this study is structured as follows: Section 2 reviews the literature on GPU-based optimization of cryptographic algorithms and introduces research trends for Falcon; Section&#x00A0;3 provides a brief description of Falcon and GPUs; Section 4 introduces implementation methods for running Falcon on the GPU and optimization methods to improve performance; Section&#x00A0;5 evaluates the performance results of the implementation; and Section 6 concludes the study.</p>
</sec>
<sec id="s2">
<label>2</label><title>Related Work</title>
<p>Since 2016, NIST has organized a contest for standardizing PQC algorithms in response to the demand for PQC. In July 2020, the third round of the project started; <xref ref-type="table" rid="table-1">Table 1</xref> shows the competition algorithms of that round. The candidate algorithms were classified into public key encryption (PKE)/KEM and DSA. Information on the final candidate algorithms is available from PQClean [<xref ref-type="bibr" rid="ref-19">19</xref>]. In June 2022, four algorithms were selected as the final standard algorithms: Crystals-Kyber for KEM, and Crystals-Dilithium, Falcon, and SPHINCS&#x002B; for DSA.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption><title>Round 3 NIST PQC standardization final candidate algorithms and their corresponding bases</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Algorithm</th>
<th>Base</th>
</tr>
</thead>
<tbody>
<tr>
<td>PKE/KEM</td>
<td>Classic McEliece [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>Code</td>
</tr>
<tr>
<td/>
<td>Crystals-Kyber [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>LWE</td>
</tr>
<tr>
<td/>
<td>NTRU [<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td>NTRU</td>
</tr>
<tr>
<td/>
<td>Saber [<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td>LWR</td>
</tr>
<tr>
<td>DSA</td>
<td>Crystals-Dilithium [<xref ref-type="bibr" rid="ref-7">7</xref>]</td>
<td>LWE</td>
</tr>
<tr>
<td/>
<td>Falcon [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>NTRU</td>
</tr>
<tr>
<td/>
<td>Rainbow [<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td>Multivariate</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>There have been multiple studies on PQC implementation in GPU environments [<xref ref-type="bibr" rid="ref-13">13</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>]. Gupta&#x00A0;et&#x00A0;al.&#x00A0;(2020) [<xref ref-type="bibr" rid="ref-13">13</xref>] proposed techniques that allow PQC-based KEM algorithms such as FrodoKEM, NewHope, and CRYSTALS-Kyber to run fast on a GPU. For NewHope, Gao&#x00A0;et&#x00A0;al.&#x00A0;(2021) [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a computational structure that maximizes GPU computational efficiency by improving its implementation. Furthermore, Seong&#x00A0;et&#x00A0;al.&#x00A0;(2021) [<xref ref-type="bibr" rid="ref-15">15</xref>] introduced a parallel operation structure for a server to efficiently process the key exchange protocol in a multi-client environment via the NTRU algorithm. Moreover, PQC-based KEM algorithms such as Saber, SIKE, and NTRU have been examined on GPUs [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>]. Although several studies have implemented lattice-based PQC in GPU environments, they focused only on optimizing polynomial multiplication, such as parallelizing NTT-based polynomial multiplication [<xref ref-type="bibr" rid="ref-13">13</xref>,<xref ref-type="bibr" rid="ref-14">14</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>]. In addition to optimizing polynomial multiplication, this study focuses on minimizing divergence effects and converting the recursive sampling into an iterative version.</p>
<p>For PQC-based DSA, the final candidates were CRYSTALS-Dilithium [<xref ref-type="bibr" rid="ref-20">20</xref>], Falcon, and Rainbow [<xref ref-type="bibr" rid="ref-21">21</xref>]. Dilithium and Falcon were selected as the final standard algorithms. Both CRYSTALS-Dilithium and Falcon are lattice-based cryptographic algorithms, and polynomial operations constitute their primary computational workload [<xref ref-type="bibr" rid="ref-22">22</xref>]. To date, the primary concern of GPU lattice-based PQC implementations has been optimizing polynomial multiplication by parallelizing the NTT-based method [<xref ref-type="bibr" rid="ref-23">23</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>]. However, the Falcon reference codes cannot be directly converted into efficient GPU software because of their heavy use of recursive functions and branch instructions. To our knowledge, this is the first implementation of Falcon in a GPU environment.</p>
</sec>
<sec id="s3">
<label>3</label><title>Backgrounds</title>
<sec id="s3_1">
<label>3.1</label><title>Falcon Overview</title>
<p>Falcon [<xref ref-type="bibr" rid="ref-25">25</xref>] is a post-quantum DSA algorithm based on the NTRU lattice problem. Falcon operates in the field <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mi mathvariant="normal">Q</mml:mi></mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:mi>&#x03D5;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and is divided into Falcon-512 and Falcon-1024 depending on whether n &#x003D; 512 or 1024. The notation necessary for the algorithm description is shown in <xref ref-type="table" rid="table-2">Table 2</xref>. For example, Falcon-512 and Falcon-1024 use polynomials of 512 and 1024 terms, respectively. <xref ref-type="table" rid="table-3">Table 3</xref> describes the Falcon-512 and Falcon-1024 parameters. Falcon-512 and Falcon-1024 provide NIST Security Levels 1 and 5, respectively. Falcon comprises three primary functions: <italic>Keygen</italic> generates a pair of public and private keys, <italic>Sign</italic> generates a signature, and <italic>Verify</italic> verifies the signature.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption><title>Notations</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Symbol</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bold uppercase (e.g., <bold>B</bold>)</td>
<td>Matrices</td>
</tr>
<tr>
<td>Bold lowercase (e.g., <bold>v</bold>)</td>
<td>Vector</td>
</tr>
<tr>
<td>Italic lowercase (e.g., <italic>s</italic>)</td>
<td>Polynomial</td>
</tr>
<tr>
<td><bold>B</bold><sup>t</sup></td>
<td>Transpose of Matrix</td>
</tr>
<tr>
<td><inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>&#x03D5;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> (for <italic>n</italic> &#x003D; <italic>2</italic><sup><italic>k</italic></sup>)</td>
<td>Polynomial modulus</td>
</tr>
<tr>
<td>FFT</td>
<td>Fast Fourier Transform</td>
</tr>
<tr>
<td><inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mfrac><mml:mrow><mml:mi mathvariant="double-struck">Z</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:mi mathvariant="double-struck">Z</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mstyle></mml:math></inline-formula> (with q &#x003D; 12289)</td>
<td>Quotient rings</td>
</tr>
</tbody>
</table>
</table-wrap><table-wrap id="table-3">
<label>Table 3</label>
<caption><title>Falcon parameters</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Falcon-512</th>
<th>Falcon-1024</th>
</tr>
</thead>
<tbody>
<tr>
<td>Security level</td>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>Ring degree n</td>
<td>512</td>
<td>1024</td>
</tr>
<tr>
<td>Modulus q</td>
<td>12289</td>
<td>12289</td>
</tr>
<tr>
<td>Max. signature square norm <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td>34034726</td>
<td>70265242</td>
</tr>
<tr>
<td>Public key byte length</td>
<td>897</td>
<td>1793</td>
</tr>
<tr>
<td>Signature byte length</td>
<td>666</td>
<td>1280</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In the <italic>Keygen</italic> step, the private key components <italic>F</italic> and <italic>G</italic>, which satisfy the NTRU equation, are obtained from random polynomials <italic>f</italic> and <italic>g</italic> (refer to Algorithm 1). The <italic>Sign</italic> phase involves hashing the message to a polynomial modulo <italic>&#x03D5;</italic> (refer to Algorithm 2). Next, the signer creates a polynomial-based signature pair (<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>) using (<italic>f, g, F, G</italic>), which is the signer&#x2019;s secret information. The signature value is obtained as <italic>s</italic><sub>2</sub>. In <italic>Verify</italic> (refer to Algorithm 3), <italic>s</italic><sub>1</sub> is calculated using the hashed message and signature <italic>s</italic><sub>2</sub>; moreover, whether the signature is correct is determined based on whether (<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>) is a sufficiently short vector in the lattice.</p>
<p>The <italic>Sign</italic> function generates <italic>s</italic><sub>1</sub> and <italic>s</italic><sub>2</sub> satisfying <italic>s</italic><sub>1</sub> &#x002B; <italic>s</italic><sub>2</sub><italic>h</italic> &#x003D; <italic>c</italic> mod (<italic>&#x03D5;</italic>, <italic>q</italic>) using the message <bold>m</bold>, the random seed <bold>r</bold>, and the private key <bold>sk</bold>. The <italic>ffSampling</italic> function is repeatedly called (refer to Algorithm 4) to calculate <bold>s</bold> meeting the condition. In <italic>Verify</italic>, <italic>s</italic><sub>1</sub> and <italic>s</italic><sub>2</sub> are recalculated, and it is verified that <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msup><mml:mrow><mml:mo>&#x2016;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>s</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext>s</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mo>&#x230A;</mml:mo><mml:msup><mml:mi>&#x03B2;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>&#x230B;</mml:mo></mml:mrow></mml:math></inline-formula> is satisfied. Falcon uses multiple methods to perform efficient polynomial operations in the signature generation and verification processes.</p>
<p><inline-graphic xlink:href="CMC_33910-inline-1.tif"/></p>
<p><inline-graphic xlink:href="CMC_33910-inline-2.tif"/></p>
<p><inline-graphic xlink:href="CMC_33910-inline-3.tif"/></p>
<p><inline-graphic xlink:href="CMC_33910-inline-4.tif"/></p>
<p>An FFT-based discrete Gaussian sampling is used to efficiently generate polynomial matrices. Moreover, FFT-based [<xref ref-type="bibr" rid="ref-26">26</xref>] and NTT-based [<xref ref-type="bibr" rid="ref-27">27</xref>] methods are used for polynomial multiplication in the complex number and integer number domains, respectively. The FFT and NTT are efficient methods that reduce the computational complexity of schoolbook polynomial multiplication from <italic>O</italic>(<italic>n</italic><sup>2</sup>) to <italic>O</italic>(<italic>n</italic>log<italic>n</italic>). In FFT- and NTT-based polynomial multiplication, the two input polynomials are converted into the FFT or NTT domain, where a point-wise multiplication is computed. After the point-wise multiplication, the final result is obtained by applying the inverse FFT or NTT. Unlike the NTT-based method, which operates on integers, the coefficients in the FFT-based method are complex numbers (comprising a real part and an imaginary part) and are thus represented with a floating-point type.</p>
<p>Complex number-based operations are included in the <italic>Keygen</italic> and <italic>Sign</italic> processes. In the Falcon implementation, complex numbers are represented using the IEEE 754 64-bit floating-point representation [<xref ref-type="bibr" rid="ref-28">28</xref>], known as double precision. FFT-related functions operate on double precision. The NTT, in contrast, is implemented on a 16-bit integer representation because its operations are performed modulo <italic>q</italic> over the finite field Z<sub><italic>q</italic></sub>. The modular multiplication over Z<sub><italic>q</italic></sub> is performed using Montgomery multiplication [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>].</p>
<p>In a DSA, a random value generator is used so that different signature values are produced even for the same message. Generally, multiple functions are used to generate random values. However, to extract a value that satisfies a specific range or distribution, a sampling process must be performed. In Falcon, a function called <italic>ffSampling</italic> is used when generating a signature value. The <italic>ffSampling</italic> process is shown in Algorithm 4 [<xref ref-type="bibr" rid="ref-31">31</xref>]. The <italic>splitfft</italic> and <italic>mergefft</italic> functions inside <italic>ffSampling</italic> perform the domain transformation, and <italic>DZ</italic> (<italic>SamplerZ</italic>) accepts only the desired values by rejection sampling. In rejection sampling, each value is subjected to an acceptance&#x2013;rejection evaluation, and only values that satisfy the condition are accepted. Values accepted by <italic>SamplerZ</italic> follow a distribution close to the discrete Gaussian distribution.</p>
<p>In addition to the primary functions, Falcon uses auxiliary ones. In the <italic>Sign</italic> phase, the <italic>HashToPoint</italic> function maps the message hash value to a polynomial, and the <italic>Compress</italic> function compresses the generated signature value. The <italic>Decompress</italic> function, called during the <italic>Verify</italic> process, restores the signature from the output generated by the <italic>Compress</italic> function in the <italic>Sign</italic> phase.</p>
</sec>
<sec id="s3_2">
<label>3.2</label><title>Graphic Processing Units</title>
<p>GPUs are devices originally developed to process graphics operations. Currently, their usage extends to general-purpose applications such as machine learning and the acceleration of cryptographic operations. Although a GPU has far more cores than a CPU, each GPU core is slower than a CPU core. For example, the NVIDIA RTX 3090 GPU has 10,496 computational cores. GPUs are thus suited to parallel computation rather than sequential execution. NVIDIA GPUs contain multiple independent streaming multiprocessors (SMs), each with multiple computational cores. For example, the RTX 3090 has 82 SMs with 128 cores each. Moreover, each SM has an instruction cache, a data cache, and a shared memory space.</p>
<p>Generally, libraries such as the compute unified device architecture (CUDA) [<xref ref-type="bibr" rid="ref-32">32</xref>] or the open computing language (OpenCL) [<xref ref-type="bibr" rid="ref-33">33</xref>] are used for general-purpose computing on graphics processing units (GPGPU). The CUDA library enables GPU parallel programming via the NVCC compiler. In a GPU implementation, tasks are processed in parallel by threads, which are the units of computation. Threads are grouped in sets of 32 into an instruction execution unit known as a warp. The threads of the same warp perform the same operation without a separate synchronization procedure. Moreover, thread blocks composed of multiple threads are distributed across the streaming multiprocessors. To maximize GPU resource utilization, identifying the optimal numbers of threads and thread blocks is important.</p>
<p>The proper usage of GPU memory is an important efficiency factor. A GPU is composed of multiple types of memory, and their characteristics are as follows:
<list list-type="bullet">
<list-item>
<p><italic>Global memory</italic> is the dynamic random access memory (DRAM) that occupies the largest capacity of the GPU. However, its access speed is slow, and data shared between the CPU and GPU must be copied over a PCIe interface.</p></list-item>
<list-item>
<p><italic>Shared memory</italic> is the memory shared by threads in the same block. It is on-chip memory that has faster access speed than global memory. Shared memory is divided into equally sized memory banks that can be simultaneously accessed.</p></list-item>
<list-item>
<p><italic>Constant memory</italic> is read-only memory. Here, data copying can be performed outside the GPU kernel. When warp threads frequently access data in constant memory, the data are cached to enable fast memory access.</p></list-item>
<list-item>
<p><italic>Texture memory</italic> provides the fastest memory access for graphics work but is small in size. Consequently, when many local variables are used, local memory must be allocated instead.</p></list-item>
</list></p>
<p>The GPU is driven by kernel functions launched from the CPU. Before external data can be used in a GPU operation, they must be copied from the CPU to the GPU. Likewise, data computed on the GPU must be copied back from the GPU to the CPU before the CPU can use them. Each thread running inside the kernel receives a unique identifier.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label><title>Proposed Falcon GPU Implementation</title>
<sec id="s4_1">
<label>4.1</label><title>Difficulties and Solutions</title>
<p>Directly converting the Falcon reference codes to the GPU generates many errors because certain functions are implemented in a form unsuitable for the GPU environment (i.e., the original Falcon reference codes do not fit the GPU&#x2019;s single instruction multiple threads (SIMT) execution model). Therefore, this study introduces multiple implementation methods that handle the difficulties arising when converting Falcon&#x2019;s reference codes into efficient GPU codes.</p>
<sec id="s4_1_1">
<label>4.1.1</label><title>CPU to GPU Data Porting Difficulties and Solutions</title>
<p>Falcon uses multiple variables and constant data to generate and verify signatures. There are variables declared and used inside functions (e.g., temporary variables that store intermediate computation values, flags, and counters) as well as predefined data values used in the form of reference tables (e.g., the RC table for SHA-3, max_sig_bits for decoding, GMb for NTT conversion, and iGMb for inverse NTT conversion). Certain data (e.g., the message, signature, and key materials) occupy memory from start to finish. Generally, variables declared inside a function can be used similarly on the GPU. However, if the variable size grows beyond a certain level, the stack memory may become insufficient; e.g., in Falcon-1024, one public key is 1,793 bytes and one signature is 1,280 bytes. Since the register capacity per block on the latest GPUs is 256 KB, increasing the number of threads per block exhausts the registers, and slow local memory is used instead. On the CPU, memory for such variables is dynamically allocated. However, performing dynamic memory allocation in the middle of GPU kernel execution reduces the overall computational efficiency of the GPU. Moreover, the polynomial buffers used to solve the NTRU equation in the Falcon reference codes are too large for each thread to declare and use independently. Therefore, in this study, the memory required to store polynomials is dynamically allocated before launching the kernel. To avoid declaring variables during function execution or resizing memory through reallocation functions, both of which decrease performance, the variables are defined in advance at their largest size. Falcon structure variables containing the primary data (i.e., the signature, signature length, public key, public key length, message, and message length) are predefined and used on the GPU.</p>
<p>For the constant reference tables used in Falcon, table values are copied in advance to constant memory and cached on the GPU. During <italic>Verify</italic>, five constant tables are stored in the constant memory area: the RC table used in the SHA-3 function, max_sig_bits used in the Falcon decoding function, the GMb table used in NTT conversion, the iGMb table used in inverse NTT conversion, and the l2bound table used for verifying the length condition in the signing and verification processes. Their total size is &#x007E;4 KB.</p>
<p>Moreover, standard memory copy functions such as <italic>memcpy</italic>, which are frequently used in the original Falcon reference code, have limited usage on the GPU. Accordingly, values are deep-copied with explicit for-loops.</p>
</sec>
<sec id="s4_1_2">
<label>4.1.2</label><title>Solution for GPU Double Recursive Function Difficulties</title>
<p>In cryptography, sampling is a method that extracts random values from a specific distribution. Falcon has a function known as <italic>SamplerZ</italic> that performs discrete Gaussian sampling. The top-level sampling function of Falcon is <italic>ffSampling</italic> (refer to Algorithm 4), and its structure is similar to that of the FFT. The <italic>ffSampling</italic> function is called in a double recursive manner: each call recursively invokes itself twice, to a depth of log<sub>2</sub><italic>n</italic> for the polynomial dimension <italic>n</italic>, and the input parameter <italic>n</italic> is halved at each level. For example, if <italic>ffSampling</italic> is first called with <italic>n</italic> &#x003D; 1024, the total number of <italic>ffSampling</italic> invocations becomes 2047 (&#x003D; 1 &#x002B; 2 &#x002B; 4 &#x002B; &#x2026; &#x002B; 1024). <xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows a simplified expression of the code blocks before, between, and after the recursive calls. The code block executed in the conditional statement is denoted by the <bold>X</bold> symbol, the block before the first recursive call by the <bold>F</bold> symbol, the block between the two recursive calls by the <bold>G</bold> symbol, and the block after the second recursive call by the <bold>H</bold> symbol. The blocks denoted by the <bold>X</bold>, <bold>F</bold>, <bold>G</bold>, and <bold>H</bold> symbols can include normal (non-recursive) function calls.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption><title>Illustration of a simplified <italic>ffSampling</italic> when n &#x003D; 8 (red, yellow, green, and blue-sky rectangles denote codes related to <italic>F</italic>, <italic>X</italic>, <italic>G</italic>, and <italic>H</italic> symbols, respectively)</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_33910-fig-1.tif"/>
</fig>
<p>On GPUs, where multiple threads perform simultaneous operations, the use of recursive functions is severely limited because each thread has only a small call stack. Therefore, to efficiently process the <italic>ffSampling</italic> function on the GPU, the double recursive function must be replaced with an iterative version. First, <xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows that the <bold>F</bold> and <bold>H</bold> blocks are always executed consecutively at the beginning and end of <italic>ffSampling</italic>. After the initial consecutive run of <bold>F</bold>, <bold>X</bold> is executed once <italic>n</italic> reaches 1, and <bold>G</bold> and <bold>X</bold> are executed once more just before the final consecutive run of <bold>H</bold>. Because these consecutive runs are each repeated log<sub>2</sub><italic>n</italic> times for the initial input <italic>n</italic>, the following rules are derived for the iterative version of <italic>ffSampling</italic> by borrowing the concept of the ruler function [<xref ref-type="bibr" rid="ref-34">34</xref>], which defines the execution order and execution count of each code block (<bold>F</bold>, <bold>X</bold>, <bold>G</bold>, and <bold>H</bold>):
<list list-type="simple">
<list-item><label>&#x2022;</label><p>In the middle part, the primary loop is repeated a total of <italic>n</italic>/2&#x2013;1 times; within the primary loop, there are two inner loops whose repetition counts follow the derived ruler function as the iterations proceed. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows the derived ruler function graph for the <italic>ffSampling</italic> iterative version.</p></list-item>
<list-item><label>&#x2022;</label><p>In the <italic>i</italic>-th iteration of the primary loop, <bold>G</bold> and <bold>X</bold> are executed first, and then <bold>H</bold> is executed <italic>RF</italic>[<italic>i</italic>] times, where <italic>RF</italic> is a table predefined in constant memory and <italic>RF</italic>[<italic>i</italic>] is the <italic>i</italic>-th value of the ruler function for the primary-loop counter <italic>i</italic>. Then, <bold>G</bold> is performed once, <bold>F</bold> is performed <italic>RF</italic>[<italic>i</italic>] times, and finally <bold>X</bold> is executed to complete the iteration.</p></list-item>
</list></p>
<fig id="fig-2">
<label>Figure 2</label>
<caption><title>Ruler function graph</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_33910-fig-2.tif"/>
</fig>
<p>Algorithm 5 is the proposed iterative version of <italic>ffSampling</italic> corresponding to the recursive execution shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>. While replacing the recursive function with an iterative execution model improves efficiency, one problem remains. Because the existing <italic>ffSampling</italic> uses a Falcon tree, each recursive call descends to a different child of the tree. To use different parameters in the same function, even in the iterative version, the address of each child of the tree is stored in an address pointer array and passed as a function argument. The address pointer array thus stores the variable addresses used at each tree level.</p>

<p><inline-graphic xlink:href="CMC_33910-inline-5.tif"/></p>
</sec>
</sec>
<sec id="s4_2">
<label>4.2</label><title>Proposed Functionalities and Overall Software Structure</title>
<sec id="s4_2_1">
<label>4.2.1</label><title>Software Functionalities</title>
<p>The introduced software provides key generation (<italic>Keygen</italic>), signing (<italic>Sign</italic>), and verification (<italic>Verify</italic>) functions. For <italic>Keygen</italic>, it is assumed that multiple independent keys are generated for future use, whereas for <italic>Verify</italic>, each of multiple signatures is verified with its corresponding public key. Unlike the two operations above, it is assumed for <italic>Sign</italic> that either a single key or multiple keys can be used to sign multiple messages. For example, a typical application server signs multiple messages with its single private key, whereas an outsourced signature server [<xref ref-type="bibr" rid="ref-10">10</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>] signs messages with many different keys. To summarize, the Falcon software provides three functions: <italic>Keygen</italic> for multiple keys, <italic>Sign</italic> with a single key or multiple keys, and <italic>Verify</italic> for multiple keys. Section 5 presents the Falcon software performance for these functionalities.</p>
</sec>
<sec id="s4_2_2">
<label>4.2.2</label><title>Overall Structure</title>
<p>There are two primary execution models for implementing GPU applications: the coarse-grained execution (CGE) model and the fine-grained execution (FGE) model. In CGE, each thread processes one complete task; e.g., a CGE thread computes a single <italic>Keygen</italic>, <italic>Sign</italic>, or <italic>Verify</italic>. CGE has two important advantages: ease of implementation and maximum throughput. However, the latency of each operation is high because the computational power of a single GPU core is considerably lower than that of a CPU core.</p>
<p>FGE can reduce latency to complete the assigned operation by making multiple threads operate together. A single <italic>Keygen</italic>, <italic>Sign</italic>, or <italic>Verify</italic> can be processed using multiple threads in the FGE model. The Falcon software lowers operation latency while providing reasonable throughput by following the FGE model.</p>
<p>In NVIDIA GPUs, the maximum number of threads residing in each thread block is 1,024. However, because resources per block are limited, the number of threads must be adjusted by considering the resources (registers) required by each thread within the block. The number of threads in a block should also be a multiple of the warp size, typically 32, because the warp is the scheduling unit of the GPU; the CUDA manual makes the same suggestion [<xref ref-type="bibr" rid="ref-32">32</xref>]. The study tested the GPU Falcon implementation on a target GPU (RTX 3090) with various numbers of threads per block and found that 32 threads per block provided the best performance in terms of latency. Thus, in the implementation, 32 threads in a block cooperatively compute a single <italic>Keygen</italic>, <italic>Sign</italic>, or <italic>Verify</italic>. To simultaneously process multiple operations, the study&#x2019;s software launches multiple thread blocks. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the overall structure of the software.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption><title>Overall structure of the Falcon GPU software</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_33910-fig-3.tif"/>
</fig>
<p>In the introduced software, under the applied FGE model, each thread in a thread block is assigned 16 terms (Falcon-512) or 32 terms (Falcon-1024) of each polynomial operation. Moreover, multiple <italic>Keygen</italic>, <italic>Sign</italic>, or <italic>Verify</italic> operations can be computed by launching multiple thread blocks. Several techniques are proposed to minimize warp divergence and to enable efficient cooperation among the threads of a block, including a parallel implementation of polynomial multiplication in the Falcon software.</p>
</sec>
</sec>
<sec id="s4_3">
<label>4.3</label><title>Specific Parallel Optimization Strategy</title>
<sec id="s4_3_1">
<label>4.3.1</label><title>Optimization Method for Common Polynomial Functions</title>
<p>General polynomial functions operate on each term of a polynomial. For example, when two polynomials are added, corresponding terms of the two polynomials are added position-wise. If the number of terms in the polynomial is 512 (Falcon-512), then 512 additions are performed. Therefore, if the GPU parallelizes the addition with 32 threads, each thread operates on 16 terms, and the addition of all 512 terms is processed in parallel. For Falcon-1024, each thread in a block of 32 threads processes 32 terms of the polynomial operation.</p>
</sec>
<sec id="s4_3_2">
<label>4.3.2</label><title>Optimization Method for NTT and FFT Functions</title>
<p>In the study&#x2019;s FGE model, the threads of a block cooperate to process polynomial operations such as addition and multiplication. The same number of terms of a polynomial is allocated to and processed by each thread. For example, when adding two polynomials with 512 terms, each of the 32 threads computes a different set of 16 terms, i.e., the <italic>i</italic>-th thread adds the 16 terms of the two polynomials at indexes <inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>i</mml:mi></mml:math></inline-formula> to <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mn>16</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>15</mml:mn></mml:math></inline-formula>, where <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mn>0</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>31</mml:mn></mml:math></inline-formula>. However, polynomial multiplication is more complex than polynomial addition. Falcon uses NTT- and FFT-based methods for efficient polynomial multiplication in the integer and complex domains, respectively. Because NTT is an integer-domain analog of FFT, the two processes are similar; thus, only the NTT-based polynomial multiplication method is explained. NTT-based polynomial multiplication comprises three parts: conversion to the NTT domain, point-wise multiplication, and inverse NTT conversion (conversion back to the original domain). 
In the NTT process, the <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>Z</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03D5;</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi>X</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, which is the ring of Falcon, is factored to <italic>n</italic> different sub-Rings of degree 1, and a polynomial in <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:msub><mml:mi>Z</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is converted into <italic>n</italic> polynomials over the factored sub-Rings. Thus, the NTT process can be considered to repeatedly reduce the intermediate polynomials by the sub-Rings of <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>Z</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> until it reaches degree 1.</p>
<p>The butterfly operation is the primary computation of NTT conversion; it reduces coefficients of degree higher than the factored sub-ring&#x2019;s degree to lower degrees. Because one butterfly operation reduces one coefficient, <italic>n</italic>/2 butterfly operations are executed in each of the log<sub>2</sub><italic>n</italic> layers, where <italic>n</italic> &#x003D; 512 or 1024. For example, at the first layer for <italic>n</italic> &#x003D; 512, the coefficients from the 511-th degree down to the 256-th degree are reduced to the range from the 255-th degree to the 0-th degree with 256 butterfly operations. A butterfly operation multiplies the coefficient to be reduced by a twiddle factor and adds/subtracts it to/from a coefficient of lower degree. Twiddle factors are predefined values stored in constant memory. After converting two polynomials over <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>Z</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> into the NTT domain, the <italic>n</italic> coefficients in the NTT domain are multiplied with each other in a point-wise manner: the <italic>i</italic>-th coefficient of the first polynomial is multiplied by the <italic>i</italic>-th coefficient of the second polynomial and stored in the <italic>i</italic>-th position. 
After computing point-wise multiplication, <italic>n</italic> coefficients are converted into a polynomial over <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>Z</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>&#x003C;</mml:mo></mml:mrow><mml:mi>&#x03D5;</mml:mi><mml:mrow><mml:mo>&#x003E;</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> with inverse NTT (iNTT). Note that iNTT is almost the same as the NTT process except that the inverse twiddle factors are used.</p>
<p>In the parallel NTT and FFT implementation, because 32 threads cooperatively process a Falcon operation such as <italic>Keygen</italic>, <italic>Sign</italic>, or <italic>Verify</italic>, they also compute the NTT conversion in cooperation. For <italic>n</italic> &#x003D; 512, there are eight layers in the NTT conversion, and each layer computes 256 butterfly operations; thus, each thread computes eight butterfly operations per layer. When a thread processes a butterfly, it accesses two coefficients: one to be multiplied with a twiddle factor and the other to be added/subtracted with the multiplied result. Because each thread simultaneously accesses different coefficients, it is important to determine which coefficients are accessed by which threads. For efficient position calculation, <italic>section_number</italic> and <italic>index_number</italic> are first defined and computed as <italic>section_number</italic> &#x003D; <italic>offset</italic>/<italic>interval_size</italic> and <italic>index_number</italic> &#x003D; <italic>offset mod interval_size</italic>, respectively. The initial value of <italic>interval_size</italic> is 256, and it is halved at each layer until it becomes 1 at the final layer. Moreover, in a butterfly operation, <italic>term_number</italic> is the index of the first operand and the second operand is indexed with <italic>term_number</italic> &#x002B; <italic>interval_size</italic>; the two operands belong to the same polynomial. At the <italic>i</italic>-th layer, the two operands of a butterfly operation are located <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mn>512</mml:mn><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:msup></mml:math></inline-formula> positions apart in the polynomial. <xref ref-type="fig" rid="fig-4">Fig. 4</xref> shows how <italic>term_number</italic> is computed and how each thread accesses the two operands of a butterfly operation. Because 32 threads cooperatively execute the NTT conversion, each thread executes eight butterfly operations in each layer. Although their operational structures are similar, NTT and FFT differ in that NTT uses 16-bit integers while FFT uses 64-bit double-precision floating-point values to express polynomial coefficients. Algorithm 6 shows the proposed parallel NTT algorithm. Note that in-place indicates that the resulting sub-polynomials are stored back into the same storage memory. Each thread in a block executes Algorithm 6 for a complete NTT conversion operation. In the inner loop (Steps 5&#x2013;15), <italic>n</italic> is divided by the number of threads per block (TPB).</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption><title>Parallel implementation techniques for NTT and FFT</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_33910-fig-4.tif"/>
</fig>
<p>Remarks: The CUDA platform provides the cuFFT library for FFT conversion operations. However, the library imposes certain usage rules, i.e., the data should be stored in the cufftComplex structure before conversion to the FFT domain, and the cufftComplex data memory should be allocated before launching a kernel function. In the Falcon software, however, the FFT conversion is performed in the middle of <italic>Keygen</italic>, <italic>Sign</italic>, and <italic>Verify</italic>; thus, allocating memory for the cufftComplex data is difficult. Furthermore, the original data would have to be converted to the cufftComplex format, which incurs overhead because the Falcon code expresses complex numbers only in an array format. Thus, the study implemented its own FFT-based polynomial multiplication method.</p>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label><title>Reducing Divergence Effects with Dummy Operations</title>
<p>Synchronization should always be considered when multiple threads concurrently perform operations. If threads perform different operations because of branch-like statements, even within the same warp, a divergence problem occurs: the threads taking the first branch execute the corresponding statements while the threads taking the other branch sit idle, performing no operations until all operations of the first branch are completed.</p>
<p>Warp divergence occurs when threads cannot perform the same operation because of branch instructions. Thus, the functions containing branch instructions are redesigned as dummy operation-based parallel code. The dummy operation-based model requires additional memory, such as a precomputed table, that neutralizes the result of a dummy operation so that it does not affect the final result. <xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows the basic approach of dummy operation-based parallel code. The left side shows the original code including branch instructions, i.e., thread <italic>i</italic> in a block executes either <italic>R</italic>[<italic>i</italic>] &#x003D; <italic>f</italic>(<italic>A</italic>[<italic>i</italic>]) <italic>op g</italic>(<italic>A</italic>[<italic>i</italic>]) or <italic>R</italic>[<italic>i</italic>] &#x003D; <italic>f</italic>(<italic>A</italic>[<italic>i</italic>]), where <italic>f</italic> and <italic>g</italic> are simple functions and <italic>op</italic> denotes an operation such as addition or multiplication. The right side of the figure shows the revised code, in which <italic>f</italic> and <italic>g</italic> are redefined as table-based operations <italic>f&#x2032;</italic> and <italic>g&#x2032;</italic>. For example, <italic>f&#x2032;</italic>(<italic>A</italic>[3]) and <italic>g&#x2032;</italic>(<italic>A</italic>[1]) return zero, which does not affect the final result.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption><title>A basic approach for dummy operation-based parallel codes</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_33910-fig-5.tif"/>
</fig>
</sec>
<sec id="s4_3_4">
<label>4.3.4</label><title>Reducing Latency for Memory Copy</title>
<p>To reduce the idle time caused by memory copies between the CPU and GPU during kernel execution, the CUDA stream technique [<xref ref-type="bibr" rid="ref-32">32</xref>] is further exploited; it asynchronously executes memory copies while the kernel executes Falcon operations. Experimental results showed that using 32 CUDA streams provides the best performance.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label><title>Results</title>
<p>This section evaluates the performance of Falcon running on the GPU; the correctness of the implementation was confirmed by comparing its outputs against test vectors. <xref ref-type="table" rid="table-4">Table 4</xref> shows the performance comparison between the proposed GPU implementation and the latest Falcon implementations on CPUs running AVX2. The CPU Falcon results used for comparison are taken from Pornin (2019) [<xref ref-type="bibr" rid="ref-35">35</xref>].</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption><title>Throughput of <italic>Keygen</italic>, <italic>Sign</italic>, and <italic>Verify</italic> per second for Falcon-512 and Falcon-1024 (<italic>Sign</italic><sup>1</sup> and <italic>Sign</italic><sup>2</sup> are the throughput of Signings with multiple keys and with a single key, respectively)</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th colspan="4" align="center">Falcon-512</th>
<th colspan="4" align="center">Falcon-1024</th>
</tr>
<tr>
<th>Operation</th>
<th><italic>Keygen</italic></th>
<th><italic>Sign</italic><sup>1</sup></th>
<th><italic>Sign</italic><sup>2</sup></th>
<th><italic>Verify</italic></th>
<th><italic>Keygen</italic></th>
<th><italic>Sign</italic><sup>1</sup></th>
<th><italic>Sign</italic><sup>2</sup></th>
<th><italic>Verify</italic></th>
</tr>
</thead>
<tbody>
<tr>
<td>Software1</td>
<td>115.7</td>
<td colspan="2">5,948.1</td>
<td>27,933.0</td>
<td>36.4</td>
<td colspan="2">2,913.0</td>
<td>13,650.0</td>
</tr>
<tr>
<td>Software2</td>
<td>135.3</td>
<td colspan="2">7,692.9</td>
<td>44,424.7</td>
<td>45.5</td>
<td colspan="2">3,818.5</td>
<td>22,416.5</td>
</tr>
<tr>
<td>Software3</td>
<td>1.0</td>
<td colspan="2">7.9</td>
<td>333.3</td>
<td>0.3</td>
<td colspan="2">3.7</td>
<td>162.9</td>
</tr>
<tr>
<td>Software4</td>
<td>172.1</td>
<td colspan="2">12,134.4</td>
<td>58,169.2</td>
<td>59.2</td>
<td colspan="2">6,117.3</td>
<td>28,987.4</td>
</tr>
<tr>
<td>Our works<break/>(GPU)</td>
<td>6047.4</td>
<td>349,960.2</td>
<td>385,761.1</td>
<td>2,014,924.4</td>
<td>1971.8</td>
<td>167,928.4</td>
<td>181,110.7</td>
<td>997,067.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Notes: Software 1: Falcon on Intel i5-8259U 2.3 GHz [<xref ref-type="bibr" rid="ref-31">31</xref>]. Software 2: Falcon on Intel i7-6567U 3.6 GHz using AVX2 [<xref ref-type="bibr" rid="ref-35">35</xref>]. Software 3: Falcon on ARM embedded Cortex-M4 [<xref ref-type="bibr" rid="ref-35">35</xref>]. Software 4: Ryzen 9 5900X 4.7 GHz using AVX2.</p>
</table-wrap-foot>
</table-wrap>
<p>The performance evaluation environment was as follows: the operating system was Windows, and an AMD Ryzen 9 5900X CPU and an NVIDIA GeForce RTX 3090 GPU were used. Performance was measured as the time required to process a given workload of key generation/signature generation/signature verification, averaged over 1,000 repetitions of the same operation. The GPU-side software was implemented such that 32 threads in each block cooperatively performed one Falcon operation, and the number of blocks was set to 256, which corresponded to the performance threshold. The measured times include the memory copy time between the CPU and GPU.</p>
<p>The values in <xref ref-type="table" rid="table-4">Table 4</xref> are the throughput per second. In the table, <italic>Sign</italic><sup>1</sup> and <italic>Sign</italic><sup>2</sup> are the throughput of the signing operation with multiple keys and with a single key, respectively. For CPU implementations, there is no difference between the two cases because the CPU software executes serially with a single key at a time. For example, Pornin (2019) [<xref ref-type="bibr" rid="ref-35">35</xref>] generated 7,692.9 Falcon-512 signatures per second. The study&#x2019;s proposed implementation uses a GPU to perform more operations simultaneously. For Falcon-512, it achieves 52 times faster key generation, 58 times faster signature generation, and 72 times faster signature verification than the implementation of Prest&#x00A0;et&#x00A0;al.&#x00A0;(2022) [<xref ref-type="bibr" rid="ref-31">31</xref>]. Compared with the AVX2 implementation of Pornin (2019) [<xref ref-type="bibr" rid="ref-35">35</xref>], Falcon-512 shows 44/45/45 times faster performance for <italic>Keygen</italic>/<italic>Sign</italic><sup>1</sup>/<italic>Verify</italic>, respectively. For Falcon-1024, the study confirmed that its implementation was about 43/43/44 times faster than that of Pornin (2019) [<xref ref-type="bibr" rid="ref-35">35</xref>] for <italic>Keygen</italic>/<italic>Sign</italic><sup>1</sup>/<italic>Verify</italic>, respectively. For <italic>Sign</italic><sup>2</sup>, which uses a single signing key, the proposed implementation outperforms the AVX2 CPU implementation [<xref ref-type="bibr" rid="ref-35">35</xref>] by 50 and 47 times for Falcon-512 and Falcon-1024, respectively.</p>

<p>Compared with Falcon CPU software (Software4) using AVX2 on the latest AMD Ryzen 9 5900X CPU, the study&#x2019;s Falcon-512 software demonstrated 35/28/34 times better performance in <italic>Keygen</italic>/<italic>Sign</italic><sup>1</sup>/<italic>Verify</italic>, respectively. Moreover, the study&#x2019;s Falcon-1024 software demonstrated 33/27/34 times better performance in <italic>Keygen</italic>/<italic>Sign</italic><sup>1</sup>/<italic>Verify</italic>, respectively.</p>
</sec>
<sec id="s6">
<label>6</label><title>Conclusion</title>
<p>This study showed that PQC can operate on a GPU, taking Falcon, an algorithm finally selected in NIST&#x2019;s PQC standardization competition, as an example. Multiple methods were proposed to make the existing functions operate successfully on the GPU, and optimization techniques that exploit GPU features for fast processing were introduced. To our knowledge, this is the first implementation of Falcon on a GPU. Operating PQC on a GPU demonstrates the possibility of replacing existing algorithms with PQC in the many server environments that use GPUs. Furthermore, the implementation techniques proposed in this study are potentially applicable to other lattice-based PQC schemes.</p>
</sec>
</body>
<back>
<sec><title>Funding Statement</title>
<p>This work was partly supported by the <funding-source>National Research Foundation of Korea (NRF)</funding-source> grant funded by the Korea government (MSIT) (No. <award-id>2022R1C1C1013368</award-id>). This was partly supported in part by <funding-source>Korea University Grant</funding-source> and in part by the <funding-source>Institute of Information and Communications Technology Planning and Evaluation (IITP)</funding-source> Grant through the <funding-source>Korean Government [Ministry of Science and ICT (MSIT)]</funding-source>, Development of Physical Channel Vulnerability-Based Attacks and its Countermeasures for Reliable On-Device Deep Learning Accelerator Design, under Grant <award-id>2021-0-00903</award-id>.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear"><title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P. W.</given-names> <surname>Shor</surname></string-name></person-group>, &#x201C;<article-title>Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer</article-title>,&#x201D; <source>SIAM Journal on Computing</source>, vol. <volume>26</volume>, no. <issue>5</issue>, pp. <fpage>1484</fpage>&#x2013;<lpage>1509</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Moody</surname></string-name></person-group>, &#x201C;<chapter-title>Round 2 of NIST PQC competition</chapter-title>.&#x201D; In <source>Invited Talk at PQCrypto</source>, <publisher-loc>Chongqing, China</publisher-loc>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M. R.</given-names> <surname>Albrecht</surname></string-name>, <string-name><given-names>D. J.</given-names> <surname>Bernstein</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Chou</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Cid</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gilcher</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Classic McEliece</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://classic.mceliece.org">https://classic.mceliece.org</ext-link>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Schwabe</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Avanzi</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Bos</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Ducas</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kiltz</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>CRYSTALS-Kyber</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://pq-crystals.org/kyber/index.shtml">https://pq-crystals.org/kyber/index.shtml</ext-link>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Danba</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hoffstein</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Hulsing</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Rijneveld</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>NTRU</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://ntru.org/">https://ntru.org/</ext-link>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J. -P.</given-names> <surname>D&#x2019;Anvers</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Karmakar</surname></string-name>, <string-name><given-names>S. S.</given-names> <surname>Roy</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Vercauteren</surname></string-name>, <string-name><given-names>J. M. B.</given-names> <surname>Mera</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Saber</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://www.esat.kuleuven.be/cosic/pqcrypto/saber/">https://www.esat.kuleuven.be/cosic/pqcrypto/saber/</ext-link>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Lyubashevsky</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Ducas</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kiltz</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Lepoint</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Schwabe</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>CRYSTALS-Dilithium</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://pq-crystals.org/dilithium/index.shtml">https://pq-crystals.org/dilithium/index.shtml</ext-link>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Prest</surname></string-name>, <string-name><given-names>P. -A.</given-names> <surname>Fouque</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hoffstein</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kirchner</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Lyubashevsky</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Falcon</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://falcon-sign.info">https://falcon-sign.info</ext-link>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ding</surname></string-name>, <string-name><given-names>M. -S.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Petzoldt</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Schmidt</surname></string-name>, <string-name><given-names>B. -Y.</given-names> <surname>Yang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Rainbow</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://www.pqcrainbow.org">https://www.pqcrainbow.org</ext-link>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. C.</given-names> <surname>Seo</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Hong</surname></string-name></person-group>, &#x201C;<article-title>Accelerating elliptic curve scalar multiplication over GF(2<sup>m</sup>)</article-title>,&#x201D; <source>Journal of Parallel and Distributed Computing</source>, vol. <volume>75</volume>, pp. <fpage>152</fpage>&#x2013;<lpage>167</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>W. T.</given-names> <surname>Zhu</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Jing</surname></string-name></person-group>, &#x201C;<article-title>An efficient elliptic curve cryptography signature server with GPU acceleration</article-title>,&#x201D; <source>IEEE Transactions on Information Forensics and Security</source>, vol. <volume>12</volume>, no. <issue>1</issue>, pp. <fpage>111</fpage>&#x2013;<lpage>122</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Wei</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Emmart</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>DPF-ECC: A framework for efficient ECC with double precision floating-point computing power</article-title>,&#x201D; <source>IEEE Transactions on Information Forensics and Security</source>, vol. <volume>16</volume>, pp. <fpage>3988</fpage>&#x2013;<lpage>4002</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jati</surname></string-name>, <string-name><given-names>A. K.</given-names> <surname>Chauhan</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Chattopadhyay</surname></string-name></person-group>, &#x201C;<article-title>PQC acceleration using GPUs: FrodoKEM, NewHope, and Kyber</article-title>,&#x201D; <source>IEEE Transactions on Parallel and Distributed Systems</source>, vol. <volume>32</volume>, no. <issue>3</issue>, pp. <fpage>575</fpage>&#x2013;<lpage>586</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xu</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>CUNH: Efficient GPU implementations of post-quantum KEM NewHope</article-title>,&#x201D; <source>IEEE Transactions on Parallel and Distributed Systems</source>, vol. <volume>33</volume>, no. <issue>3</issue>, pp. <fpage>551</fpage>&#x2013;<lpage>568</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Seong</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yeom</surname></string-name> and <string-name><given-names>J. -S.</given-names> <surname>Kang</surname></string-name></person-group>, &#x201C;<article-title>Accelerated implementation of NTRU on GPU for efficient key exchange in multi-client environment</article-title>,&#x201D; <source>Journal of the Korea Institute of Information Security &#x0026; Cryptology</source>, vol. <volume>31</volume>, no. <issue>3</issue>, pp. <fpage>481</fpage>&#x2013;<lpage>496</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Gowanlock</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Cambou</surname></string-name></person-group>, &#x201C;<article-title>Saber-GPU: A response-based cryptography algorithm for saber on the GPU</article-title>,&#x201D; in <conf-name>Proc. Pacific Rim Int. Symp. on Dependable Computing</conf-name>, <publisher-loc>Perth, Australia</publisher-loc>, pp. <fpage>123</fpage>&#x2013;<lpage>132</lpage>, <year>2021</year>. </mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. C.</given-names> <surname>Seo</surname></string-name></person-group>, &#x201C;<article-title>SIKE on GPU: Accelerating supersingular isogeny-based key encapsulation mechanism on graphic processing units</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>9</volume>, pp. <fpage>116731</fpage>&#x2013;<lpage>116744</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. -K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Seo</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name> and <string-name><given-names>S. O.</given-names> <surname>Hwang</surname></string-name></person-group>, &#x201C;<article-title>Tensorcrypto: High throughput acceleration of lattice-based cryptography using tensor core on GPU</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>10</volume>, pp. <fpage>20616</fpage>&#x2013;<lpage>20632</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>PQClean Project</collab></person-group>, <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://github.com/PQClean/PQClean">https://github.com/PQClean/PQClean</ext-link>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Ducas</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kiltz</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Lepoint</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Lyubashevsky</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Schwabe</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>CRYSTALS-Dilithium: A lattice-based digital signature scheme</article-title>,&#x201D; <source>IACR Transactions on Cryptographic Hardware and Embedded Systems</source>, vol. <volume>2018</volume>, pp. <fpage>238</fpage>&#x2013;<lpage>268</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ding</surname></string-name>, <string-name><given-names>M. -S.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Petzoldt</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Schmidt</surname></string-name>, <string-name><given-names>B. -Y.</given-names> <surname>Yang</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Rainbow specifications and supporting documentation</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions">https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions</ext-link>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Nejatollahi</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Dutt</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ray</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Regazzoni</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Banerjee</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Post-quantum lattice-based cryptography implementations: A survey</article-title>,&#x201D; <source>ACM Computing Survey</source>, vol. <volume>51</volume>, no. <issue>6</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>41</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W. -K.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Akleylek</surname></string-name>, <string-name><given-names>D. C. -K.</given-names> <surname>Wong</surname></string-name>, <string-name><given-names>W. -S.</given-names> <surname>Yap</surname></string-name>, <string-name><given-names>B-M.</given-names> <surname>Goi</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Parallel implementation of Nussbaumer algorithm and number theoretic transform on a GPU platform: Application to qTESLA</article-title>,&#x201D; <source>The Journal of Supercomputing</source>, vol. <volume>77</volume>, no. <issue>4</issue>, pp. <fpage>3289</fpage>&#x2013;<lpage>3314</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>&#x00D6;</given-names> <surname>&#x00D6;zerk</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Elgezen</surname></string-name>, <string-name><given-names>A. C.</given-names> <surname>Mert</surname></string-name>, <string-name><given-names>E.</given-names> <surname>&#x00D6;zt&#x00FC;rk</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Savas</surname></string-name></person-group>, &#x201C;<article-title>Efficient number theoretic transform implementation on GPU for homomorphic encryption</article-title>,&#x201D; <source>The Journal of Supercomputing</source>, vol. <volume>78</volume>, no. <issue>2</issue>, pp. <fpage>2840</fpage>&#x2013;<lpage>2872</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>P. -A.</given-names> <surname>Fouque</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hoffstein</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kirchner</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Lyubashevsky</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Pornin</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Falcon: Fast-Fourier lattice-based compact signatures over NTRU</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://www.di.ens.fr/&#x007E;prest/Publications/falcon.pdf">https://www.di.ens.fr/&#x007E;prest/Publications/falcon.pdf</ext-link>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W. M.</given-names> <surname>Gentleman</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Sande</surname></string-name></person-group>, &#x201C;<article-title>Fast Fourier transforms: For fun and profit</article-title>,&#x201D; in <conf-name>Proc. AFIPS</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, pp. <fpage>563</fpage>&#x2013;<lpage>578</lpage>, <year>1966</year>. </mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Agarwal</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Burrus</surname></string-name></person-group>, &#x201C;<article-title>Fast convolution using Fermat number transforms with applications to digital filtering</article-title>,&#x201D; <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>, vol. <volume>22</volume>, no. <issue>2</issue>, pp. <fpage>87</fpage>&#x2013;<lpage>97</lpage>, <year>1974</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>IEEE Computer Society</collab></person-group>, &#x201C;<article-title>IEEE standard for floating-point arithmetic</article-title>,&#x201D; <comment>IEEE STD 754-2019</comment>, <year>2019</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/document/8766229">https://ieeexplore.ieee.org/document/8766229</ext-link>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Montgomery</surname></string-name></person-group>, &#x201C;<article-title>Modular multiplication without trial division</article-title>,&#x201D; <source>Mathematics of Computation</source>, vol. <volume>44</volume>, no.&#x00A0;<issue>170</issue>, pp. <fpage>519</fpage>&#x2013;<lpage>521</lpage>, <year>1985</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Seiler</surname></string-name></person-group>, &#x201C;<article-title>Faster AVX2 optimized NTT multiplication for ring-LWE lattice cryptography</article-title>,&#x201D; <source>IACR Cryptology ePrint Archive</source>, <comment>Report 2018/039</comment>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Prest</surname></string-name>, <string-name><given-names>P. -A.</given-names> <surname>Fouque</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hoffstein</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kirchner</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Lyubashevsky</surname></string-name> <etal>et al.</etal></person-group><italic>,</italic> &#x201C;<article-title>Falcon specifications and supporting documentation</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions">https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions</ext-link>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>NVIDIA</collab>, <string-name><given-names>P.</given-names> <surname>Vingelmann</surname></string-name> and <string-name><given-names>F. H.</given-names> <surname>Fitzek</surname></string-name></person-group>, &#x201C;<article-title>CUDA, release: 10.2.89</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://developer.nvidia.com/cuda-toolkit">https://developer.nvidia.com/cuda-toolkit</ext-link>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. E.</given-names> <surname>Stone</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Gohara</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Shi</surname></string-name></person-group>, &#x201C;<article-title>OpenCL: A parallel programming standard for heterogeneous computing systems</article-title>,&#x201D; <source>Computing in Science Engineering</source>, vol. <volume>12</volume>, no. <issue>3</issue>, pp. <fpage>66</fpage>&#x2013;<lpage>73</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><collab>OEIS Foundation Inc.</collab></person-group>, &#x201C;<article-title>The ruler function, entry A001511 in the On-Line Encyclopedia of Integer Sequences</article-title>,&#x201D; <year>2022</year>. [Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://oeis.org/A001511">http://oeis.org/A001511</ext-link>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Pornin</surname></string-name></person-group>, &#x201C;<article-title>New efficient, constant-time implementations of Falcon</article-title>,&#x201D; <source>Cryptology ePrint Archive</source>, <comment>Report 2019/893</comment>, <year>2019</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>