<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CSSE</journal-id>
<journal-id journal-id-type="nlm-ta">CSSE</journal-id>
<journal-id journal-id-type="publisher-id">CSSE</journal-id>
<journal-title-group>
<journal-title>Computer Systems Science &#x0026; Engineering</journal-title>
</journal-title-group>
<issn pub-type="ppub">0267-6192</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">17144</article-id>
<article-id pub-id-type="doi">10.32604/csse.2021.017144</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Efficient Concurrent L1-Minimization Solvers on GPUs</article-title><alt-title alt-title-type="left-running-head">Efficient Concurrent L1-Minimization Solvers on GPUs</alt-title><alt-title alt-title-type="right-running-head">Efficient Concurrent L1-Minimization Solvers on GPUs</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western">
<surname>Chu</surname>
<given-names>Xinyue</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western">
<surname>Gao</surname>
<given-names>Jiaquan</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
<email>springf12@163.com</email>
</contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western">
<surname>Sheng</surname>
<given-names>Bo</given-names>
</name>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<aff id="aff-1">
<label>1</label><institution>Jiangsu Key Laboratory for NSLSCS, School of Computer and Electronic Information, Nanjing Normal University</institution>, <addr-line>Nanjing 210023</addr-line>, <country>China</country></aff>
<aff id="aff-2">
<label>2</label><institution>Department of Computer Science, University of Massachusetts Boston</institution>, <addr-line>MA 02125</addr-line>, <country>USA</country></aff>
</contrib-group><author-notes><corresp id="cor1">&#x002A;Corresponding Author: Jiaquan Gao. Email: <email>springf12@163.com</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2021-05-12">
<day>12</day>
<month>5</month>
<year>2021</year>
</pub-date>
<volume>38</volume>
<issue>3</issue>
<fpage>305</fpage>
<lpage>320</lpage>
<history>
<date date-type="received">
<day>20</day>
<month>1</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>2</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2021 Chu, Gao and Sheng</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Chu, Gao and Sheng</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CSSE_17144.pdf"></self-uri>
<abstract>
<p>Because the concurrent L1-minimization (L1-min) problem arises in many real applications, this paper investigates how to solve it in parallel on GPUs. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (<italic>Ax</italic>) and a novel self-adaptive thread implementation of the matrix-vector multiplication (<italic>A</italic><sup><italic>T</italic></sup><italic>x</italic>) on the GPU. Vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, building on these kernels, the iterative shrinkage-thresholding algorithm is used to construct two concurrent L1-min solvers on a single GPU, designed from the perspectives of streams and thread blocks respectively, and their performance is optimized with new GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. Experimental results validate the effectiveness and performance of the proposed methods.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Concurrent L1-minimization problem</kwd>
<kwd>dense matrix-vector multiplication</kwd>
<kwd>fast iterative shrinkage-thresholding algorithm</kwd>
<kwd>CUDA</kwd>
<kwd>GPUs</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Owing to the sparsity of its solution, the L1-min problem has been successfully applied in various fields such as signal processing [<xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;<xref ref-type="bibr" rid="ref-3">3</xref>], machine learning [<xref ref-type="bibr" rid="ref-4">4</xref>&#x2013;<xref ref-type="bibr" rid="ref-8">8</xref>], and statistical inference [<xref ref-type="bibr" rid="ref-9">9</xref>]. Moreover, these applications often require solving a concurrent L1-min problem, in which a large number of L1-min problems must be computed at the same time. This motivates our study of the concurrent L1-min problem in this paper. We consider the following concurrent L1-min problem:</p>
<p><disp-formula id="eqn-1">
<label>(1)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-1.png"/><tex-math id="tex-eqn-1"><![CDATA[\min \;||{x^i}|{|_1}\;\;\;\;\;s.t.\;\;\;A{x^i} = {b^i},\;\;\;i = 1,2, \cdots ,k,\;]]></tex-math>--><mml:math id="mml-eqn-1" display="block"><mml:mo form="prefix" movablelimits="true">min</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>A</mml:mi><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>i</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace></mml:math>
<!--</alternatives>--></disp-formula></p>
<p>where <inline-formula id="ieqn-1">
<!--<alternatives><inline-graphic xlink:href="ieqn-1.tif"/><tex-math id="tex-ieqn-1"><![CDATA[A \in {R^{m \times n}}\;(m < < n)]]></tex-math>--><mml:math id="mml-ieqn-1"><mml:mi>A</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mspace width="thickmathspace"></mml:mspace><mml:mo stretchy="false">(</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x003C;&#x003C;</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> is a full-rank dense matrix, <inline-formula id="ieqn-2">
<!--<alternatives><inline-graphic xlink:href="ieqn-2.tif"/><tex-math id="tex-ieqn-2"><![CDATA[{b^i} \in {R^m}]]></tex-math>--><mml:math id="mml-ieqn-2"><mml:mrow><mml:msup><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is a pre-specified vector, and <inline-formula id="ieqn-3">
<!--<alternatives><inline-graphic xlink:href="ieqn-3.tif"/><tex-math id="tex-ieqn-3"><![CDATA[{x^i} \in {R^n}]]></tex-math>--><mml:math id="mml-ieqn-3"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is an unknown solution. These L1-min problems are mutually independent except that they share the matrix <italic>A</italic>, which makes the problem well suited to parallel computing.</p>
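Because the k problems share <italic>A</italic> but are otherwise independent, the per-problem matrix-vector products can be batched into matrix-matrix products. The following NumPy sketch is purely illustrative (it is not the paper&#x2019;s GPU solver): it runs a few ISTA iterations on the unconstrained relaxation of Eq. (1) for all right-hand sides at once, with the right-hand sides stacked as columns of a hypothetical matrix <monospace>B</monospace>.

```python
import numpy as np

def soft(u, a):
    # soft-thresholding: sign(u) * max(|u| - a, 0), applied elementwise
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def batched_ista(A, B, lam, iters=200):
    """Run ISTA on k Lasso problems at once; column i of B is b^i.

    All k problems share A, so each iteration needs only two
    matrix-matrix products instead of 2k matrix-vector products.
    """
    n = A.shape[1]
    k = B.shape[1]
    L = np.linalg.norm(A, 2) ** 2        # shared Lipschitz constant ||A^T A||_2
    X = np.zeros((n, k))                 # column i approximates x^i
    for _ in range(iters):
        X = soft(X - (A.T @ (A @ X - B)) / L, lam / L)
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))        # one shared, wide measurement matrix
B = rng.standard_normal((20, 3))         # three independent right-hand sides
X = batched_ista(A, B, lam=0.1)
```

The same independence argument is what lets the paper map each L1-min problem to its own CUDA stream or thread block.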
<p>With their many-core architecture, graphics processing units (GPUs) offer ample computational power for scientific computing. Processing big data on GPUs has drawn much attention in recent years [<xref ref-type="bibr" rid="ref-10">10</xref>&#x2013;<xref ref-type="bibr" rid="ref-14">14</xref>]. Following NVIDIA&#x2019;s introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications [<xref ref-type="bibr" rid="ref-15">15</xref>], GPUs have become strong competitors as general-purpose parallel programming systems.</p>
<p>Owing to the high compute capacity of GPUs, accelerating algorithms for the L1-min problem on the GPU has attracted considerable attention recently [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>]. Many L1-min algorithms exist, such as the gradient projection method [<xref ref-type="bibr" rid="ref-18">18</xref>], the truncated Newton interior-point method [<xref ref-type="bibr" rid="ref-19">19</xref>], homotopy methods [<xref ref-type="bibr" rid="ref-20">20</xref>], the augmented Lagrange multiplier method (ALM) [<xref ref-type="bibr" rid="ref-21">21</xref>], the class of iterative shrinkage-thresholding methods such as FISTA [<xref ref-type="bibr" rid="ref-22">22</xref>], and the alternating direction method of multipliers [<xref ref-type="bibr" rid="ref-23">23</xref>]. Most of them consist of dense matrix-vector multiplications, such as <italic>Ax</italic> and <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic>, and vector operations. In 2011, Nath et al. [<xref ref-type="bibr" rid="ref-24">24</xref>] presented an optimized symmetric dense matrix-vector multiplication on GPUs. In 2016, Abdelfattah et al. [<xref ref-type="bibr" rid="ref-25">25</xref>] proposed KBLAS, an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Moreover, a subset of KBLAS high-performance kernels has been integrated into the CUBLAS library [<xref ref-type="bibr" rid="ref-26">26</xref>] starting from version 6.0. In addition, the CUBLAS library contains highly efficient GPU implementations of vector operations. Therefore, existing GPU-accelerated L1-min algorithms are mostly based on CUBLAS.</p>
<p>However, for the CUBLAS implementations of <italic>Ax</italic> and <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic>, performance fluctuates as <italic>m</italic> (rows) increases with <italic>n</italic> (columns) fixed, or as <italic>n</italic> increases with <italic>m</italic> fixed, and the gap between the maximum and minimum performance is pronounced. Gao et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] observed these phenomena and presented two novel <italic>Ax</italic> and <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic> implementations on the GPU to alleviate these drawbacks of CUBLAS. Furthermore, they used FISTA and ALM to propose two adaptively optimized L1-min solvers on the GPU.</p>
<p>In this paper, we further investigate the design of efficient algorithms for solving the L1-min problem on GPUs. Unlike previous work [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-17">17</xref>], we emphasize the design of concurrent L1-min solvers on GPUs. First, we enhance Gao&#x2019;s GEMV and GEMV-T kernels by optimizing the warp allocation strategy of the GEMV kernel and the thread allocation strategy of the GEMV-T kernel, and by designing optimization schemes for both kernels. Second, vector-operation and inner-product decision trees are built automatically to merge identical operations into a single kernel. For a vector of any size, the optimal vector-operation and inner-product implementations are selected automatically and rapidly from the decision trees. Furthermore, taking the popular fast iterative shrinkage-thresholding algorithm (FISTA) as an example, and based on the proposed GEMV, GEMV-T, vector-operation, and inner-product kernels, we present two optimized concurrent L1-min solvers on a single GPU, designed from the perspectives of streams and thread blocks, respectively. Finally, we design a concurrent L1-min solver on multiple GPUs. In this solver, each GPU solves only one L1-min problem at a time instead of solving multiple L1-min problems via multiple streams or thread blocks; it targets the case where the number of L1-min problems is much smaller than the number of streams (or thread blocks). Experimental results show that our proposed GEMV and GEMV-T kernels are more robust than those of Gao et al. and CUBLAS, and that the proposed concurrent L1-min solvers on GPUs are effective. The main contributions are summarized as follows:<list list-type="bullet"><list-item>
<p>Two novel adaptively optimized GPU-accelerated implementations of the matrix-vector multiplication are proposed.</p></list-item><list-item>
<p>Two optimized concurrent L1-min solvers on a single GPU are presented, from the perspectives of streams and thread blocks, respectively.</p></list-item><list-item>
<p>Utilizing new GPU features and the technique of kernel merging, an optimized concurrent L1-min solver on multiple GPUs is proposed.</p></list-item></list></p>
<p>The remainder of this paper is organized as follows. Section 2 describes the fast iterative shrinkage-thresholding algorithm. Section 3 presents two adaptively optimized GPU implementations of the matrix-vector multiplication, together with the vector-operation and inner-product decision trees. Sections 4 and 5 present two concurrent L1-min solvers on a single GPU and a concurrent L1-min solver on multiple GPUs, respectively. Experimental results are reported in Section 6. Section 7 concludes the paper and outlines future research directions.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Fast Iterative Shrinkage-Thresholding Algorithm</title>
<p>The L1-min problem is also known as the basis pursuit (BP) problem [<xref ref-type="bibr" rid="ref-27">27</xref>]. In practice, the measurement data <italic>b</italic> often contain noise (such as a measurement error <inline-formula id="ieqn-4">
<!--<alternatives><inline-graphic xlink:href="ieqn-4.tif"/><tex-math id="tex-ieqn-4"><![CDATA[\epsilon]]></tex-math>--><mml:math id="mml-ieqn-4"><mml:mi>&#x03B5;</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>); the resulting problem is called the basis pursuit denoising (BPDN) problem. A well-known variant is the unconstrained BPDN problem with a scalar weight <inline-formula id="ieqn-5">
<!--<alternatives><inline-graphic xlink:href="ieqn-5.tif"/><tex-math id="tex-ieqn-5"><![CDATA[\lambda]]></tex-math>--><mml:math id="mml-ieqn-5"><mml:mi>&#x03BB;</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, also known in statistics as the Lasso problem [<xref ref-type="bibr" rid="ref-28">28</xref>]:</p>
<p><disp-formula id="eqn-11">
<label>(2)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-11.png"/><tex-math id="tex-eqn-11"><![CDATA[\min \;\displaystyle{1 \over 2}||Ax - b||_2^2 + \lambda ||x|{|_1}.]]></tex-math>--><mml:math id="mml-eqn-11"><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>A</mml:mi><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x002B;</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>.</mml:mo></mml:mstyle></mml:math>
<!--</alternatives>--></disp-formula></p>
<p>The fast iterative shrinkage-thresholding algorithm (FISTA) is an accelerated variant of this class of methods: it achieves a non-asymptotic convergence rate of <italic>O</italic>(1/<italic>k</italic><sup>2</sup>) by incorporating Nesterov&#x2019;s optimal gradient method [<xref ref-type="bibr" rid="ref-22">22</xref>]. FISTA adds a new sequence <inline-formula id="ieqn-6">
<!--<alternatives><inline-graphic xlink:href="ieqn-6.tif"/><tex-math id="tex-ieqn-6"><![CDATA[\left\{ {{y_k},\;k = 1,2, \cdots } \right\}]]></tex-math>--><mml:math id="mml-ieqn-6"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> as follows.</p>
<p><disp-formula id="eqn-2">
<label>(3)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-2.png"/><tex-math id="tex-eqn-2"><![CDATA[\left\{ {\matrix{ {{x_{k + 1}} = soft({y_k} - \displaystyle{1 \over {{L_f}}}\nabla f({y_k}),\displaystyle{\lambda \over {{L_f}}}),} \hfill \cr {{t_{k + 1}} = \displaystyle{{1 + \sqrt {1 + 4t_k^2} } \over 2},} \hfill \cr {{y_{k + 1}} = {x_k} + \displaystyle{{{t_k} - 1} \over {{t_k} + 1}}({x_k} - {x_{k - 1}}),} \hfill \cr } } \right.]]></tex-math>--><mml:math id="mml-eqn-2" display="block"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnspacing="1em" rowspacing="4pt"><mml:mtr><mml:mtd columnalign="left"><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mi>&#x03BB;</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mstyle></mml:mstyle></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="left"><mml:mrow><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mstyle 
scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x002B;</mml:mo><mml:msqrt><mml:mn>1</mml:mn><mml:mo>&#x002B;</mml:mo><mml:mn>4</mml:mn><mml:msubsup><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:msqrt></mml:mrow><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:mo>,</mml:mo></mml:mstyle></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="left"><mml:mrow><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x002B;</mml:mo><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mstyle></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo stretchy="true" symmetric="true" fence="true"></mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></disp-formula></p>
<p>where <inline-formula id="ieqn-7">
<!--<alternatives><inline-graphic xlink:href="ieqn-7.tif"/><tex-math id="tex-ieqn-7"><![CDATA[soft(u,a) = sign(u)\max \left\{ {|u| - a,0} \right\}]]></tex-math>--><mml:math id="mml-ieqn-7"><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003D;</mml:mo><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo form="prefix" movablelimits="true">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>u</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is the soft-thresholding operator, <inline-formula id="ieqn-8">
<!--<alternatives><inline-graphic xlink:href="ieqn-8.tif"/><tex-math id="tex-ieqn-8"><![CDATA[{y_1} = {x_0}]]></tex-math>--><mml:math id="mml-ieqn-8"><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-9">
<!--<alternatives><inline-graphic xlink:href="ieqn-9.tif"/><tex-math id="tex-ieqn-9"><![CDATA[{t_1} = 1]]></tex-math>--><mml:math id="mml-ieqn-9"><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> and the associated Lipschitz constant <inline-formula id="ieqn-10">
<!--<alternatives><inline-graphic xlink:href="ieqn-10.tif"/><tex-math id="tex-ieqn-10"><![CDATA[{L_f}]]></tex-math>--><mml:math id="mml-ieqn-10"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> of <inline-formula id="ieqn-11">
<!--<alternatives><inline-graphic xlink:href="ieqn-11.tif"/><tex-math id="tex-ieqn-11"><![CDATA[\nabla f( \cdot )]]></tex-math>--><mml:math id="mml-ieqn-11"><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> is given by the spectral norm of <italic>A</italic><sup><italic>T</italic></sup><italic>A</italic>, denoted by <inline-formula id="ieqn-12">
<!--<alternatives><inline-graphic xlink:href="ieqn-12.tif"/><tex-math id="tex-ieqn-12"><![CDATA[||{A^T}A|{|_2}]]></tex-math>--><mml:math id="mml-ieqn-12"><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mi>A</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>. For large-scale problems, <inline-formula id="ieqn-13">
<!--<alternatives><inline-graphic xlink:href="ieqn-13.tif"/><tex-math id="tex-ieqn-13"><![CDATA[{L_f}]]></tex-math>--><mml:math id="mml-ieqn-13"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is not always easily computable. Thus, a backtracking stepsize rule is suggested to alleviate this drawback. <xref ref-type="fig" rid="fig-10">Algorithm 1</xref> summarizes the generic FISTA algorithm with a backtracking stepsize rule [<xref ref-type="bibr" rid="ref-22">22</xref>].</p>
<fig id="fig-10">
<label>Algorithm 1</label>
<caption>
<title>Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-10.png"/>
</fig>
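To make the update in Eq. (3) concrete, the following NumPy sketch implements a constant-stepsize FISTA for the Lasso objective of Eq. (2). It is an illustration rather than the paper&#x2019;s GPU code: it computes the Lipschitz constant <italic>L</italic><sub><italic>f</italic></sub> directly from the spectral norm (whereas Algorithm 1 estimates it by backtracking), and it uses the standard Beck&#x2013;Teboulle momentum weight (<italic>t</italic><sub><italic>k</italic></sub> &#x2212; 1)/<italic>t</italic><sub><italic>k</italic>+1</sub>.

```python
import numpy as np

def soft(u, a):
    # soft-thresholding operator: sign(u) * max(|u| - a, 0), elementwise
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def fista(A, b, lam, iters=300):
    """Minimize 0.5*||Ax - b||_2^2 + lam*||x||_1 with constant-stepsize FISTA."""
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f: ||A^T A||_2
    x_prev = np.zeros(n)
    y = x_prev.copy()                  # y_1 = x_0
    t = 1.0                            # t_1 = 1
    for _ in range(iters):
        grad = A.T @ (A @ y - b)       # gradient of the smooth term at y
        x = soft(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # Nesterov momentum step
        x_prev, t = x, t_next
    return x_prev
```

Note how each iteration is dominated by one GEMV (<monospace>A @ y</monospace>) and one GEMV-T (<monospace>A.T @ ...</monospace>) plus a handful of vector operations, which is exactly why Section 3 focuses on those kernels.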
</sec>
<sec id="s3">
<label>3</label>
<title>GPU Kernels</title>
<p>The main components of FISTA are <italic>Ax</italic> (GEMV), <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic> (GEMV-T), vector operations, and vector inner products. In the following subsections, we present their respective GPU implementations. <xref ref-type="table" rid="table-1">Tab. 1</xref> lists the symbols used in this paper. <inline-formula id="ieqn-14">
<!--<alternatives><inline-graphic xlink:href="ieqn-14.tif"/><tex-math id="tex-ieqn-14"><![CDATA[{N^{sm}}]]></tex-math>--><mml:math id="mml-ieqn-14"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-15">
<!--<alternatives><inline-graphic xlink:href="ieqn-15.tif"/><tex-math id="tex-ieqn-15"><![CDATA[{N^{reg}}]]></tex-math>--><mml:math id="mml-ieqn-15"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-16">
<!--<alternatives><inline-graphic xlink:href="ieqn-16.tif"/><tex-math id="tex-ieqn-16"><![CDATA[{N^{mem}}]]></tex-math>--><mml:math id="mml-ieqn-16"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-17">
<!--<alternatives><inline-graphic xlink:href="ieqn-17.tif"/><tex-math id="tex-ieqn-17"><![CDATA[{N^{tb}}]]></tex-math>--><mml:math id="mml-ieqn-17"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, and <inline-formula id="ieqn-18">
<!--<alternatives><inline-graphic xlink:href="ieqn-18.tif"/><tex-math id="tex-ieqn-18"><![CDATA[{N^{td}}]]></tex-math>--><mml:math id="mml-ieqn-18"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> are constants for a specific GPU. A row-major, 0-based array <italic>a</italic> is used to store the matrix <italic>A</italic>, and only single-precision (float) values are used for all computations in this paper.</p>
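As a small sanity check of the row-major, 0-based storage convention (a sketch with arbitrary illustrative sizes, not the paper&#x2019;s code), element <italic>A</italic>[<italic>i</italic>][<italic>j</italic>] of an <italic>m</italic> &#x00D7; <italic>n</italic> matrix lives at flat index <italic>i</italic>&#x00B7;<italic>n</italic> + <italic>j</italic>, and one row&#x2019;s dot product with <italic>x</italic> yields one entry of <italic>Ax</italic>:

```python
import numpy as np

m, n = 4, 6
A = np.arange(m * n, dtype=np.float32).reshape(m, n)
a = A.ravel()                      # row-major, 0-based flat array, as in the paper
x = np.ones(n, dtype=np.float32)

# element A[i][j] lives at a[i*n + j]; row i's dot product with x gives (Ax)[i]
i = 2
row_dot = sum(a[i * n + j] * x[j] for j in range(n))
assert row_dot == A[i] @ x
```

This per-row dot product is the unit of work that the GEMV kernel in Section 3.1 assigns to one or more warps.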

<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Symbols used in this paper</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Symbol</th>
<th>Remark</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-19">
<!--<alternatives><inline-graphic xlink:href="ieqn-19.tif"/><tex-math id="tex-ieqn-19"><![CDATA[nt]]></tex-math>--><mml:math id="mml-ieqn-19"><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Number of threads per block</td>
</tr>
<tr>
<td><inline-formula id="ieqn-20">
<!--<alternatives><inline-graphic xlink:href="ieqn-20.tif"/><tex-math id="tex-ieqn-20"><![CDATA[nb]]></tex-math>--><mml:math id="mml-ieqn-20"><mml:mi>n</mml:mi><mml:mi>b</mml:mi></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Number of blocks per grid</td>
</tr>
<tr>
<td><inline-formula id="ieqn-21">
<!--<alternatives><inline-graphic xlink:href="ieqn-21.tif"/><tex-math id="tex-ieqn-21"><![CDATA[{N^{sm}}]]></tex-math>--><mml:math id="mml-ieqn-21"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Number of streaming multiprocessors</td>
</tr>
<tr>
<td><inline-formula id="ieqn-22">
<!--<alternatives><inline-graphic xlink:href="ieqn-22.tif"/><tex-math id="tex-ieqn-22"><![CDATA[{N^{reg}}]]></tex-math>--><mml:math id="mml-ieqn-22"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Maximum number of 32-bit registers per multiprocessor</td>
</tr>
<tr>
<td><inline-formula id="ieqn-23">
<!--<alternatives><inline-graphic xlink:href="ieqn-23.tif"/><tex-math id="tex-ieqn-23"><![CDATA[{N^{mem}}]]></tex-math>--><mml:math id="mml-ieqn-23"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Maximum amount of shared memory per multiprocessor</td>
</tr>
<tr>
<td><inline-formula id="ieqn-24">
<!--<alternatives><inline-graphic xlink:href="ieqn-24.tif"/><tex-math id="tex-ieqn-24"><![CDATA[{N^{tb}}]]></tex-math>--><mml:math id="mml-ieqn-24"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Maximum number of blocks per multiprocessor</td>
</tr>
<tr>
<td><inline-formula id="ieqn-25">
<!--<alternatives><inline-graphic xlink:href="ieqn-25.tif"/><tex-math id="tex-ieqn-25"><![CDATA[{N^{td}}]]></tex-math>--><mml:math id="mml-ieqn-25"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula></td>
<td>Maximum number of threads per multiprocessor</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s3_1">
<label>3.1</label>
<title>GEMV Kernel</title>
<p>Given that the GEMV, <italic>Ax</italic>, is composed of dot products of <italic>x</italic> and <italic>A</italic><sup><italic>i</italic></sup> (the <italic>i</italic>th row of <italic>A</italic>), <inline-formula id="ieqn-26">
<!--<alternatives><inline-graphic xlink:href="ieqn-26.tif"/><tex-math id="tex-ieqn-26"><![CDATA[i = 1,2, \cdots ,m]]></tex-math>--><mml:math id="mml-ieqn-26"><mml:mi>i</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>m</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, and these dot products can be computed independently, we assign one or more warps to each dot product in our proposed GEMV kernel. To optimize the GEMV kernel's performance for a given <italic>nt</italic>, we propose the following self-adaptive warp allocation strategy to select the number of warps <italic>k</italic> per dot product:</p>
<p><disp-formula id="eqn-3">
<label>(4)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-3.png"/><tex-math id="tex-eqn-3"><![CDATA[\eqalign{ \max \;k = {N^{sm}} \times {N^{mb}} \times {{nt} \mathord{\left/ {\vphantom {{nt} {{{32} \mathord{\left/ {\vphantom {{32} w}} \right.} w}}}} \right. } {{{32} \mathord{\left/ {\vphantom {{32} w}} \right.} w}}}, \cr &#9; s.t.}]]></tex-math>--><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left" rowspacing=".5em" columnspacing="thickmathspace" displaystyle="true"><mml:mtr><mml:mtd></mml:mtd><mml:mtd><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mn>32</mml:mn></mml:mrow><mml:mi>w</mml:mi></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded 
width="0"><mml:mphantom><mml:mrow><mml:mn>32</mml:mn></mml:mrow><mml:mi>w</mml:mi></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd></mml:mtd><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><disp-formula id="eqn-4">
<label>(5)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-4.png"/><tex-math id="tex-eqn-4"><![CDATA[m \le w,]]></tex-math>--><mml:math id="mml-eqn-4" display="block"><mml:mi>m</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>w</mml:mi><mml:mo>,</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><disp-formula id="eqn-5">
<label>(6)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-5.png"/><tex-math id="tex-eqn-5"><![CDATA[k \le {{nt} \mathord{\left/ {\vphantom {{nt} {32}}} \right. } {32}}\;and\;k = {2^z},\;z \in \left\{ {0,1,2, \cdots } \right\}.]]></tex-math>--><mml:math id="mml-eqn-5" display="block"><mml:mi>k</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:mrow><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mi>z</mml:mi></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><xref ref-type="disp-formula" rid="eqn-3">Eq. (4)</xref> denotes the objective of maximizing the number of warps. <xref ref-type="disp-formula" rid="eqn-4">Eq. (5)</xref> guarantees that each warp group (<italic>k</italic> warps are grouped into a warp group) computes at least one dot product. <xref ref-type="disp-formula" rid="eqn-5">Eq. (6)</xref> guarantees that <italic>k</italic> is a power of two and that the warp-group size is at most the number of threads per block. <inline-formula id="ieqn-27">
<!--<alternatives><inline-graphic xlink:href="ieqn-27.tif"/><tex-math id="tex-ieqn-27"><![CDATA[{N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-27"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> in <xref ref-type="disp-formula" rid="eqn-3">Eq. (4)</xref> is calculated as follows:</p>
<p><disp-formula id="eqn-6">
<label>(7)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-6.png"/><tex-math id="tex-eqn-6"><![CDATA[{N^{mb}}{\rm = }\min ({{{N^{reg}}} \mathord{\left/ {\vphantom {{{N^{reg}}} {N_b^{reg}}}} \right. } {N_b^{reg}}},{{{N^{mem}}} \mathord{\left/ {\vphantom {{{N^{mem}}} {N_b^{mem},{{{N^{td}}} \mathord{\left/ {\vphantom {{{N^{td}}} {nt}}} \right. } {nt}}}}} \right. } {N_b^{mem},{{{N^{td}}} \mathord{\left/ {\vphantom {{{N^{td}}} {nt}}} \right. } {nt}}}},{N^{tb}}),]]></tex-math>--><mml:math id="mml-eqn-6" display="block"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mo>&#x003D;</mml:mo></mml:mrow><mml:mo movablelimits="true" form="prefix">min</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" 
symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p>where <inline-formula id="ieqn-28">
<!--<alternatives><inline-graphic xlink:href="ieqn-28.tif"/><tex-math id="tex-ieqn-28"><![CDATA[N_b^{reg}]]></tex-math>--><mml:math id="mml-ieqn-28"><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:math>
<!--</alternatives>--></inline-formula> and <inline-formula id="ieqn-29">
<!--<alternatives><inline-graphic xlink:href="ieqn-29.tif"/><tex-math id="tex-ieqn-29"><![CDATA[N_b^{mem}]]></tex-math>--><mml:math id="mml-ieqn-29"><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup></mml:math>
<!--</alternatives>--></inline-formula> denote the number of registers and the amount of shared memory required by the threads per block for our proposed GEMV kernel, respectively. Here <inline-formula id="ieqn-30">
<!--<alternatives><inline-graphic xlink:href="ieqn-30.tif"/><tex-math id="tex-ieqn-30"><![CDATA[{N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-30"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is the minimum number of blocks per grid that maximizes the resource utilization per multiprocessor. In this paper, we take <inline-formula id="ieqn-31">
<!--<alternatives><inline-graphic xlink:href="ieqn-31.tif"/><tex-math id="tex-ieqn-31"><![CDATA[{N^{sm}} \times {N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-31"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> as <italic>nb</italic>.</p>
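The selection logic above can be sketched in host-side Python. This is a minimal sketch, not the paper's implementation: the hardware limits and the per-block resource counts passed in below are illustrative placeholder values, not measurements from the paper.

```python
def max_blocks_per_sm(n_reg, n_mem, n_td, n_tb,
                      regs_per_block, smem_per_block, nt):
    """N^{mb} in the spirit of Eq. (7): blocks per multiprocessor limited by
    registers, shared memory, resident threads, and the block-count cap."""
    return min(n_reg // regs_per_block,
               n_mem // smem_per_block,
               n_td // nt,
               n_tb)

def select_warps_per_dot(m, nt, n_sm, n_mb):
    """Self-adaptive warp allocation in the spirit of Eqs. (4)-(6): pick the
    largest power-of-two k with k <= nt/32 such that the number of warp
    groups, N^{sm} * N^{mb} * (nt/32) / k, still covers the m dot products."""
    warps_total = n_sm * n_mb * (nt // 32)
    k = 1
    while (k * 2 <= nt // 32) and (warps_total // (k * 2) >= m):
        k *= 2
    return k
```

For example, with 80 multiprocessors, 8 resident blocks per multiprocessor, and 256 threads per block, a 1024-row matrix admits 4 warps per dot product, while a very tall matrix falls back to 1 warp per row.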
<p>The GEMV kernel is mainly composed of the following three steps:<list list-type="order"><list-item>
<p><bold><italic>x-load step</italic></bold>: In this step, the threads of each block read <italic>x</italic> into the shared memory <italic>xS</italic> in parallel. Because <italic>x</italic> is large, it is read into the shared memory in segments. To take full advantage of the amount of shared memory per multiprocessor, the size of shared memory per block is set as follows:</p>
<p><disp-formula id="eqn-7">
<label>(8)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-7.png"/><tex-math id="tex-eqn-7"><![CDATA[{\rm SIZ}{{\rm E}_{\rm - }}{\rm MEM} = {{{N^{mem}}} \mathord{\left/ {\vphantom {{{N^{mem}}} {{{{N^{mb}}} \mathord{\left/ {\vphantom {{{N^{mb}}} 4}} \right. } 4}}}} \right. } {{{{N^{mb}}} \mathord{\left/ {\vphantom {{{N^{mb}}} 4}} \right. } 4}}},]]></tex-math>--><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">I</mml:mi><mml:mi mathvariant="normal">Z</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">E</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x2212;</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant="normal">M</mml:mi><mml:mi mathvariant="normal">E</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mn>4</mml:mn></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mn>4</mml:mn></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p>where <inline-formula id="ieqn-32">
<!--<alternatives><inline-graphic xlink:href="ieqn-32.tif"/><tex-math id="tex-ieqn-32"><![CDATA[{N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-32"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is calculated by <xref ref-type="disp-formula" rid="eqn-6">Eq. (7)</xref>. The number of times that <italic>x</italic> is loaded is <inline-formula id="ieqn-33">
<!--<alternatives><inline-graphic xlink:href="ieqn-33.tif"/><tex-math id="tex-ieqn-33"><![CDATA[xTimes = n/{\rm SIZE}\_{\rm MEM}]]></tex-math>--><mml:math id="mml-ieqn-33"><mml:mi>x</mml:mi><mml:mi>T</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>n</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">I</mml:mi><mml:mi mathvariant="normal">Z</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:mrow><mml:mi mathvariant="normal">_</mml:mi><mml:mrow><mml:mi mathvariant="normal">M</mml:mi><mml:mi mathvariant="normal">E</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>. In this way, the accesses to <italic>x</italic> are coalesced, and the number of global-memory accesses is reduced because the threads in the same thread block share each loaded section of <italic>x</italic>.</p></list-item>
<list-item>
<p><bold><italic>Partial-reduction step</italic></bold>: Each time a section of <italic>x</italic> has been read into the shared memory, the threads in each warp group perform a partial-style reduction in parallel. Each thread in a warp group performs at most <inline-formula id="ieqn-34">
<!--<alternatives><inline-graphic xlink:href="ieqn-34.tif"/><tex-math id="tex-ieqn-34"><![CDATA[n/{\rm SIZE}\_{\rm WG}]]></tex-math>--><mml:math id="mml-ieqn-34"><mml:mi>n</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">I</mml:mi><mml:mi mathvariant="normal">Z</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:mrow><mml:mi mathvariant="normal">_</mml:mi><mml:mrow><mml:mi mathvariant="normal">W</mml:mi><mml:mi mathvariant="normal">G</mml:mi></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> reductions, and the accesses to the global-memory array <italic>a</italic> are coalesced. Here <inline-formula id="ieqn-35">
<!--<alternatives><inline-graphic xlink:href="ieqn-35.tif"/><tex-math id="tex-ieqn-35"><![CDATA[{\rm SIZE}\_{\rm WG}]]></tex-math>--><mml:math id="mml-ieqn-35"><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">I</mml:mi><mml:mi mathvariant="normal">Z</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:mrow><mml:mi mathvariant="normal">_</mml:mi><mml:mrow><mml:mi mathvariant="normal">W</mml:mi><mml:mi mathvariant="normal">G</mml:mi></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is the number of threads in a warp group and is equal to <italic>k</italic> &#x00D7; 32.</p></list-item><list-item>
<p><bold><italic>Warp-reduction step</italic></bold>: After the threads in each warp group have completed the partial-style reductions, fast shuffle instructions are used to perform a warp-style reduction within each warp of these warp groups. The resulting per-warp values are stored in the shared memory, and the per-warp values of each warp group are then reduced to a single output value in parallel.</p></list-item></list></p>
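The data flow of the three steps can be checked with a minimal sequential emulation in Python. This is only a sketch of the arithmetic, not the parallel kernel: the warp-group size and shared-memory section size below are illustrative, and the per-thread accumulators stand in for the partial- and warp-reduction stages.

```python
def gemv_grouped(A, x, size_wg=4, size_mem=3):
    """Emulate the GEMV kernel's steps for y = A x:
    x-load step: x is consumed in sections of size_mem (the "shared memory");
    partial-reduction step: the size_wg "threads" of a warp group stride
    over one row, each keeping a private partial sum;
    warp-reduction step: the partial sums are combined into one output."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for row in range(m):                      # one warp group per row
        partial = [0.0] * size_wg             # per-thread accumulators
        for base in range(0, n, size_mem):    # x-load step, section by section
            xs = x[base:base + size_mem]
            for t in range(size_wg):          # partial-reduction step
                for j in range(t, len(xs), size_wg):
                    partial[t] += A[row][base + j] * xs[j]
        y[row] = sum(partial)                 # warp-reduction step
    return y
```

The result is independent of the chosen group and section sizes, which only change how the work is partitioned.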
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>GEMV-T Kernel</title>
<p>The GEMV-T, <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic>, is composed of dot products of <italic>x</italic> and <italic>(A</italic><sup><italic>T</italic></sup><italic>)</italic><sup><italic>i</italic></sup> (the <italic>i</italic>th column of <italic>A</italic>), <inline-formula id="ieqn-36">
<!--<alternatives><inline-graphic xlink:href="ieqn-36.tif"/><tex-math id="tex-ieqn-36"><![CDATA[i = 1,2, \cdots ,n]]></tex-math>--><mml:math id="mml-ieqn-36"><mml:mi>i</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, and these dot products can be computed independently. Given that the vector <italic>x</italic> in the GEMV-T is small, we assign one or more threads to each dot product in our proposed GEMV-T kernel. To optimize the GEMV-T kernel's performance for a given <italic>nt</italic>, we propose the following self-adaptive thread allocation strategy to select the number of threads <italic>k</italic> per dot product:</p>
<p><disp-formula id="eqn-8">
<label>(9)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-8.png"/><tex-math id="tex-eqn-8"><![CDATA[\eqalign{ \max \;k = {N^{sm}} \times {N^{mb}} \times {{2 \times nt} \mathord{\left/ {\vphantom {{2 \times nt} w}} \right. } w}, \cr &#9; s.t.}]]></tex-math>--><mml:math id="mml-eqn-8" display="block"><mml:mtable columnalign="right left" rowspacing=".5em" columnspacing="thickmathspace" displaystyle="true"><mml:mtr><mml:mtd></mml:mtd><mml:mtd><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mi>w</mml:mi></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd></mml:mtd><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><disp-formula id="eqn-9">
<label>(10)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-9.png"/><tex-math id="tex-eqn-9"><![CDATA[n \le w,]]></tex-math>--><mml:math id="mml-eqn-9" display="block"><mml:mi>n</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>w</mml:mi><mml:mo>,</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><disp-formula id="eqn-10">
<label>(11)</label>
<!--<alternatives>
<graphic mimetype="image" mime-subtype="png" xlink:href="eqn-10.png"/><tex-math id="tex-eqn-10"><![CDATA[k \le 32\;and\;k = {2^z},\;z \in \left\{ {0,1,2, \cdots } \right\}.]]></tex-math>--><mml:math id="mml-eqn-10" display="block"><mml:mi>k</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mn>32</mml:mn><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>k</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mi>z</mml:mi></mml:msup></mml:mrow><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math>
<!--</alternatives>--></disp-formula></p>
<p><xref ref-type="disp-formula" rid="eqn-8">Eq. (9)</xref> denotes the objective of maximizing the number of threads. <xref ref-type="disp-formula" rid="eqn-9">Eq. (10)</xref> guarantees that each thread group (<italic>k</italic> threads are grouped into a thread group) computes at least one dot product. <xref ref-type="disp-formula" rid="eqn-10">Eq. (11)</xref> guarantees that <italic>k</italic> is a power of two and that the thread-group size is at most the size of a warp. <inline-formula id="ieqn-37">
<!--<alternatives><inline-graphic xlink:href="ieqn-37.tif"/><tex-math id="tex-ieqn-37"><![CDATA[{N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-37"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> in <xref ref-type="disp-formula" rid="eqn-8">Eq. (9)</xref> is calculated by the same equation as <xref ref-type="disp-formula" rid="eqn-6">Eq. (7)</xref>, except that <inline-formula id="ieqn-38">
<!--<alternatives><inline-graphic xlink:href="ieqn-38.tif"/><tex-math id="tex-ieqn-38"><![CDATA[N_b^{reg}]]></tex-math>--><mml:math id="mml-ieqn-38"><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup></mml:math>
<!--</alternatives>--></inline-formula> and <inline-formula id="ieqn-39">
<!--<alternatives><inline-graphic xlink:href="ieqn-39.tif"/><tex-math id="tex-ieqn-39"><![CDATA[N_b^{mem}]]></tex-math>--><mml:math id="mml-ieqn-39"><mml:msubsup><mml:mi>N</mml:mi><mml:mi>b</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup></mml:math>
<!--</alternatives>--></inline-formula> denote the number of registers and the amount of shared memory required by the threads per block for our proposed GEMV-T kernel, respectively. To maximize the resource utilization per multiprocessor, we take <inline-formula id="ieqn-40">
<!--<alternatives><inline-graphic xlink:href="ieqn-40.tif"/><tex-math id="tex-ieqn-40"><![CDATA[{N^{sm}} \times {N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-40"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> as <italic>nb</italic>.</p>
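Analogously to the GEMV case, the thread allocation can be sketched in host-side Python. This is a sketch under assumptions: N^{mb} is taken as an input already computed in the manner of Eq. (7), and the cap of 32 threads per dot product comes from the warp-size constraint in Eq. (11).

```python
def select_threads_per_dot(n, nt, n_sm, n_mb):
    """Self-adaptive thread allocation in the spirit of Eqs. (9)-(11): pick
    the largest power-of-two k with k <= 32 (one warp) such that the number
    of thread groups, N^{sm} * N^{mb} * nt / k, still covers the n columns."""
    threads_total = n_sm * n_mb * nt
    k = 1
    while (k * 2 <= 32) and (threads_total // (k * 2) >= n):
        k *= 2
    return k
```

With plentiful threads relative to the number of columns, the strategy saturates at a full warp per dot product; for very wide matrices it degenerates to one thread per column.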
<p>Similar to the GEMV kernel, our proposed GEMV-T kernel is also composed of the <bold><italic>x-load step</italic></bold>, <bold><italic>partial-reduction step</italic></bold> and <bold><italic>warp-reduction step</italic></bold>.<list list-type="order"><list-item>
<p><bold><italic>x-load step</italic></bold>: As in the GEMV kernel, in this step the threads of each block read the elements of <italic>x</italic> into the shared memory <italic>xS</italic> in parallel. Given that <italic>x</italic> is small, it is read into the shared memory all at once in the GEMV-T kernel.</p></list-item><list-item>
<p><bold><italic>Partial-reduction step</italic></bold>: Since the matrix <italic>A</italic> is stored in a row-major, 0-based index format, the accesses to <italic>A</italic> will not be coalesced if the thread groups are constructed inappropriately. For example, assume that <italic>A</italic> is a <inline-formula id="ieqn-41">
<!--<alternatives><inline-graphic xlink:href="ieqn-41.tif"/><tex-math id="tex-ieqn-41"><![CDATA[4 \times 8]]></tex-math>--><mml:math id="mml-ieqn-41"><mml:mn>4</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>8</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> matrix as shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, 16 threads in a thread block are launched, and 2 threads are assigned to a dot product in the GEMV-T kernel. If we use the following thread groups <inline-formula id="ieqn-42">
<!--<alternatives><inline-graphic xlink:href="ieqn-42.tif"/><tex-math id="tex-ieqn-42"><![CDATA[\left\{ {0,1} \right\},\left\{ {2,3} \right\},\left\{ {4,5} \right\}, \cdots ,\left\{ {14,15} \right\}]]></tex-math>--><mml:math id="mml-ieqn-42"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>4</mml:mn><mml:mo>,</mml:mo><mml:mn>5</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>14</mml:mn><mml:mo>,</mml:mo><mml:mn>15</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, the accesses to <italic>A</italic> will not be coalesced. However, when the thread groups <inline-formula id="ieqn-43">
<!--<alternatives><inline-graphic xlink:href="ieqn-43.tif"/><tex-math id="tex-ieqn-43"><![CDATA[\left\{ {0,8} \right\},\left\{ {1,9} \right\},\left\{ {2,10} \right\}, \cdots ,\left\{ {7,15} \right\}]]></tex-math>--><mml:math id="mml-ieqn-43"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>8</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>9</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>10</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>7</mml:mn><mml:mo>,</mml:mo><mml:mn>15</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> are utilized, the accesses to <italic>A</italic> are coalesced, as shown in <xref ref-type="fig" rid="fig-1">Fig. 1b</xref>. Therefore, in the partial-reduction step, the thread groups are created according to <bold>Definition</bold> 3.1 below, which ensures that the accesses to <italic>A</italic> are coalesced.</p></list-item></list>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>Access to A</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-1.png"/>
</fig>
</p>
<p><bold>Definition</bold> 3.1: Assume that the size of the thread block is <italic>s</italic>, <italic>h</italic> threads are assigned to a dot product in <italic>A</italic><sup><italic>T</italic></sup><italic>x</italic>, and <inline-formula id="ieqn-44">
<!--<alternatives><inline-graphic xlink:href="ieqn-44.tif"/><tex-math id="tex-ieqn-44"><![CDATA[z = s/h]]></tex-math>--><mml:math id="mml-ieqn-44"><mml:mi>z</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>s</mml:mi><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mi>h</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>. The thread groups are created as follows: <inline-formula id="ieqn-45">
<!--<alternatives><inline-graphic xlink:href="ieqn-45.tif"/><tex-math id="tex-ieqn-45"><![CDATA[\left\{ {0,z, \cdots ,\left( {h - 1} \right) \times z} \right\}]]></tex-math>--><mml:math id="mml-ieqn-45"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>z</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-46">
<!--<alternatives><inline-graphic xlink:href="ieqn-46.tif"/><tex-math id="tex-ieqn-46"><![CDATA[\left\{ {1,z + 1, \cdots ,\left( {h - 1} \right) \times z + 1} \right\}]]></tex-math>--><mml:math id="mml-ieqn-46"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, <inline-formula id="ieqn-47">
<!--<alternatives><inline-graphic xlink:href="ieqn-47.tif"/><tex-math id="tex-ieqn-47"><![CDATA[ \cdots]]></tex-math>--><mml:math id="mml-ieqn-47"><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow><mml:mo>&#x22EF;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>,<inline-formula id="ieqn-48">
<!--<alternatives><inline-graphic xlink:href="ieqn-48.tif"/><tex-math id="tex-ieqn-48"><![CDATA[\left\{ {z - 1,2 \times z - 1, \cdots ,\left( {h - 1} \right) \times z + z - 1} \right\}]]></tex-math>--><mml:math id="mml-ieqn-48"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>.</p>
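<p>The grouping rule in <bold>Definition</bold> 3.1 can be sketched in a few lines. The following Python helper is an illustrative sketch, not part of the CUDA implementation; it enumerates the groups and reproduces the example of 16 threads with <italic>h</italic> = 2 given above:</p>

```python
def make_thread_groups(s, h):
    """Build the thread groups of Definition 3.1.

    s: thread-block size; h: threads assigned to one dot product.
    Group j (j = 0..z-1, with z = s/h) contains the threads
    {j, z + j, ..., (h - 1) * z + j}, so that at any step the active
    threads are consecutive and their reads of A are coalesced.
    """
    assert s % h == 0, "block size must be divisible by h"
    z = s // h
    return [[k * z + j for k in range(h)] for j in range(z)]

# The running example: s = 16 threads with h = 2 threads per dot
# product yields the groups {0, 8}, {1, 9}, ..., {7, 15}.
groups = make_thread_groups(16, 2)
```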
<p>For the elements of <italic>x</italic> that are read into the shared memory, the threads in each thread group perform in parallel a partial-style reduction similar to that in the GEMV kernel.</p></list-item></list>
<p><list list-type="simple"><list-item>
<p>3.&#x2002;<bold><italic>warp-reduction step</italic></bold>: The threads in a thread group are usually not in the same warp, so we cannot use the shuffle instruction to reduce their partial-style reduction values. Therefore, in the warp-reduction stage, we store the partial-style reduction values obtained by the threads in each thread group to the shared memory, and then reduce them in the shared memory to an output value in parallel.</p></list-item></list></p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Vector-Operation and Inner-Product Kernels</title>
<p>When parallelizing FISTA on the GPU, vector-operation and inner-product kernels are needed. Although CUBLAS performs well for vector operations and the inner product of vectors, it does not allow several operations to be grouped into a single kernel. To optimize these operations, we therefore group several operations into a single kernel, adopting the idea of constructing the vector-operation and inner-product decision trees suggested by Gao et al. [<xref ref-type="bibr" rid="ref-29">29</xref>]. Utilizing these decision trees, the optimal vector-operation and inner-product kernels and their corresponding CUDA parameters can be obtained. Readers interested in this technique are referred to the publication [<xref ref-type="bibr" rid="ref-29">29</xref>].</p>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Optimization</title>
<p>Assume that <inline-formula id="ieqn-49">
<!--<alternatives><inline-graphic xlink:href="ieqn-49.tif"/><tex-math id="tex-ieqn-49"><![CDATA[s = {N^{sm}} \times {N^{mb}} \times {{nt} \mathord{\left/ {\vphantom {{nt} {32}}} \right. } {32}}]]></tex-math>--><mml:math id="mml-ieqn-49"><mml:mi>s</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true">/</mml:mo><mml:mrow><mml:mrow><mml:mpadded width="0"><mml:mphantom><mml:mrow><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:mphantom></mml:mpadded></mml:mrow></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>, where <inline-formula id="ieqn-50">
<!--<alternatives><inline-graphic xlink:href="ieqn-50.tif"/><tex-math id="tex-ieqn-50"><![CDATA[{N^{mb}}]]></tex-math>--><mml:math id="mml-ieqn-50"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> is calculated by <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>, and <inline-formula id="ieqn-51">
<!--<alternatives><inline-graphic xlink:href="ieqn-51.tif"/><tex-math id="tex-ieqn-51"><![CDATA[m = l \times s + \Delta s]]></tex-math>--><mml:math id="mml-ieqn-51"><mml:mi>m</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, where <italic>l</italic> &#x003D; 0, 1, 2,<inline-formula id="ieqn-52">
<!--<alternatives><inline-graphic xlink:href="ieqn-52.tif"/><tex-math id="tex-ieqn-52"><![CDATA[{\rm \; } \cdots]]></tex-math>--><mml:math id="mml-ieqn-52"><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow><mml:mo>&#x22EF;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, and <inline-formula id="ieqn-53">
<!--<alternatives><inline-graphic xlink:href="ieqn-53.tif"/><tex-math id="tex-ieqn-53"><![CDATA[\Delta s]]></tex-math>--><mml:math id="mml-ieqn-53"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> is a nonnegative integer smaller than <italic>s</italic>. We observe that for a matrix with <inline-formula id="ieqn-54">
<!--<alternatives><inline-graphic xlink:href="ieqn-54.tif"/><tex-math id="tex-ieqn-54"><![CDATA[m = 2 \times 960 + 30{\rm \; }]]></tex-math>--><mml:math id="mml-ieqn-54"><mml:mi>m</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>960</mml:mn><mml:mo>&#x002B;</mml:mo><mml:mn>30</mml:mn><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> and <italic>n</italic> &#x003D; 102,400 on the GTX 1070 GPU with <italic>nt</italic> &#x003D; 1,024, the GEMV kernel with one warp per row is selected according to <xref ref-type="disp-formula" rid="eqn-4">Eq. (4)</xref> and achieves 103.18 GFLOPS. Now divide this matrix into two blocks <inline-formula id="ieqn-55">
<!--<alternatives><inline-graphic xlink:href="ieqn-55.tif"/><tex-math id="tex-ieqn-55"><![CDATA[B\left( {m = 2 \times 960,n = 102,400} \right)]]></tex-math>--><mml:math id="mml-ieqn-55"><mml:mi>B</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mn>960</mml:mn><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>102</mml:mn><mml:mo>,</mml:mo><mml:mn>400</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> and <inline-formula id="ieqn-56">
<!--<alternatives><inline-graphic xlink:href="ieqn-56.tif"/><tex-math id="tex-ieqn-56"><![CDATA[C\left( {m = 30,n = 102,400} \right)]]></tex-math>--><mml:math id="mml-ieqn-56"><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>30</mml:mn><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>102</mml:mn><mml:mo>,</mml:mo><mml:mn>400</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>. If <italic>B</italic> and <italic>C</italic> are calculated by one warp per row and 32 warps per row, respectively, we obtain 107.25 GFLOPS, an improvement of 4.07 GFLOPS. Why? For the GEMV kernel, if <inline-formula id="ieqn-57">
<!--<alternatives><inline-graphic xlink:href="ieqn-57.tif"/><tex-math id="tex-ieqn-57"><![CDATA[\Delta s=0]]></tex-math>--><mml:math id="mml-ieqn-57"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo>&#x003D;</mml:mo></mml:mrow><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> and <italic>l</italic> &#x003E; 0, each warp calculates the same number of rows and thus good performance is obtained. However, when <inline-formula id="ieqn-58">
<!--<alternatives><inline-graphic xlink:href="ieqn-58.tif"/><tex-math id="tex-ieqn-58"><![CDATA[\Delta s \ne 0]]></tex-math>--><mml:math id="mml-ieqn-58"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> and <italic>l &#x003E; 0</italic>, the performance of the GEMV kernel decreases because many warps are idle when computing the remaining <inline-formula id="ieqn-59">
<!--<alternatives><inline-graphic xlink:href="ieqn-59.tif"/><tex-math id="tex-ieqn-59"><![CDATA[\Delta s]]></tex-math>--><mml:math id="mml-ieqn-59"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> rows. Based on the above observations, we optimize the GEMV kernel as follows.<list list-type="bullet"><list-item>
<p>When <inline-formula id="ieqn-60">
<!--<alternatives><inline-graphic xlink:href="ieqn-60.tif"/><tex-math id="tex-ieqn-60"><![CDATA[\Delta s=0, \;l > 0]]></tex-math>--><mml:math id="mml-ieqn-60"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo>&#x003D;</mml:mo></mml:mrow><mml:mn>0</mml:mn><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>l</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> or <inline-formula id="ieqn-61">
<!--<alternatives><inline-graphic xlink:href="ieqn-61.tif"/><tex-math id="tex-ieqn-61"><![CDATA[\Delta s \ne 0 ,\;l = 0]]></tex-math>--><mml:math id="mml-ieqn-61"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>l</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula>, the GEMV kernel shown in Section 3.1 is applied.</p></list-item><list-item>
<p>Otherwise, on the basis of the GEMV kernel, we construct a new kernel GEMV Kernel-I to calculate the GEMV on the GPU. In this kernel, each row of the first <inline-formula id="ieqn-62">
<!--<alternatives><inline-graphic xlink:href="ieqn-62.tif"/><tex-math id="tex-ieqn-62"><![CDATA[l \times s]]></tex-math>--><mml:math id="mml-ieqn-62"><mml:mi>l</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> rows is assigned to one warp, and each row of the last <inline-formula id="ieqn-63">
<!--<alternatives><inline-graphic xlink:href="ieqn-63.tif"/><tex-math id="tex-ieqn-63"><![CDATA[\Delta s]]></tex-math>--><mml:math id="mml-ieqn-63"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> rows is calculated by self-adaptive multiple warps.</p></list-item></list></p>
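<p>The row decomposition behind GEMV Kernel-I can be sketched as follows. The helper name <monospace>partition_rows</monospace> is ours, and <italic>s</italic> = 960 follows the text&#x2019;s GTX 1070 example; both are illustrative assumptions:</p>

```python
def partition_rows(m, s):
    """Split m rows as m = l * s + delta_s with 0 <= delta_s < s.

    Mirrors the decomposition behind GEMV Kernel-I: s is the number of
    warps the GPU keeps resident, each of the first l * s rows gets one
    warp, and each of the remaining delta_s rows is computed by several
    warps so that no warp sits idle on the tail.
    """
    l, delta_s = divmod(m, s)
    return l, delta_s

# The example from the text: m = 2 * 960 + 30 rows with s = 960 warps
# splits into a block B of 1920 rows (one warp per row) and a block C
# of 30 rows (multiple warps per row).
l, ds = partition_rows(2 * 960 + 30, 960)
```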
<p>Similar to the GEMV kernel, for the GEMV-T kernel, we assume that <inline-formula id="ieqn-64">
<!--<alternatives><inline-graphic xlink:href="ieqn-64.tif"/><tex-math id="tex-ieqn-64"><![CDATA[s = {N^{sm}} \times {N^{mb}} \times nt]]></tex-math>--><mml:math id="mml-ieqn-64"><mml:mi>s</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> and <inline-formula id="ieqn-65">
<!--<alternatives><inline-graphic xlink:href="ieqn-65.tif"/><tex-math id="tex-ieqn-65"><![CDATA[m = l \times s + \Delta s]]></tex-math>--><mml:math id="mml-ieqn-65"><mml:mi>m</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, where <italic>l</italic> &#x003D; 0, 1, 2,<inline-formula id="ieqn-66">
<!--<alternatives><inline-graphic xlink:href="ieqn-66.tif"/><tex-math id="tex-ieqn-66"><![CDATA[{\rm \; } \cdots]]></tex-math>--><mml:math id="mml-ieqn-66"><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow><mml:mo>&#x22EF;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, and <inline-formula id="ieqn-67">
<!--<alternatives><inline-graphic xlink:href="ieqn-67.tif"/><tex-math id="tex-ieqn-67"><![CDATA[\Delta s]]></tex-math>--><mml:math id="mml-ieqn-67"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> is a nonnegative integer smaller than <italic>s</italic>, and then optimize it as follows.<list list-type="bullet"><list-item>
<p>When <inline-formula id="ieqn-68">
<!--<alternatives><inline-graphic xlink:href="ieqn-68.tif"/><tex-math id="tex-ieqn-68"><![CDATA[\Delta s=0, \;l > 0]]></tex-math>--><mml:math id="mml-ieqn-68"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo>&#x003D;</mml:mo></mml:mrow><mml:mn>0</mml:mn><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>l</mml:mi><mml:mo>&#x003E;</mml:mo><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula> or <inline-formula id="ieqn-69">
<!--<alternatives><inline-graphic xlink:href="ieqn-69.tif"/><tex-math id="tex-ieqn-69"><![CDATA[\Delta s \ne 0, \;l = 0]]></tex-math>--><mml:math id="mml-ieqn-69"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x2260;</mml:mo><mml:mn>0</mml:mn><mml:mspace width="thickmathspace"></mml:mspace><mml:mi>l</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mn>0</mml:mn></mml:math>
<!--</alternatives>--></inline-formula>, the GEMV-T kernel shown in Section 3.2 is applied.</p></list-item><list-item>
<p>Otherwise, on the basis of the GEMV-T kernel, we construct a new kernel GEMV-T Kernel-I to calculate the GEMV-T on the GPU. In this kernel, each row of the first <inline-formula id="ieqn-70">
<!--<alternatives><inline-graphic xlink:href="ieqn-70.tif"/><tex-math id="tex-ieqn-70"><![CDATA[l \times s]]></tex-math>--><mml:math id="mml-ieqn-70"><mml:mi>l</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> rows is assigned to one thread, and each row of the last <inline-formula id="ieqn-71">
<!--<alternatives><inline-graphic xlink:href="ieqn-71.tif"/><tex-math id="tex-ieqn-71"><![CDATA[\Delta s]]></tex-math>--><mml:math id="mml-ieqn-71"><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> rows is calculated by self-adaptive multiple threads.</p></list-item></list></p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Concurrent L1-min Solvers on a GPU</title>
<p>In this section, based on FISTA, we present two concurrent L1-min solvers on a GPU, which are designed from the perspective of the streams and the thread blocks, respectively.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Streams</title>
<p>Utilizing the multi-stream feature of the GPU, on the basis of FISTA, we design a concurrent L1-min solver, called CFISTASOL-SM, to solve the concurrent L1-min problem. Since the L1-min problems included in the concurrent L1-min problem can be computed independently, each of them is assigned to a stream in the proposed CFISTASOL-SM. <xref ref-type="fig" rid="fig-2">Fig. 2</xref> shows the parallel framework of CFISTASOL-SM, which 1) illustrates the tasks of the CPU and each stream, 2) lists the execution steps of CFISTASOL-SM on each stream, and 3) designates which operations are grouped into a single kernel.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Parallel framework of CFISTASOL-SM</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-2.png"/>
</fig>
<p>For <inline-formula id="ieqn-72">
<!--<alternatives><inline-graphic xlink:href="ieqn-72.tif"/><tex-math id="tex-ieqn-72"><![CDATA[temp = A{y_k} - b,\;\nabla f({y_k}) = {A^T}temp]]></tex-math>--><mml:math id="mml-ieqn-72"><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>A</mml:mi><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace"></mml:mspace><mml:mi mathvariant="normal">&#x2207;</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>A</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>, and <inline-formula id="ieqn-73">
<!--<alternatives><inline-graphic xlink:href="ieqn-73.tif"/><tex-math id="tex-ieqn-73"><![CDATA[{z_k} = A{x_{k + 1}}]]></tex-math>--><mml:math id="mml-ieqn-73"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo>&#x003D;</mml:mo><mml:mi>A</mml:mi><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, they are straightforward to implement with our proposed GEMV and GEMV-T methods on the GPU. The optimal vector-operation and inner-product kernels and their corresponding CUDA parameters are chosen by using the vector-operation and inner-product decision trees.</p>
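<p>The per-stream steps in Fig. 2 follow the standard FISTA recurrence. The NumPy sketch below is an illustration, not the paper&#x2019;s implementation; the function name <monospace>fista</monospace>, the step size 1/<italic>L</italic>, and the weight <monospace>lam</monospace> are our choices. It shows how each iteration reduces to one GEMV, one GEMV-T, and a few vector operations:</p>

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Minimal FISTA sketch for min_x 0.5*||A x - b||^2 + lam*||x||_1.

    Each iteration uses exactly the kernels discussed in the text:
    a GEMV (A @ y), a GEMV-T (A.T @ temp), and vector operations.
    """
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(n); y = x.copy(); t = 1.0
    for _ in range(iters):
        temp = A @ y - b                     # GEMV kernel
        grad = A.T @ temp                    # GEMV-T kernel
        v = y - grad / L
        x_new = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft threshold
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + (t - 1.0) / t_new * (x_new - x)                # momentum step
        x, t = x_new, t_new
    return x

# Usage: recover a sparse vector from a few random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[[3, 17, 58]] = [1.0, -2.0, 1.5]
x_hat = fista(A, A @ x_true, lam=0.01)
```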
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Thread Blocks</title>
<p>For a specific GPU, the maximum number of resident thread blocks can be calculated as <inline-formula id="ieqn-74">
<!--<alternatives><inline-graphic xlink:href="ieqn-74.tif"/><tex-math id="tex-ieqn-74"><![CDATA[TBs = {N^{sm}} \times {N^{tb}}.]]></tex-math>--><mml:math id="mml-ieqn-74"><mml:mi>T</mml:mi><mml:mi>B</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>.</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> Utilizing the features of multiple thread blocks, based on FISTA, we present a concurrent L1-min solver, called CFISTASOL-TB, to solve the concurrent L1-min problem. In CFISTASOL-TB, an L1-min problem is assigned to one thread block. For each thread block, the parallel FISTA for solving its L1-min problem is constructed in the same way as the per-stream implementation in Section 4.1, except that CFISTASOL-TB is implemented in a single kernel.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Optimization</title>
<p>When the <italic>i</italic>th element of <italic>x</italic> is equal to zero, all elements in the <italic>i</italic>th column of <inline-formula id="ieqn-75">
<!--<alternatives><inline-graphic xlink:href="ieqn-75.tif"/><tex-math id="tex-ieqn-75"><![CDATA[A]]></tex-math>--><mml:math id="mml-ieqn-75"><mml:mi>A</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> need not be accessed, because they make no contribution to the output vector. As FISTA iterates, <italic>x</italic> becomes increasingly sparse, so many columns of <italic>A</italic> are never accessed. In this way, we can improve the performance of CFISTASOL-SM and CFISTASOL-TB by reducing accesses to the global-memory array <italic>a</italic>. Furthermore, since every thread block of CFISTASOL-TB and CFISTASOL-SM needs to access <italic>a</italic>, we cache <italic>a</italic> in the read-only data cache to reduce the number of global-memory accesses. With the read-only data cache, <italic>a</italic> is shared by all thread blocks and can be accessed quickly.</p>
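<p>The column-skipping idea can be illustrated with a small NumPy sketch; the function name and the dimensions are hypothetical. Only the columns of <italic>A</italic> on the support of <italic>x</italic> are read:</p>

```python
import numpy as np

def gemv_on_support(A, x):
    """Compute A @ x while touching only the columns of A whose
    corresponding entry of x is nonzero -- the memory-access saving
    described in Section 4.3. As FISTA's iterate grows sparser,
    fewer and fewer columns of A need to be read.
    """
    support = np.flatnonzero(x)          # indices i with x[i] != 0
    if support.size == 0:
        return np.zeros(A.shape[0])
    return A[:, support] @ x[support]

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 1000))
x = np.zeros(1000); x[[5, 250, 999]] = [2.0, -1.0, 0.5]
y = gemv_on_support(A, x)                # reads 3 columns instead of 1000
```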
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Concurrent L1-min Solver on Multiple GPUs</title>
<p>When the concurrent L1-min problem includes a large number of L1-min problems, we can easily construct a multi-GPU solver by letting each GPU execute CFISTASOL-SM or CFISTASOL-TB. Here, however, we design a concurrent L1-min solver on multiple GPUs, called CFISTASOL-MGPU, where each GPU solves only one L1-min problem at a time instead of solving multiple L1-min problems via the streams and the thread blocks. This solver targets the case where the number of L1-min problems in the concurrent L1-min problem is much smaller than the number of streams or thread blocks. <xref ref-type="fig" rid="fig-3">Fig. 3</xref> shows the parallel framework of CFISTASOL-MGPU, which 1) illustrates the CPU/GPU tasks, 2) lists the execution steps of CFISTASOL-MGPU on each GPU, and 3) designates which operations are grouped into a single kernel. The operations in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> are easily executed on each GPU using the kernels in Section 3.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Parallel framework of CFISTASOL-MGPU</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-3.png"/>
</fig>
</sec>
<sec id="s6">
<label>6</label>
<title>Performance Evaluation and Analysis</title>
<p>In this section, we first investigate the effectiveness of our proposed GEMV and GEMV-T kernels by comparing them with the GEMV and GEMV-T implementations in the CUBLAS library [<xref ref-type="bibr" rid="ref-26">26</xref>] and those presented by Gao et al. [<xref ref-type="bibr" rid="ref-17">17</xref>]. Second, we test the performance of our proposed concurrent L1-min solvers on a GPU, CFISTASOL-SM and CFISTASOL-TB. Finally, we test the performance of our proposed concurrent L1-min solver on multiple GPUs, CFISTASOL-MGPU. In the experiments, the number of threads per block is set to 1024 for all algorithms.</p>
<p><xref ref-type="table" rid="table-2">Tab. 2</xref> shows NVIDIA GPUs that are used in the performance evaluations. Our source codes are compiled and executed using the CUDA toolkit 10.1. The measured GPU performance for all experiments does not include the data transfer (from the GPU to the CPU or from the CPU to the GPU). The test matrices, which come from the publication [<xref ref-type="bibr" rid="ref-17">17</xref>], are shown in <xref ref-type="table" rid="table-3">Tab. 3</xref>. The elemental values of each test matrix are randomly generated according to the normal distribution. The performance is measured in terms of GFLOPS, which is obtained by <inline-formula id="ieqn-76">
<!--<alternatives><inline-graphic xlink:href="ieqn-76.tif"/><tex-math id="tex-ieqn-76"><![CDATA[2 \times m \times n]]></tex-math>--><mml:math id="mml-ieqn-76"><mml:mn>2</mml:mn><mml:mo>&#x00D7;</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>n</mml:mi></mml:math>
<!--</alternatives>--></inline-formula> divided by the matrix-vector multiplication kernel execution time (in seconds) [<xref ref-type="bibr" rid="ref-30">30</xref>].</p>
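<p>As a small illustration of this metric (the helper name is ours, not from the experiments):</p>

```python
def gemv_gflops(m, n, seconds):
    """GFLOPS of one m-by-n matrix-vector multiplication:
    2*m*n floating-point operations (one multiply and one add per
    matrix element) divided by the kernel time in seconds, scaled
    to units of 10^9 operations per second.
    """
    return 2.0 * m * n / seconds / 1e9

# e.g. a 1,000 x 100,000 GEMV finishing in 2 ms sustains 100 GFLOPS.
rate = gemv_gflops(1000, 100_000, 0.002)
```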

<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Overview of GPUs</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Hardware</th>
<th>K40c</th>
<th>GTX1070</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cores</td>
<td>2880</td>
<td>1920</td>
</tr>
<tr>
<td>Clock speed (GHz)</td>
<td>0.74</td>
<td>1.56</td>
</tr>
<tr>
<td>Memory type</td>
<td>GDDR5</td>
<td>GDDR5</td>
</tr>
<tr>
<td>Memory size (GB)</td>
<td>12</td>
<td>8</td>
</tr>
<tr>
<td>Max bandwidth (GB/s)</td>
<td>288</td>
<td>256</td>
</tr>
<tr>
<td>Compute capability</td>
<td>3.5</td>
<td>6.1</td>
</tr>
</tbody>
</table>
</table-wrap>

<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Test matrices</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Seq</th>
<th>Matrix</th>
<th>Rows (<inline-formula id="ieqn-77">
<!--<alternatives><inline-graphic xlink:href="ieqn-77.tif"/><tex-math id="tex-ieqn-77"><![CDATA[m]]></tex-math>--><mml:math id="mml-ieqn-77"><mml:mi>m</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>)</th>
<th>Columns (<inline-formula id="ieqn-78">
<!--<alternatives><inline-graphic xlink:href="ieqn-78.tif"/><tex-math id="tex-ieqn-78"><![CDATA[n]]></tex-math>--><mml:math id="mml-ieqn-78"><mml:mi>n</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Mat01</td>
<td>32</td>
<td>8,388,608</td>
</tr>
<tr>
<td>2</td>
<td>Mat02</td>
<td>50</td>
<td>5,368,709</td>
</tr>
<tr>
<td>3</td>
<td>Mat03</td>
<td>64</td>
<td>4,194,304</td>
</tr>
<tr>
<td>4</td>
<td>Mat04</td>
<td>100</td>
<td>2,684,350</td>
</tr>
<tr>
<td>5</td>
<td>Mat05</td>
<td>128</td>
<td>2,097,152</td>
</tr>
<tr>
<td>6</td>
<td>Mat06</td>
<td>200</td>
<td>1,342,200</td>
</tr>
<tr>
<td>7</td>
<td>Mat07</td>
<td>256</td>
<td>1,048,576</td>
</tr>
<tr>
<td>8</td>
<td>Mat08</td>
<td>400</td>
<td>671,100</td>
</tr>
<tr>
<td>9</td>
<td>Mat09</td>
<td>512</td>
<td>524,288</td>
</tr>
<tr>
<td>10</td>
<td>Mat10</td>
<td>800</td>
<td>335,850</td>
</tr>
<tr>
<td>11</td>
<td>Mat11</td>
<td>1024</td>
<td>262,144</td>
</tr>
<tr>
<td>12</td>
<td>Mat12</td>
<td>1600</td>
<td>166,900</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s6_1">
<label>6.1</label>
<title>Performance Evaluation and Analysis of Matrix-Vector Multiplications</title>
<p>First, we compare the GEMV and GEMV-T kernels with the implementations in the CUBLAS library [<xref ref-type="bibr" rid="ref-26">26</xref>] and those presented by Gao et al. [<xref ref-type="bibr" rid="ref-17">17</xref>]. The test matrices are shown in <xref ref-type="table" rid="table-3">Tab. 3</xref>. <xref ref-type="fig" rid="fig-4">Figs. 4</xref> and <xref ref-type="fig" rid="fig-5">5</xref> show the performance comparisons of the GEMV and GEMV-T kernels, respectively. From <xref ref-type="fig" rid="fig-4">Fig. 4</xref>, we observe that our proposed GEMV kernel on K40c and GTX1070 always outperforms CUBLAS and the kernel of Gao et al. for all test matrices. On K40c and GTX1070, the average performance ratios of the proposed GEMV kernel versus CUBLAS are <inline-formula id="ieqn-79">
<!--<alternatives><inline-graphic xlink:href="ieqn-79.tif"/><tex-math id="tex-ieqn-79"><![CDATA[3.15 \times]]></tex-math>--><mml:math id="mml-ieqn-79"><mml:mn>3.15</mml:mn><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> and 1.12<inline-formula id="ieqn-80">
<!--<alternatives><inline-graphic xlink:href="ieqn-80.tif"/><tex-math id="tex-ieqn-80"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-80"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, respectively, and those of the proposed GEMV kernel versus Gao&#x2019;s GEMV kernel are 3.18<inline-formula id="ieqn-81">
<!--<alternatives><inline-graphic xlink:href="ieqn-81.tif"/><tex-math id="tex-ieqn-81"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-81"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> and 1.19<inline-formula id="ieqn-82">
<!--<alternatives><inline-graphic xlink:href="ieqn-82.tif"/><tex-math id="tex-ieqn-82"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-82"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, respectively. The proposed GEMV-T kernel outperforms CUBLAS and Gao&#x2019;s GEMV-T kernel for all test matrices on both GPUs, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>. On K40c and GTX1070, the average performance improvement is respectively 1.40<inline-formula id="ieqn-83">
<!--<alternatives><inline-graphic xlink:href="ieqn-83.tif"/><tex-math id="tex-ieqn-83"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-83"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> and 1.05 <inline-formula id="ieqn-84">
<!--<alternatives><inline-graphic xlink:href="ieqn-84.tif"/><tex-math id="tex-ieqn-84"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-84"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> compared to CUBLAS, and is respectively 1.09<inline-formula id="ieqn-85">
<!--<alternatives><inline-graphic xlink:href="ieqn-85.tif"/><tex-math id="tex-ieqn-85"><![CDATA[\times {\rm \; }]]></tex-math>--><mml:math id="mml-ieqn-85"><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>and 1.02<inline-formula id="ieqn-86">
<!--<alternatives><inline-graphic xlink:href="ieqn-86.tif"/><tex-math id="tex-ieqn-86"><![CDATA[\times {\rm \; }]]></tex-math>--><mml:math id="mml-ieqn-86"><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mspace width="thickmathspace"></mml:mspace></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>compared to Gao&#x2019;s GEMV-T kernel. These observations verify the effectiveness of the proposed GEMV and GEMV-T kernels.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Performance comparison of GEMV kernels</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-4.png"/>
</fig>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Performance comparison of GEMV-T kernels</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-5.png"/>
</fig>
<p>Second, we use GTX1070 to investigate whether the proposed GEMV and GEMV-T kernels can alleviate the performance fluctuations of CUBLAS. The test matrix sets are as follows: 1) Set 1: <italic>n</italic> &#x003D; 100,000 and <italic>m</italic> &#x003D; 50, 100, 150, <inline-formula id="ieqn-87">
<!--<alternatives><inline-graphic xlink:href="ieqn-87.tif"/><tex-math id="tex-ieqn-87"><![CDATA[\cdots]]></tex-math>--><mml:math id="mml-ieqn-87"><mml:mo>&#x22EF;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> , 5,000; 2) Set 2: <italic>m</italic> &#x003D; 1,000 and <italic>n</italic> &#x003D; 5,000, 10,000, 15,000, <inline-formula id="ieqn-88">
<!--<alternatives><inline-graphic xlink:href="ieqn-88.tif"/><tex-math id="tex-ieqn-88"><![CDATA[\cdots]]></tex-math>--><mml:math id="mml-ieqn-88"><mml:mo>&#x22EF;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> , 500,000.</p>
<p><xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <xref ref-type="fig" rid="fig-7">7</xref> show the performance curves of the GEMV and GEMV-T kernels for the matrices in the two sets, respectively. From <xref ref-type="fig" rid="fig-6">Fig. 6</xref>, we observe that whether <italic>m</italic> increases with <italic>n</italic> fixed to 100,000 or <italic>n</italic> increases with <italic>m</italic> fixed to 1,000, the CUBLAS performance fluctuates, and the gap between its maximum and minimum performance is significant. In contrast, our proposed GEMV kernel outperforms CUBLAS, and its performance remains almost constant at around 107 GFLOPS in all cases. Furthermore, for the test matrices in Set 1, the performance of our proposed GEMV-T kernel is maintained at around 107 GFLOPS as <italic>m</italic> increases, as shown in <xref ref-type="fig" rid="fig-7">Fig. 7(a)</xref>, whereas the CUBLAS performance fluctuates as <italic>m</italic> increases. For the test matrices in Set 2, as <italic>n</italic> increases, our proposed GEMV-T kernel matches the high performance of CUBLAS and always achieves around 107 GFLOPS.</p>
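<p>The GFLOPS figures quoted above follow the conventional operation count for a dense matrix&#x2013;vector product, roughly 2<italic>mn</italic> floating-point operations; a minimal illustration of this standard formula (the paper&#x2019;s exact counting convention may differ):</p>

```python
# Conventional GEMV throughput: y = A @ x over an m-by-n matrix performs
# about 2*m*n floating-point operations (one multiply and one add per entry).

def gemv_gflops(m, n, seconds):
    return 2.0 * m * n / seconds / 1e9
```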
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>GEMV (a) Performance curves with <italic>m</italic> (<italic>n</italic> &#x003D; 100,000) (b) performance curves with <italic>n</italic> (<italic>m</italic> &#x003D; 1,000)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-6.png"/>
</fig>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>GEMV-T (a) Performance curves with <italic>m</italic> (<italic>n</italic> &#x003D; 100,000) (b) performance curves with <italic>n</italic> (<italic>m</italic> &#x003D; 1,000)</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-7.png"/>
</fig>
<p>Based on the above observations, we conclude that our proposed GEMV and GEMV-T kernels improve on those suggested by Gao et al., achieve high performance, and are able to alleviate the performance fluctuations of CUBLAS.</p>
</sec>
<sec id="s6_2">
<label>6.2</label>
<title>Performance of Concurrent L1-min Solvers on a GPU</title>
<p>In this section, we test the performance of our proposed CFISTASOL-TB and CFISTASOL-SM. Given that the L1-min problems included in the concurrent L1-min problem can be computed independently, for comparison we use the FISTA implementation on the CPU using the BLAS library (denoted by BLAS), the FISTA implementation using the CUBLAS library (denoted by CUBLAS), and the FISTA solver (denoted by GAO) proposed in [<xref ref-type="bibr" rid="ref-17">17</xref>] to calculate them. Twelve test cases are used. For each test case, 60 L1-min problems are calculated concurrently, and the matrix <italic>A</italic> comes from <xref ref-type="table" rid="table-3">Tab. 3</xref>. For each L1-min problem, the initial <inline-formula id="ieqn-89">
<!--<alternatives><inline-graphic xlink:href="ieqn-89.tif"/><tex-math id="tex-ieqn-89"><![CDATA[{x_0}]]></tex-math>--><mml:math id="mml-ieqn-89"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula> with 1024 non-zero elements is randomly generated according to the normal distribution, and <inline-formula id="ieqn-90">
<!--<alternatives><inline-graphic xlink:href="ieqn-90.tif"/><tex-math id="tex-ieqn-90"><![CDATA[b = A{x_0}]]></tex-math>--><mml:math id="mml-ieqn-90"><mml:mi>b</mml:mi><mml:mo>&#x003D;</mml:mo><mml:mi>A</mml:mi><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math>
<!--</alternatives>--></inline-formula>. All algorithms stop once the number of iterations exceeds 100 for all test cases. <xref ref-type="table" rid="table-4">Tabs. 4</xref> and <xref ref-type="table" rid="table-5">5</xref> show the execution time of all algorithms on a K40c and a GTX1070, respectively. The time unit is second (denoted by <inline-formula id="ieqn-91">
<!--<alternatives><inline-graphic xlink:href="ieqn-91.tif"/><tex-math id="tex-ieqn-91"><![CDATA[s]]></tex-math>--><mml:math id="mml-ieqn-91"><mml:mi>s</mml:mi></mml:math>
<!--</alternatives>--></inline-formula>). In <xref ref-type="table" rid="table-4">Tabs. 4</xref> and <xref ref-type="table" rid="table-5">5</xref>, our proposed CFISTASOL-TB and CFISTASOL-SM are abbreviated as TB and SM, respectively.</p>
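<p>The test-problem construction described above can be sketched as follows (illustrative Python, not the paper&#x2019;s GPU code; the matrix sizes, random seed, and helper name are our assumptions):</p>

```python
import numpy as np

# Build one L1-min test problem as described in the text: a sparse x0 with
# a fixed number of normally distributed nonzeros, and b = A @ x0.

def make_l1min_problem(m, n, nnz=1024, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    A = rng.standard_normal((m, n))
    x0 = np.zeros(n)
    support = rng.choice(n, size=nnz, replace=False)  # random support set
    x0[support] = rng.standard_normal(nnz)            # normal nonzero values
    b = A @ x0
    return A, x0, b
```

In the experiments, 60 such problems share one coefficient matrix <italic>A</italic> and are solved concurrently.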

<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Execution time of all algorithms on a K40c (The time unit is <italic>s</italic>)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Prob</th>
<th>BLAS</th>
<th>CUBLAS</th>
<th>GAO</th>
<th>TB</th>
<th>SM</th>
<th><inline-formula id="ieqn-92">
<!--<alternatives><inline-graphic xlink:href="ieqn-92.tif"/><tex-math id="tex-ieqn-92"><![CDATA[\displaystyle{{{\rm BLAS}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-92"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-93">
<!--<alternatives><inline-graphic xlink:href="ieqn-93.tif"/><tex-math id="tex-ieqn-93"><![CDATA[\displaystyle{{{\rm CUBLAS}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-93"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">U</mml:mi><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-94">
<!--<alternatives><inline-graphic xlink:href="ieqn-94.tif"/><tex-math id="tex-ieqn-94"><![CDATA[\displaystyle{{{\rm GAO}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-94"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">G</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-95">
<!--<alternatives><inline-graphic xlink:href="ieqn-95.tif"/><tex-math id="tex-ieqn-95"><![CDATA[\displaystyle{{{\rm BLAS}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-95"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-96">
<!--<alternatives><inline-graphic xlink:href="ieqn-96.tif"/><tex-math id="tex-ieqn-96"><![CDATA[\displaystyle{{{\rm CUBLAS}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-96"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">U</mml:mi><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-97">
<!--<alternatives><inline-graphic xlink:href="ieqn-97.tif"/><tex-math id="tex-ieqn-97"><![CDATA[\displaystyle{{{\rm GAO}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-97"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">G</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>2962.98</td>
<td>1307.39</td>
<td>256.26</td>
<td>39.28</td>
<td>58.95</td>
<td>75.43</td>
<td>33.28</td>
<td>6.52</td>
<td>50.26</td>
<td>22.18</td>
<td>4.35</td>
</tr>
<tr>
<td>02</td>
<td>2203.57</td>
<td>847.17</td>
<td>167.35</td>
<td>38.42</td>
<td>40.05</td>
<td>57.35</td>
<td>22.05</td>
<td>4.36</td>
<td>55.02</td>
<td>21.15</td>
<td>4.18</td>
</tr>
<tr>
<td>03</td>
<td>2371.43</td>
<td>641.59</td>
<td>190.00</td>
<td>32.83</td>
<td>32.84</td>
<td>72.23</td>
<td>19.54</td>
<td>5.79</td>
<td>72.22</td>
<td>19.54</td>
<td>5.79</td>
</tr>
<tr>
<td>04</td>
<td>1832.85</td>
<td>436.34</td>
<td>134.57</td>
<td>31.11</td>
<td>35.29</td>
<td>58.92</td>
<td>14.03</td>
<td>4.33</td>
<td>51.93</td>
<td>12.36</td>
<td>3.81</td>
</tr>
<tr>
<td>05</td>
<td>2050.07</td>
<td>342.21</td>
<td>158.41</td>
<td>29.46</td>
<td>29.40</td>
<td>69.58</td>
<td>11.61</td>
<td>5.38</td>
<td>69.72</td>
<td>11.64</td>
<td>5.39</td>
</tr>
<tr>
<td>06</td>
<td>1670.28</td>
<td>246.36</td>
<td>120.09</td>
<td>29.03</td>
<td>29.42</td>
<td>57.53</td>
<td>8.49</td>
<td>4.14</td>
<td>56.77</td>
<td>8.37</td>
<td>4.08</td>
</tr>
<tr>
<td>07</td>
<td>1925.79</td>
<td>206.70</td>
<td>141.73</td>
<td>27.55</td>
<td>27.47</td>
<td>69.90</td>
<td>7.50</td>
<td>5.14</td>
<td>70.11</td>
<td>7.53</td>
<td>5.16</td>
</tr>
<tr>
<td>08</td>
<td>1568.12</td>
<td>223.16</td>
<td>126.04</td>
<td>28.12</td>
<td>29.09</td>
<td>55.77</td>
<td>7.94</td>
<td>4.48</td>
<td>53.90</td>
<td>7.67</td>
<td>4.33</td>
</tr>
<tr>
<td>09</td>
<td>1851.47</td>
<td>188.22</td>
<td>147.64</td>
<td>26.81</td>
<td>26.71</td>
<td>69.05</td>
<td>7.02</td>
<td>5.51</td>
<td>69.31</td>
<td>7.05</td>
<td>5.53</td>
</tr>
<tr>
<td>10</td>
<td>1558.87</td>
<td>159.74</td>
<td>115.13</td>
<td>28.09</td>
<td>28.74</td>
<td>55.49</td>
<td>5.69</td>
<td>4.10</td>
<td>54.24</td>
<td>5.56</td>
<td>4.01</td>
</tr>
<tr>
<td>11</td>
<td>1753.49</td>
<td>151.16</td>
<td>111.93</td>
<td>28.71</td>
<td>28.63</td>
<td>61.07</td>
<td>5.27</td>
<td>3.90</td>
<td>61.24</td>
<td>5.28</td>
<td>3.91</td>
</tr>
<tr>
<td>12</td>
<td>1487.32</td>
<td>121.96</td>
<td>114.12</td>
<td>29.15</td>
<td>29.16</td>
<td>51.03</td>
<td>4.18</td>
<td>3.92</td>
<td>51.02</td>
<td>4.18</td>
<td>3.92</td>
</tr>
</tbody>
</table>
</table-wrap>

<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Execution time of all algorithms on a GTX1070 (The time unit is <italic>s</italic>)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Prob</th>
<th>BLAS</th>
<th>CUBLAS</th>
<th>GAO</th>
<th>TB</th>
<th>SM</th>
<th><inline-formula id="ieqn-98">
<!--<alternatives><inline-graphic xlink:href="ieqn-98.tif"/><tex-math id="tex-ieqn-98"><![CDATA[\displaystyle{{{\rm BLAS}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-98"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-99">
<!--<alternatives><inline-graphic xlink:href="ieqn-99.tif"/><tex-math id="tex-ieqn-99"><![CDATA[\displaystyle{{{\rm CUBLAS}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-99"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">U</mml:mi><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-100">
<!--<alternatives><inline-graphic xlink:href="ieqn-100.tif"/><tex-math id="tex-ieqn-100"><![CDATA[\displaystyle{{{\rm GAO}} \over {{\rm TB}}}]]></tex-math>--><mml:math id="mml-ieqn-100"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">G</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">T</mml:mi><mml:mi mathvariant="normal">B</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-101">
<!--<alternatives><inline-graphic xlink:href="ieqn-101.tif"/><tex-math id="tex-ieqn-101"><![CDATA[\displaystyle{{{\rm BLAS}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-101"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-102">
<!--<alternatives><inline-graphic xlink:href="ieqn-102.tif"/><tex-math id="tex-ieqn-102"><![CDATA[\displaystyle{{{\rm CUBLAS}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-102"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">C</mml:mi><mml:mi mathvariant="normal">U</mml:mi><mml:mi mathvariant="normal">B</mml:mi><mml:mi mathvariant="normal">L</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
<th><inline-formula id="ieqn-103">
<!--<alternatives><inline-graphic xlink:href="ieqn-103.tif"/><tex-math id="tex-ieqn-103"><![CDATA[\displaystyle{{{\rm GAO}} \over {{\rm SM}}}]]></tex-math>--><mml:math id="mml-ieqn-103"><mml:mstyle scriptlevel="0" displaystyle="true"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mi mathvariant="normal">G</mml:mi><mml:mi mathvariant="normal">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="normal">S</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math>
<!--</alternatives>--></inline-formula></th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>2962.98</td>
<td>413.46</td>
<td>156.09</td>
<td>18.84</td>
<td>29.76</td>
<td>157.29</td>
<td>21.95</td>
<td>8.29</td>
<td>99.57</td>
<td>13.89</td>
<td>5.25</td>
</tr>
<tr>
<td>02</td>
<td>2203.57</td>
<td>345.98</td>
<td>123.19</td>
<td>19.98</td>
<td>21.26</td>
<td>110.29</td>
<td>17.32</td>
<td>6.17</td>
<td>103.66</td>
<td>16.28</td>
<td>5.8</td>
</tr>
<tr>
<td>03</td>
<td>2371.43</td>
<td>300.18</td>
<td>129.08</td>
<td>16.24</td>
<td>17.38</td>
<td>146.03</td>
<td>18.48</td>
<td>7.95</td>
<td>136.47</td>
<td>17.28</td>
<td>7.43</td>
</tr>
<tr>
<td>04</td>
<td>1832.85</td>
<td>269.14</td>
<td>110.25</td>
<td>16.11</td>
<td>19.84</td>
<td>113.74</td>
<td>16.7</td>
<td>6.84</td>
<td>92.36</td>
<td>13.56</td>
<td>5.56</td>
</tr>
<tr>
<td>05</td>
<td>2050.07</td>
<td>259.65</td>
<td>116.27</td>
<td>14.99</td>
<td>16.2</td>
<td>136.73</td>
<td>17.32</td>
<td>7.75</td>
<td>126.56</td>
<td>16.03</td>
<td>7.18</td>
</tr>
<tr>
<td>06</td>
<td>1670.28</td>
<td>416.32</td>
<td>102.91</td>
<td>15.02</td>
<td>16.61</td>
<td>111.18</td>
<td>27.71</td>
<td>6.85</td>
<td>100.54</td>
<td>25.06</td>
<td>6.19</td>
</tr>
<tr>
<td>07</td>
<td>1925.79</td>
<td>473.56</td>
<td>101.74</td>
<td>14.04</td>
<td>15.24</td>
<td>137.19</td>
<td>33.74</td>
<td>7.25</td>
<td>126.37</td>
<td>31.07</td>
<td>6.68</td>
</tr>
<tr>
<td>08</td>
<td>1568.12</td>
<td>461.44</td>
<td>99.67</td>
<td>14.63</td>
<td>16.81</td>
<td>107.21</td>
<td>31.55</td>
<td>6.81</td>
<td>93.26</td>
<td>27.44</td>
<td>5.93</td>
</tr>
<tr>
<td>09</td>
<td>1851.47</td>
<td>99.26</td>
<td>100.29</td>
<td>13.66</td>
<td>14.57</td>
<td>135.55</td>
<td>7.27</td>
<td>7.34</td>
<td>127.04</td>
<td>6.81</td>
<td>6.88</td>
</tr>
<tr>
<td>10</td>
<td>1558.87</td>
<td>130.62</td>
<td>97.37</td>
<td>13.61</td>
<td>15.39</td>
<td>114.55</td>
<td>9.6</td>
<td>7.16</td>
<td>101.26</td>
<td>8.48</td>
<td>6.33</td>
</tr>
<tr>
<td>11</td>
<td>1753.49</td>
<td>95.84</td>
<td>96.58</td>
<td>13.29</td>
<td>14.22</td>
<td>131.97</td>
<td>7.21</td>
<td>7.27</td>
<td>123.32</td>
<td>6.74</td>
<td>6.79</td>
</tr>
<tr>
<td>12</td>
<td>1487.32</td>
<td>95.05</td>
<td>95.7</td>
<td>13.25</td>
<td>14.66</td>
<td>112.25</td>
<td>7.17</td>
<td>7.22</td>
<td>101.47</td>
<td>6.48</td>
<td>6.53</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On both GPUs, TB and SM outperform BLAS, CUBLAS and GAO for all test cases (<xref ref-type="table" rid="table-4">Tabs. 4</xref> and <xref ref-type="table" rid="table-5">5</xref>). On a K40c, the execution time ratios of CUBLAS to TB range from 4.18 to 33.28, with an average of 12.22; the ratios of GAO to TB range from 3.92 to 6.52, with an average of 4.80; the ratios of CUBLAS to SM range from 4.18 to 22.18, with an average of 11.04; and the ratios of GAO to SM range from 3.92 to 5.79, with an average of 4.54. Furthermore, we observe that TB is slightly better than SM in most cases. On a GTX1070, we reach the same conclusion from <xref ref-type="table" rid="table-5">Tab. 5</xref>. The average execution time ratios of CUBLAS to TB and CUBLAS to SM are 18.00 and 15.76, respectively, and the average execution time ratios of GAO to TB and GAO to SM are 7.24 and 6.38, respectively. In particular, compared to the CPU solver BLAS, TB obtains average speedups of 62.78<inline-formula id="ieqn-104">
<!--<alternatives><inline-graphic xlink:href="ieqn-104.tif"/><tex-math id="tex-ieqn-104"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-104"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> and 126.16<inline-formula id="ieqn-105">
<!--<alternatives><inline-graphic xlink:href="ieqn-105.tif"/><tex-math id="tex-ieqn-105"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-105"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, and SM obtains average speedups of 59.65<inline-formula id="ieqn-106">
<!--<alternatives><inline-graphic xlink:href="ieqn-106.tif"/><tex-math id="tex-ieqn-106"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-106"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula> and 110.99<inline-formula id="ieqn-107">
<!--<alternatives><inline-graphic xlink:href="ieqn-107.tif"/><tex-math id="tex-ieqn-107"><![CDATA[\times]]></tex-math>--><mml:math id="mml-ieqn-107"><mml:mo>&#x00D7;</mml:mo></mml:math>
<!--</alternatives>--></inline-formula>, on a K40c and a GTX1070, respectively (<xref ref-type="fig" rid="fig-8">Fig. 8</xref>). These observations verify that our proposed TB and SM achieve high performance and parallelism.</p>
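<p>The averages quoted above are per-case execution-time ratios averaged over the 12 test cases; a minimal sketch of that computation (the sample values below are made up, not taken from the tables):</p>

```python
# Average of per-case execution-time ratios baseline/solver, as used for
# the speedup summaries in the text.

def average_ratio(baseline_times, solver_times):
    ratios = [b / s for b, s in zip(baseline_times, solver_times)]
    return sum(ratios) / len(ratios)
```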
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Speedups of TB and SM on a GPU</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-8.png"/>
</fig>
</sec>
<sec id="s6_3">
<label>6.3</label>
<title>Performance of the Concurrent L1-min Solver on Multiple GPUs</title>
<p>We use two GPUs and four GPUs as examples to test the performance of our proposed CFISTASOL-MGPU. The test setting is the same as in Section 6.2. <xref ref-type="fig" rid="fig-9">Fig. 9</xref> shows the speedups of CFISTASOL-MGPU versus BLAS on the K40c and GTX1070 GPUs.</p>
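<p>A minimal sketch of the work distribution one would expect in a multi-GPU solver such as CFISTASOL-MGPU (our reading; the authors&#x2019; partitioning scheme may differ): each GPU is assigned a contiguous block of the independent L1-min problems.</p>

```python
# Block-partition num_problems independent L1-min problems across num_gpus
# devices; each slice (start, end) is solved on one GPU.

def partition_problems(num_problems, num_gpus):
    base, extra = divmod(num_problems, num_gpus)
    slices, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < extra else 0)  # spread the remainder evenly
        slices.append((start, start + size))
        start += size
    return slices
```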
<fig id="fig-9">
<label>Figure 9</label>
<caption>
<title>Speedups of CFISTASOL-MGPU on multiple GPUs</title>
</caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="fig-9.png"/>
</fig>
<p>From <xref ref-type="fig" rid="fig-9">Fig. 9</xref>, we can observe that the speedups of CFISTASOL-MGPU versus BLAS on the two K40c GPUs range from 30.53 to 50.97 for all test cases, and the average speedup is 39.69. On the four K40c GPUs, the speedups of CFISTASOL-MGPU versus BLAS range from 61.06 to 101.9 for all test cases, and the average speedup is 79.38. On the two GTX1070 GPUs, the minimum and maximum speedups of CFISTASOL-MGPU versus BLAS for all test cases are 49.7 and 64.81, respectively, and the average speedup is 54.22. On the four GTX1070 GPUs, the minimum and maximum speedups of CFISTASOL-MGPU versus BLAS for all test cases are 99.39 and 129.6, respectively, and the average speedup is 108.4. All observations show that CFISTASOL-MGPU is effective for solving the concurrent L1-min problem, and has high parallelism.</p>
</sec>
<sec id="s6_4">
<label>6.4</label>
<title>Discussion</title>
<p>From the experimental results, we can observe that CFISTASOL-TB is slightly better than CFISTASOL-SM, and that CFISTASOL-SM outperforms CFISTASOL-MGPU. In fact, each of these solvers has its own advantage. <xref ref-type="table" rid="table-6">Tab. 6</xref> lists the experimental results for different numbers of L1-min problems included in the concurrent L1-min problem on the GTX1070. In this experiment, Mat12 is set as the coefficient matrix in the test concurrent L1-min problem, and the test setting is the same as in Section 6.2. For the GTX1070, the maximum number of thread blocks that can be launched is 30 when <italic>nt</italic> &#x003D; 1024, and the number of streams is 15. When the number of L1-min problems is large enough to keep all thread blocks busy, CFISTASOL-TB is better than the other two algorithms (see the first problem in <xref ref-type="table" rid="table-6">Tab. 6</xref>). Otherwise, CFISTASOL-SM outperforms the other two algorithms if the number of L1-min problems is large enough to keep all streams busy (see the second problem in <xref ref-type="table" rid="table-6">Tab. 6</xref>). When the number of L1-min problems is much smaller than both the number of thread blocks and the number of streams, CFISTASOL-MGPU is the best of the three (see the third problem in <xref ref-type="table" rid="table-6">Tab. 6</xref>). Therefore, better algorithms can be obtained by combining the three solvers so as to exploit their respective advantages. For example, on four GTX1070 GPUs, assume that the number of L1-min problems included in the concurrent L1-min problem is 128: the first 120 problems are calculated in parallel by letting each GPU execute CFISTASOL-TB, and the remaining 8 problems are computed in parallel by executing CFISTASOL-MGPU.</p>
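<p>The selection rule suggested by these observations can be sketched as a simple dispatch heuristic (a hypothetical helper, not from the paper; the thresholds 30 and 15 are the GTX1070 figures stated above):</p>

```python
# Choose a solver for a batch of concurrent L1-min problems based on how
# much of the GPU's thread-block and stream capacity the batch can occupy.

def choose_solver(num_problems, max_blocks=30, num_streams=15):
    if num_problems >= max_blocks:
        return "CFISTASOL-TB"   # enough work to keep all thread blocks busy
    if num_problems >= num_streams:
        return "CFISTASOL-SM"   # enough work to keep all streams busy
    return "CFISTASOL-MGPU"     # few problems: spread each across GPUs
```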

<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>Execution time of algorithms (The time unit is <italic>s</italic>)</title>
</caption>
<table>
<colgroup>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead>
<tr>
<th>Num</th>
<th>CFISTASOL-TB</th>
<th>CFISTASOL-SM</th>
<th>CFISTASOL-MGPU</th>
</tr>
</thead>
<tbody>
<tr>
<td>30</td>
<td>6.6251</td>
<td>7.3302</td>
<td>7.8149</td>
</tr>
<tr>
<td>15</td>
<td>6.4342</td>
<td>3.6253</td>
<td>3.8074</td>
</tr>
<tr>
<td>4</td>
<td>6.0126</td>
<td>3.1398</td>
<td>0.9768</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s7">
<label>7</label>
<title>Conclusion</title>
<p>We investigate how to solve the concurrent L1-min problem in this paper, and present two concurrent L1-min solvers on a GPU and a concurrent L1-min solver on multiple GPUs. Experimental results show that our proposed concurrent L1-min solvers are effective, and have high parallelism.</p>
<p>In future work, we will continue research in this field and apply the proposed algorithms to more practical problems in order to further improve them.</p>
</sec>
</body>
<back><fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> The research has been supported by the Natural Science Foundation of China under grant number 61872422, and the Natural Science Foundation of Zhejiang Province, China under grant number LY19F020028.</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> We declare that there are no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1">
<label>[1]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Tropp</surname></string-name>
</person-group>, &#x201C;
<article-title>Just relax: convex programming methods for subset selection and sparse approximation</article-title>,&#x201D; 
<source>IEEE Trans. on Information Theory</source>, vol. 
<volume>52</volume>, no. 
<issue>3</issue>, pp. 
<fpage>1030</fpage>&#x2013;
<lpage>1051</lpage>, 
<year>2006</year>.</mixed-citation>
</ref>
<ref id="ref-2">
<label>[2]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>A.</given-names> 
<surname>Bruckstein</surname></string-name>, <string-name>
<given-names>D.</given-names> 
<surname>Donoho</surname></string-name> and <string-name>
<given-names>M.</given-names> 
<surname>Elad</surname></string-name>
</person-group>, &#x201C;
<article-title>From sparse solutions of systems of equations to sparse modeling of signals and images</article-title>,&#x201D; 
<source>SIAM Review</source>, vol. 
<volume>51</volume>, no. 
<issue>1</issue>, pp. 
<fpage>34</fpage>&#x2013;
<lpage>81</lpage>, 
<year>2009</year>.</mixed-citation>
</ref>
<ref id="ref-3">
<label>[3]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>G.</given-names> 
<surname>Ravikanth</surname></string-name>, <string-name>
<given-names>K. V. N.</given-names> 
<surname>Sunitha</surname></string-name> and <string-name>
<given-names>B. E.</given-names> 
<surname>Reddy</surname></string-name>
</person-group>, &#x201C;
<article-title>Location related signals with satellite image fusion method using visual image integration method</article-title>,&#x201D; 
<source>Computer Systems Science and Engineering</source>, vol. 
<volume>35</volume>, no. 
<issue>5</issue>, pp. 
<fpage>385</fpage>&#x2013;
<lpage>393</lpage>, 
<year>2020</year>.</mixed-citation>
</ref>
<ref id="ref-4">
<label>[4]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>E.</given-names> 
<surname>Elhamifar</surname></string-name> and <string-name>
<given-names>R.</given-names> 
<surname>Vidal</surname></string-name>
</person-group>, &#x201C;
<article-title>Sparse subspace clustering: algorithm, theory, and applications</article-title>,&#x201D; 
<source>IEEE Trans. on Pattern Analysis and Machine Intelligence</source>, vol. 
<volume>35</volume>, no. 
<issue>11</issue>, pp. 
<fpage>2765</fpage>&#x2013;
<lpage>2781</lpage>, 
<year>2013</year>.</mixed-citation>
</ref>
<ref id="ref-5">
<label>[5]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>W.</given-names> 
<surname>Sun</surname></string-name>, <string-name>
<given-names>X.</given-names> 
<surname>Zhang</surname></string-name>, <string-name>
<given-names>S.</given-names> 
<surname>Peeta</surname></string-name>, <string-name>
<given-names>X.</given-names> 
<surname>He</surname></string-name> and <string-name>
<given-names>Y.</given-names> 
<surname>Li</surname></string-name>
</person-group>, &#x201C;
<article-title>A real-time fatigue driving recognition method incorporating contextual features and two fusion levels</article-title>,&#x201D; 
<source>IEEE Trans. on Intelligent Transportation Systems</source>, vol. 
<volume>18</volume>, no. 
<issue>12</issue>, pp. 
<fpage>3408</fpage>&#x2013;
<lpage>3420</lpage>, 
<year>2017</year>.</mixed-citation>
</ref>
<ref id="ref-6">
<label>[6]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>G.</given-names> 
<surname>Zhang</surname></string-name>, <string-name>
<given-names>H.</given-names> 
<surname>Sun</surname></string-name>, <string-name>
<given-names>Y.</given-names> 
<surname>Zheng</surname></string-name>, <string-name>
<given-names>G.</given-names> 
<surname>Xia</surname></string-name>, <string-name>
<given-names>L.</given-names> 
<surname>Feng</surname></string-name> <etal>et al.</etal>
</person-group><italic>,</italic> &#x201C;
<article-title>Optimal discriminative projection for sparse representation-based classification via bilevel optimization</article-title>,&#x201D; 
<source>IEEE Trans. on Circuits and Systems for Video Technology</source>, vol. 
<volume>30</volume>, no. 
<issue>4</issue>, pp. 
<fpage>1065</fpage>&#x2013;
<lpage>1077</lpage>, 
<year>2020</year>.</mixed-citation>
</ref>
<ref id="ref-7">
<label>[7]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>X.</given-names> 
<surname>Zhang</surname></string-name> and <string-name>
<given-names>H.</given-names> 
<surname>Wu</surname></string-name>
</person-group>, &#x201C;
<article-title>An optimized mass-spring model with shape restoration ability based on volume conservation</article-title>,&#x201D; 
<source>KSII Trans. on Internet and Information Systems</source>, vol. 
<volume>14</volume>, no. 
<issue>3</issue>, pp. 
<fpage>1738</fpage>&#x2013;
<lpage>1756</lpage>, 
<year>2020</year>.</mixed-citation>
</ref>
<ref id="ref-8">
<label>[8]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>X.</given-names> 
<surname>Zhang</surname></string-name>, <string-name>
<given-names>X.</given-names> 
<surname>Yu</surname></string-name>, <string-name>
<given-names>W.</given-names> 
<surname>Sun</surname></string-name> and <string-name>
<given-names>A.</given-names> 
<surname>Song</surname></string-name>
</person-group>, &#x201C;
<article-title>An optimized model for the local compression deformation of soft tissue</article-title>,&#x201D; 
<source>KSII Trans. on Internet and Information Systems</source>, vol. 
<volume>14</volume>, no. 
<issue>2</issue>, pp. 
<fpage>671</fpage>&#x2013;
<lpage>686</lpage>, 
<year>2020</year>.</mixed-citation>
</ref>
<ref id="ref-9">
<label>[9]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Wright</surname></string-name>, <string-name>
<given-names>Y.</given-names> 
<surname>Ma</surname></string-name>, <string-name>
<given-names>J.</given-names> 
<surname>Mairal</surname></string-name> and <string-name>
<given-names>G.</given-names> 
<surname>Sapiro</surname></string-name>
</person-group>, &#x201C;
<article-title>Sparse representation for computer vision and pattern recognition</article-title>,&#x201D; 
<source>Proc. of the IEEE</source>, vol. 
<volume>98</volume>, no. 
<issue>6</issue>, pp. 
<fpage>1031</fpage>&#x2013;
<lpage>1044</lpage>, 
<year>2010</year>.</mixed-citation>
</ref>
<ref id="ref-10">
<label>[10]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>L.</given-names> 
<surname>He</surname></string-name>, <string-name>
<given-names>H.</given-names> 
<surname>Bai</surname></string-name>, <string-name>
<given-names>D.</given-names> 
<surname>Ouyang</surname></string-name>, <string-name>
<given-names>C.</given-names> 
<surname>Wang</surname></string-name>, <string-name>
<given-names>C.</given-names> 
<surname>Wang</surname></string-name> <etal>et al.</etal>
</person-group>, &#x201C;
<article-title>Satellite cloud-derived wind inversion algorithm using GPU</article-title>,&#x201D; 
<source>Computers, Materials &#x0026; Continua</source>, vol. 
<volume>60</volume>, no. 
<issue>2</issue>, pp. 
<fpage>599</fpage>&#x2013;
<lpage>613</lpage>, 
<year>2019</year>.</mixed-citation>
</ref>
<ref id="ref-11">
<label>[11]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>Y.</given-names> 
<surname>Guo</surname></string-name>, <string-name>
<given-names>Z.</given-names> 
<surname>Cui</surname></string-name>, <string-name>
<given-names>Z.</given-names> 
<surname>Yang</surname></string-name>, <string-name>
<given-names>X.</given-names> 
<surname>Wu</surname></string-name> and <string-name>
<given-names>S.</given-names> 
<surname>Madani</surname></string-name>
</person-group>, &#x201C;
<article-title>Non-local DWI image super-resolution with joint information based on GPU implementation</article-title>,&#x201D; 
<source>Computers, Materials &#x0026; Continua</source>, vol. 
<volume>61</volume>, no. 
<issue>3</issue>, pp. 
<fpage>1205</fpage>&#x2013;
<lpage>1215</lpage>, 
<year>2019</year>.</mixed-citation>
</ref>
<ref id="ref-12">
<label>[12]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>T.</given-names> 
<surname>Chang</surname></string-name>, <string-name>
<given-names>C.</given-names> 
<surname>Chen</surname></string-name>, <string-name>
<given-names>H.</given-names> 
<surname>Hsiao</surname></string-name> and <string-name>
<given-names>G.</given-names> 
<surname>Lai</surname></string-name>
</person-group>, &#x201C;
<article-title>Cracking of WPA &#x0026; WPA2 using GPUs and rule-based method</article-title>,&#x201D; 
<source>Intelligent Automation &#x0026; Soft Computing</source>, vol. 
<volume>25</volume>, no. 
<issue>1</issue>, pp. 
<fpage>183</fpage>&#x2013;
<lpage>192</lpage>, 
<year>2019</year>.</mixed-citation>
</ref>
<ref id="ref-13">
<label>[13]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>G.</given-names> 
<surname>He</surname></string-name>, <string-name>
<given-names>R.</given-names> 
<surname>Yin</surname></string-name> and <string-name>
<given-names>J.</given-names> 
<surname>Gao</surname></string-name>
</person-group>, &#x201C;
<article-title>An efficient sparse approximate inverse preconditioning algorithm on GPU</article-title>,&#x201D; 
<source>Concurrency and Computation-Practice &#x0026; Experience</source>, vol. 
<volume>32</volume>, no. 
<issue>7</issue>, pp. 
<fpage>e5598</fpage>, 
<year>2020</year>.</mixed-citation>
</ref>
<ref id="ref-14">
<label>[14]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Gao</surname></string-name>, <string-name>
<given-names>Q.</given-names> 
<surname>Chen</surname></string-name> and <string-name>
<given-names>G.</given-names> 
<surname>He</surname></string-name>
</person-group>, &#x201C;
<article-title>A thread-adaptive sparse approximate inverse preconditioning algorithm on multi-GPUs</article-title>,&#x201D; 
<source>Parallel Computing</source>, vol. 
<volume>101</volume>, pp. <fpage>102724</fpage>, 
<year>2021</year>.</mixed-citation>
</ref>
<ref id="ref-15">
<label>[15]</label><mixed-citation publication-type="other">
<person-group person-group-type="author">
<collab>NVIDIA</collab>
</person-group>, &#x201C;
<article-title>CUDA C Programming Guide v10.1</article-title>.&#x201D; 
<year>2019</year>. [Online]. Available at: <uri>http://docs.nvidia.com/cuda/cuda-c-programming-guide</uri>.</mixed-citation>
</ref>
<ref id="ref-16">
<label>[16]</label><mixed-citation publication-type="conf-proc">
<person-group person-group-type="author"><string-name>
<given-names>V.</given-names> 
<surname>Shia</surname></string-name>, <string-name>
<given-names>A.</given-names> 
<surname>Yang</surname></string-name>, <string-name>
<given-names>S.</given-names> 
<surname>Sastry</surname></string-name>, <string-name>
<given-names>A.</given-names> 
<surname>Wagner</surname></string-name> and <string-name>
<given-names>Y.</given-names> 
<surname>Ma</surname></string-name>
</person-group>, &#x201C;
<article-title>Fast <italic>l</italic><sub>1</sub>-minimization and parallelization for face recognition</article-title>,&#x201D; in <conf-name>Conf. Record of the Forty Fifth Asilomar Conf. on Signals, Systems and Computers (ASILOMAR&#x2019;11)</conf-name>, 
<conf-loc>Piscataway, NJ</conf-loc>: 
<conf-name>IEEE</conf-name>, pp. 
<fpage>1199</fpage>&#x2013;
<lpage>1203</lpage>, 
<year>2011</year>.</mixed-citation>
</ref>
<ref id="ref-17">
<label>[17]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Gao</surname></string-name>, <string-name>
<given-names>Z.</given-names> 
<surname>Li</surname></string-name>, <string-name>
<given-names>R.</given-names> 
<surname>Liang</surname></string-name> and <string-name>
<given-names>G.</given-names> 
<surname>He</surname></string-name>
</person-group>, &#x201C;
<article-title>Adaptive optimization <italic>l</italic><sub>1</sub>-minimization solvers on GPU</article-title>,&#x201D; 
<source>Int. Journal of Parallel Programming</source>, vol. 
<volume>45</volume>, no. 
<issue>3</issue>, pp. 
<fpage>508</fpage>&#x2013;
<lpage>529</lpage>, 
<year>2017</year>.</mixed-citation>
</ref>
<ref id="ref-18">
<label>[18]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>M.</given-names> 
<surname>Figueiredo</surname></string-name>, <string-name>
<given-names>R.</given-names> 
<surname>Nowak</surname></string-name> and <string-name>
<given-names>S.</given-names> 
<surname>Wright</surname></string-name>
</person-group>, &#x201C;
<article-title>Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems</article-title>,&#x201D; 
<source>IEEE Journal of Selected Topics in Signal Processing</source>, vol. 
<volume>1</volume>, no. 
<issue>4</issue>, pp. 
<fpage>586</fpage>&#x2013;
<lpage>597</lpage>, 
<year>2007</year>.</mixed-citation>
</ref>
<ref id="ref-19">
<label>[19]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>S.</given-names> 
<surname>Kim</surname></string-name>, <string-name>
<given-names>K.</given-names> 
<surname>Koh</surname></string-name>, <string-name>
<given-names>M.</given-names> 
<surname>Lustig</surname></string-name>, <string-name>
<given-names>S.</given-names> 
<surname>Boyd</surname></string-name> and <string-name>
<given-names>D.</given-names> 
<surname>Gorinevsky</surname></string-name>
</person-group>, &#x201C;
<article-title>An interior-point method for large-scale <italic>l</italic><sub>1</sub>-regularized least squares</article-title>,&#x201D; 
<source>IEEE Journal of Selected Topics in Signal Processing</source>, vol. 
<volume>1</volume>, no. 
<issue>4</issue>, pp. 
<fpage>606</fpage>&#x2013;
<lpage>617</lpage>, 
<year>2007</year>.</mixed-citation>
</ref>
<ref id="ref-20">
<label>[20]</label><mixed-citation publication-type="other">
<person-group person-group-type="author"><string-name>
<given-names>D.</given-names> 
<surname>Donoho</surname></string-name> and <string-name>
<given-names>Y.</given-names> 
<surname>Tsaig</surname></string-name>
</person-group>, &#x201C;
<article-title>Fast solution of <italic>l</italic><sub>1</sub>-norm minimization problems when the solution may be sparse</article-title>,&#x201D; 
<publisher-name>Stanford University</publisher-name>, 
<comment>Technical Report</comment>, 
<year>2006</year>.</mixed-citation>
</ref>
<ref id="ref-21">
<label>[21]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>A.</given-names> 
<surname>Yang</surname></string-name>, <string-name>
<given-names>Z.</given-names> 
<surname>Zhou</surname></string-name>, <string-name>
<given-names>A.</given-names> 
<surname>Balasubramanian</surname></string-name>, <string-name>
<given-names>S.</given-names> 
<surname>Sastry</surname></string-name> and <string-name>
<given-names>Y.</given-names> 
<surname>Ma</surname></string-name>
</person-group>, &#x201C;
<article-title>Fast <italic>l</italic><sub>1</sub>-minimization algorithms for robust face recognition</article-title>,&#x201D; 
<source>IEEE Transactions on Image Processing</source>, vol. 
<volume>22</volume>, no. 
<issue>8</issue>, pp. 
<fpage>3234</fpage>&#x2013;
<lpage>3246</lpage>, 
<year>2013</year>.</mixed-citation>
</ref>
<ref id="ref-22">
<label>[22]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>A.</given-names> 
<surname>Beck</surname></string-name> and <string-name>
<given-names>M.</given-names> 
<surname>Teboulle</surname></string-name>
</person-group>, &#x201C;
<article-title>A fast iterative shrinkage-thresholding algorithm for linear inverse problems</article-title>,&#x201D; 
<source>SIAM Journal on Imaging Sciences</source>, vol. 
<volume>2</volume>, no. 
<issue>1</issue>, pp. 
<fpage>183</fpage>&#x2013;
<lpage>202</lpage>, 
<year>2009</year>.</mixed-citation>
</ref>
<ref id="ref-23">
<label>[23]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>S.</given-names> 
<surname>Boyd</surname></string-name>, <string-name>
<given-names>N.</given-names> 
<surname>Parikh</surname></string-name>, <string-name>
<given-names>E.</given-names> 
<surname>Chu</surname></string-name>, <string-name>
<given-names>B.</given-names> 
<surname>Peleato</surname></string-name> and <string-name>
<given-names>J.</given-names> 
<surname>Eckstein</surname></string-name>
</person-group>, &#x201C;
<article-title>Distributed optimization and statistical learning via the alternating direction method of multipliers</article-title>,&#x201D; 
<source>Foundations and Trends in Machine Learning</source>, vol. 
<volume>3</volume>, no. 
<issue>1</issue>, pp. 
<fpage>1</fpage>&#x2013;
<lpage>122</lpage>, 
<year>2011</year>.</mixed-citation>
</ref>
<ref id="ref-24">
<label>[24]</label><mixed-citation publication-type="conf-proc">
<person-group person-group-type="author"><string-name>
<given-names>R.</given-names> 
<surname>Nath</surname></string-name>, <string-name>
<given-names>S.</given-names> 
<surname>Tomov</surname></string-name>, <string-name>
<given-names>T.</given-names> 
<surname>Dong</surname></string-name> and <string-name>
<given-names>J.</given-names> 
<surname>Dongarra</surname></string-name>
</person-group>, &#x201C;
<article-title>Optimizing symmetric dense matrix-vector multiplication on GPUs</article-title>,&#x201D; in <conf-name>Proc. of 2011 Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC&#x2019;11), ACM</conf-name>, 
<conf-loc>New York, NY, USA</conf-loc>, pp. 
<fpage>1</fpage>&#x2013;
<lpage>10</lpage>, 
<year>2011</year>.</mixed-citation>
</ref>
<ref id="ref-25">
<label>[25]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>A.</given-names> 
<surname>Abdelfattah</surname></string-name>, <string-name>
<given-names>D.</given-names> 
<surname>Keyes</surname></string-name> and <string-name>
<given-names>H.</given-names> 
<surname>Ltaief</surname></string-name>
</person-group>, &#x201C;
<article-title>KBLAS: An optimized library for dense matrix-vector multiplication on GPU accelerators</article-title>,&#x201D; 
<source>ACM Transactions on Mathematical Software</source>, vol. 
<volume>42</volume>, no. 
<issue>3</issue>, pp. 
<fpage>1</fpage>&#x2013;
<lpage>31</lpage>, 
<year>2016</year>.</mixed-citation>
</ref>
<ref id="ref-26">
<label>[26]</label><mixed-citation publication-type="other">
<person-group person-group-type="author">
<collab>NVIDIA</collab>
</person-group>, &#x201C;
<article-title>CUBLAS Library v10.1</article-title>.&#x201D; 
<year>2019</year>. [Online]. Available at: <uri>http://docs.nvidia.com/cuda/cublas</uri>.</mixed-citation>
</ref>
<ref id="ref-27">
<label>[27]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>S.</given-names> 
<surname>Chen</surname></string-name>, <string-name>
<given-names>D.</given-names> 
<surname>Donoho</surname></string-name> and <string-name>
<given-names>M.</given-names> 
<surname>Saunders</surname></string-name>
</person-group>, &#x201C;
<article-title>Atomic decomposition by basis pursuit</article-title>,&#x201D; 
<source>SIAM Journal on Scientific Computing</source>, vol. 
<volume>20</volume>, no. 
<issue>1</issue>, pp. 
<fpage>33</fpage>&#x2013;
<lpage>61</lpage>, 
<year>1998</year>.</mixed-citation>
</ref>
<ref id="ref-28">
<label>[28]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>R.</given-names> 
<surname>Tibshirani</surname></string-name>
</person-group>, &#x201C;
<article-title>Regression shrinkage and selection via the lasso</article-title>,&#x201D; 
<source>Journal of the Royal Statistical Society Series B</source>, vol. 
<volume>58</volume>, no. 
<issue>1</issue>, pp. 
<fpage>267</fpage>&#x2013;
<lpage>288</lpage>, 
<year>1996</year>.</mixed-citation>
</ref>
<ref id="ref-29">
<label>[29]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Gao</surname></string-name>, <string-name>
<given-names>Y.</given-names> 
<surname>Wang</surname></string-name>, <string-name>
<given-names>J.</given-names> 
<surname>Wang</surname></string-name> and <string-name>
<given-names>R.</given-names> 
<surname>Liang</surname></string-name>
</person-group>, &#x201C;
<article-title>Adaptive optimization modeling of preconditioned conjugate gradient on multi-GPUs</article-title>,&#x201D; 
<source>ACM Transactions on Parallel Computing</source>, vol. 
<volume>3</volume>, no. 
<issue>3</issue>, pp. 
<fpage>1</fpage>&#x2013;
<lpage>33</lpage>, 
<year>2016</year>.</mixed-citation>
</ref>
<ref id="ref-30">
<label>[30]</label><mixed-citation publication-type="journal">
<person-group person-group-type="author"><string-name>
<given-names>J.</given-names> 
<surname>Gao</surname></string-name>, <string-name>
<given-names>Y.</given-names> 
<surname>Zhou</surname></string-name>, <string-name>
<given-names>G.</given-names> 
<surname>He</surname></string-name> and <string-name>
<given-names>Y.</given-names> 
<surname>Xia</surname></string-name>
</person-group>, &#x201C;
<article-title>A multi-GPU parallel optimization model for the preconditioned conjugate gradient algorithm</article-title>,&#x201D; 
<source>Parallel Computing</source>, vol. 
<volume>63</volume>, pp. 
<fpage>1</fpage>&#x2013;
<lpage>16</lpage>, 
<year>2017</year>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>